how to extract figures from the pdf?
PeterGriffinJin opened this issue · 3 comments
PeterGriffinJin commented
Hi there,
Thank you so much for the nice package!
Can I ask how to extract the figures from the pdf? I have tried:
recipe = CoreRecipe()
doc = recipe.run("papermage/tests/fixtures/2020.acl-main.447.pdf")
doc.figures
But this doesn't seem to return the figure data. Is figure extraction possible with your package?
Best,
Bowen
kyleclo commented
Hey @PeterGriffinJin, sorry, looks like a bug. Once #73 merges, it should be fixed. Thanks!
kyleclo commented
Just merged #73. Here's me testing out the recipe locally on that PDF to get Figures:
import pathlib

from papermage.recipes import CoreRecipe
from papermage.visualizers.visualizer import plot_entities_on_page

# Load the doc by running the recipe on the test-fixture PDF
recipe = CoreRecipe()
pdfpath = pathlib.Path(__file__).parent.parent / "tests/fixtures/2020.acl-main.447.pdf"
doc = recipe.from_pdf(pdf=pdfpath)

# Collect the figures that overlap the first page, then draw them on that page's image
page_id = 0
figures = doc.pages[page_id].intersect_by_box("figures")
plot_entities_on_page(page_image=doc.images[page_id], entities=figures)
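For anyone who wants the figure pixels rather than an overlay, one possible next step is turning each figure box into pixel coordinates for cropping. This is only a minimal sketch: it assumes boxes hold page-relative `l`, `t`, `w`, `h` values in the range 0 to 1, which this thread does not confirm, and `relative_box_to_pixels` is a hypothetical helper, not part of papermage.

```python
def relative_box_to_pixels(l, t, w, h, page_width, page_height):
    """Convert a page-relative box to a (left, upper, right, lower) pixel tuple.

    Hypothetical helper: assumes l, t, w, h are fractions of the page size.
    """
    left = round(l * page_width)
    upper = round(t * page_height)
    right = round((l + w) * page_width)
    lower = round((t + h) * page_height)

    # Clamp to the page so a box that spills over the edge still yields a valid crop.
    left = max(0, min(left, page_width))
    right = max(0, min(right, page_width))
    upper = max(0, min(upper, page_height))
    lower = max(0, min(lower, page_height))
    return (left, upper, right, lower)


if __name__ == "__main__":
    # A box covering the middle half of a 1000 x 800 pixel page image.
    print(relative_box_to_pixels(0.25, 0.5, 0.5, 0.25, 1000, 800))
```

The returned tuple is in the (left, upper, right, lower) order that Pillow's `Image.crop` expects, so it could be passed to a PIL page image to cut out the figure.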
kyleclo commented
I'm gonna close this for now; please re-open if it's not resolved. Thanks!