allenai/papermage

How to extract figures from the PDF?

PeterGriffinJin opened this issue · 3 comments

Hi there,

Thank you so much for the nice package!

Can I ask how to extract the figures from the pdf? I have tried:

from papermage.recipes import CoreRecipe

recipe = CoreRecipe()
doc = recipe.run("papermage/tests/fixtures/2020.acl-main.447.pdf")
doc.figures

But this does not seem to return the figure data. Is figure extraction achievable with your package?

Best,
Bowen

Hey @PeterGriffinJin, sorry, this looks like a bug. Once #73 merges, it should be fixed. Thanks!

Just merged #73. Here's me testing out the recipe locally on that PDF to get Figures:

import pathlib

from papermage.recipes import CoreRecipe
from papermage.visualizers.visualizer import plot_entities_on_page

# load doc
recipe = CoreRecipe()
pdfpath = pathlib.Path(__file__).parent.parent / "tests/fixtures/2020.acl-main.447.pdf"
doc = recipe.from_pdf(pdf=pdfpath)
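page_id = 0
# select the figure entities whose boxes overlap the first page
figures = doc.pages[page_id].intersect_by_box("figures")
# draw those figure boxes on top of the rendered page image
plot_entities_on_page(page_image=doc.images[page_id], entities=figures)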
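
The snippet above only overlays the figure boxes on the page. If the goal is to cut the figure pixels out of the page image, the boxes first have to be turned into pixel coordinates. Here is a minimal sketch of that step, assuming each figure box carries relative coordinates (left, top, width, height as fractions of the page); that layout is an assumption, not something confirmed in this thread. `RelBox` and `to_pixel_crop` are hypothetical helpers for illustration, not part of papermage.

```python
from typing import NamedTuple, Tuple


class RelBox(NamedTuple):
    """Hypothetical stand-in for a figure box in relative page coordinates (0..1)."""
    l: float  # left edge as a fraction of page width
    t: float  # top edge as a fraction of page height
    w: float  # width as a fraction of page width
    h: float  # height as a fraction of page height


def to_pixel_crop(box: RelBox, page_width: int, page_height: int) -> Tuple[int, int, int, int]:
    """Convert a relative box to a (left, upper, right, lower) pixel tuple, clamped to the page."""
    left = max(0, round(box.l * page_width))
    upper = max(0, round(box.t * page_height))
    right = min(page_width, round((box.l + box.w) * page_width))
    lower = min(page_height, round((box.t + box.h) * page_height))
    return left, upper, right, lower


# Example: a box covering half the page width on a 1000 x 2000 pixel page image.
crop = to_pixel_crop(RelBox(l=0.1, t=0.2, w=0.5, h=0.25), page_width=1000, page_height=2000)
print(crop)  # (100, 400, 600, 900)
```

The returned tuple is in the (left, upper, right, lower) order that Pillow's `Image.crop` expects, so it could be passed there once you have a PIL image for the page. How to get that PIL image out of `doc.images[page_id]` is not shown in this thread, so it is left out here.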

I'm gonna close this for now; please re-open if it's not resolved. Thanks!