SamEdwardes/spacypdfreader

Fails to import PDF document

Closed this issue · 12 comments

arky commented

Unable to import this PDF document using spacypdfreader. The import results in high CPU usage and causes the system to hang.

https://docs.aiddata.org/ad4/pdfs/Banking_on_the_Belt_and_Road_Executive_Summary.pdf

Can you please provide a reproducible example and include the output?

arky commented

@SamEdwardes Here is a simple test case I used: https://gist.github.com/arky/c91d20a8769846aec32262c76eea815d
The issue surfaces when processing a PDF with a large number of pages: the program runs forever, consuming all available CPU.

Thank you @arky. Are you able to share the specific code and output related to the issue? Your gist refers to "test.pdf", and the PDF you shared in your first message is only 3 pages.

arky commented

@SamEdwardes I was able to reproduce the error using the following file as a test case.
Code snippet: https://gist.github.com/arky/c91d20a8769846aec32262c76eea815d
Test-case: https://docs.aiddata.org/ad4/pdfs/Banking_on_the_Belt_and_Road__Insights_from_a_new_global_dataset_of_13427_Chinese_development_projects.pdf
However, any PDF with a sufficiently large number of pages should trigger similar problems.

Unfortunately, I wasn't able to capture any debug logs because the process becomes unresponsive.
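As a general Python debugging aside (not specific to spacypdfreader): the stdlib `faulthandler` module can dump the stack of a process that appears hung, without killing it. This sketch registers a handler so that sending `SIGUSR1` to the process prints where every thread currently is; it is Unix-only and self-triggers the signal just to demonstrate the output.

```python
import faulthandler
import os
import signal
import tempfile

# Register a handler so that `kill -USR1 <pid>` makes the (apparently
# hung) process dump every thread's stack without terminating it.
# Unix-only sketch; on Windows, faulthandler.dump_traceback_later()
# is the closer equivalent.
with tempfile.TemporaryFile(mode="w+") as f:
    faulthandler.register(signal.SIGUSR1, file=f)
    os.kill(os.getpid(), signal.SIGUSR1)  # simulate `kill -USR1` from another shell
    faulthandler.unregister(signal.SIGUSR1)
    f.seek(0)
    dump = f.read()

# The dump shows the currently executing frames, e.g.
# "Current thread 0x... (most recent call first): ..."
print("most recent call first" in dump)
```

In practice you would call `faulthandler.register(...)` once at startup (writing to stderr or a log file) and send the signal from another shell when the process seems stuck, which tells you whether it is hung or merely slow.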

Thank you for providing the updated PDF. The new PDF is 166 pages. Here is the code I ran:

import requests
import spacy
from rich import print
from spacypdfreader import pdf_reader

# download the pdf
url = 'https://docs.aiddata.org/ad4/pdfs/Banking_on_the_Belt_and_Road__Insights_from_a_new_global_dataset_of_13427_Chinese_development_projects.pdf'
r = requests.get(url, stream=True)

with open('test.pdf', 'wb') as f:
    f.write(r.content)

# load the spacy model and convert the pdf to a spacy Doc
nlp = spacy.load('en_core_web_sm')
doc = pdf_reader('test.pdf', nlp)

# View the results
page_count = doc._.last_page
for page in range(1, page_count + 1):
    print(page)
    print(doc._.page(page)[0:50])

This code did execute for me, but it took 4 minutes and 49 seconds.

Agreed - it is very slow. Unfortunately, PDF-to-text extraction in general is slow. I have a few ideas:

I also have an open issue to implement multiprocessing, which would likely speed things up (#8).
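The multiprocessing idea in #8 amounts to fanning per-page text extraction out over a worker pool and stitching the pages back together in order. A minimal sketch of that pattern, not spacypdfreader's actual implementation: `extract_page_text` here is a hypothetical stand-in for a real per-page PDF-to-text call (e.g. pdfminer's `extract_text` with `page_numbers`), and a thread pool is used only to keep the sketch self-contained; CPU-bound extraction would want a process pool.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_page_text(page_number: int) -> str:
    # Hypothetical stand-in for a real per-page extraction call,
    # e.g. pdfminer's extract_text(path, page_numbers=[page_number]).
    return f"text of page {page_number}"


def extract_all_pages(page_count: int, workers: int = 4) -> list[str]:
    # map() preserves input order, so pages come back as 1..page_count
    # even though they are extracted concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract_page_text, range(1, page_count + 1)))


print(extract_all_pages(3))
# ['text of page 1', 'text of page 2', 'text of page 3']
```

For real CPU-bound work, swapping `ThreadPoolExecutor` for `concurrent.futures.ProcessPoolExecutor` keeps the same `map` pattern while sidestepping the GIL.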

arky commented

Thank you @SamEdwardes for doing the research. Perhaps, for now, a note about handling large documents could be added to the documentation.

Good suggestion thank you Arky!

arky commented

@SamEdwardes Touching base to see if we can resolve this issue, either by implementing multiprocessing or by expanding the docs as a stop-gap measure.

Thanks!

@arky thank you for the reminder! I can make an update to the docs today!

arky commented

@SamEdwardes You are most welcome, please let me know if I could help in any way.

I added a tip to the docs: 6d9f5b7

I think we can close this issue now.

arky commented

Thank you @SamEdwardes