SamEdwardes/spacypdfreader

Fails to import PDF document

Closed this issue · 12 comments

arky commented

Unable to import this PDF document using spacypdfreader. The import results in high CPU usage and causes the system to hang.

https://docs.aiddata.org/ad4/pdfs/Banking_on_the_Belt_and_Road_Executive_Summary.pdf

Can you please provide a reproducible example and include the output?

arky commented

@SamEdwardes Here is a simple test case I used: https://gist.github.com/arky/c91d20a8769846aec32262c76eea815d
The issue surfaces when processing a PDF with a large number of pages: the program runs forever, consuming all available CPU.

Thank you @arky. Are you able to share the specific code and output related to the issue? Your gist refers to "test.pdf", and the PDF you shared in your first message is only 3 pages.

arky commented

@SamEdwardes I was able to reproduce the error using the following file as a test case.
Code snippet: https://gist.github.com/arky/c91d20a8769846aec32262c76eea815d
Test-case: https://docs.aiddata.org/ad4/pdfs/Banking_on_the_Belt_and_Road__Insights_from_a_new_global_dataset_of_13427_Chinese_development_projects.pdf
However, any PDF with a sufficiently large number of pages should trigger similar problems.

Unfortunately, I wasn't able to capture any debug logs because the process becomes unresponsive.
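As a general Python debugging aside (not specific to spacypdfreader): the stdlib `faulthandler` module can dump the stack of a process that appears hung, without killing it. This sketch registers a handler so that sending `SIGUSR1` to the process prints where every thread currently is; it is Unix-only and self-triggers the signal just to demonstrate the output.

```python
import faulthandler
import os
import signal
import tempfile

# Register a handler so that `kill -USR1 <pid>` makes the (apparently
# hung) process dump every thread's stack without terminating it.
# Unix-only sketch; on Windows, faulthandler.dump_traceback_later()
# is the closer equivalent.
with tempfile.TemporaryFile(mode="w+") as f:
    faulthandler.register(signal.SIGUSR1, file=f)
    os.kill(os.getpid(), signal.SIGUSR1)  # simulate `kill -USR1` from another shell
    faulthandler.unregister(signal.SIGUSR1)
    f.seek(0)
    dump = f.read()

# The dump shows the currently executing frames, e.g.
# "Current thread 0x... (most recent call first): ..."
print("most recent call first" in dump)
```

In practice you would call `faulthandler.register(...)` once at startup (writing to stderr or a log file) and send the signal from another shell when the process seems stuck, which tells you whether it is hung or merely slow.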

Thank you for providing the updated PDF. The new PDF is 166 pages. Here is the code I ran:

import requests
import spacy
from rich import print
from spacypdfreader import pdf_reader

# download the pdf
url = 'https://docs.aiddata.org/ad4/pdfs/Banking_on_the_Belt_and_Road__Insights_from_a_new_global_dataset_of_13427_Chinese_development_projects.pdf'
r = requests.get(url, stream=True)

with open('test.pdf', 'wb') as f:
    f.write(r.content)

# load the spacy model and convert the pdf to a spacy Doc
nlp = spacy.load('en_core_web_sm')
doc = pdf_reader('test.pdf', nlp)

# View the results
page_count = doc._.last_page
for page in range(1, page_count + 1):
    print(page)
    print(doc._.page(page)[0:50])

This code did execute for me, but it took 4 minutes and 49 seconds.

Agreed - it is very slow. Unfortunately, PDF-to-text extraction in general is slow. I have a few ideas:

I also have an open issue to implement multiprocessing, which would likely speed things up (#8).
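The multiprocessing idea in #8 amounts to fanning per-page text extraction out over a worker pool and stitching the pages back together in order. A minimal sketch of that pattern, not spacypdfreader's actual implementation: `extract_page_text` here is a hypothetical stand-in for a real per-page PDF-to-text call (e.g. pdfminer's `extract_text` with `page_numbers`), and a thread pool is used only to keep the sketch self-contained; CPU-bound extraction would want a process pool.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_page_text(page_number: int) -> str:
    # Hypothetical stand-in for a real per-page extraction call,
    # e.g. pdfminer's extract_text(path, page_numbers=[page_number]).
    return f"text of page {page_number}"


def extract_all_pages(page_count: int, workers: int = 4) -> list[str]:
    # map() preserves input order, so pages come back as 1..page_count
    # even though they are extracted concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract_page_text, range(1, page_count + 1)))


print(extract_all_pages(3))
# ['text of page 1', 'text of page 2', 'text of page 3']
```

For real CPU-bound work, swapping `ThreadPoolExecutor` for `concurrent.futures.ProcessPoolExecutor` keeps the same `map` pattern while sidestepping the GIL.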

arky commented

Thank you @SamEdwardes for doing the research. Perhaps, for now, a note about handling large documents could be added to the documentation.

Good suggestion thank you Arky!

arky commented

@SamEdwardes Touching base to see if we can resolve this issue, either by implementing multiprocessing or by expanding the docs as a stop-gap measure.

Thanks!

@arky thank you for the reminder! I can make an update to the docs today!

arky commented

@SamEdwardes You are most welcome, please let me know if I could help in any way.

I added a tip to the docs: 6d9f5b7

I think we can close this issue now.

arky commented

Thank you @SamEdwardes