pdf_reader with PdfminerParser, page_numbers argument
Hello,
Could you please tell me what is wrong with the function below? I would like to parse only the first two pages of the PDF, but when I call the function with page_numbers=[0,1] it extracts text from all pages anyway.
The function is very slow, so I would like to limit the number of pages parsed.
def spacy_extractor(label, pattern_name, list_name, pdf_path, pdf_name,
                    filtered_list, page_numbers):
    patterns = [{'label': label, 'pattern': pattern_name} for pattern_name in list_name]
    ruler.add_patterns(patterns)
    doc = pdf_reader(os.path.join(pdf_path, pdf_name), nlp, PdfminerParser, page_numbers)
    filtered_list = [ent.text for ent in doc.ents if ent.label_ == label]
    return filtered_list[0] if filtered_list else None

cover_page_legal_form = spacy_extractor(label='LEG', pattern_name='legal_form', list_name=legal_form_list,
                                        pdf_path=fs_path_pdf, pdf_name=fs_name_pdf,
                                        filtered_list='legal_forms_filtered', page_numbers=[0, 1])
Thank you,
Hi @DoubleCortado - thank you for sharing the issue. Can you please share a reproducible example, including the error message? See this page for advice: https://stackoverflow.com/help/minimal-reproducible-example.
I did, however, take a shot at recreating the problem myself. Can you confirm if this is what you were seeing as well?
import spacy
from spacypdfreader import pdf_reader
nlp = spacy.load("en_core_web_sm")
# Extract all pages - works
doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp)
print(doc._.page_range)
# (1, 4)
# Extract specific pages - will raise an error
doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp, page_numbers=[0, 1])
print(doc._.page_range)
Traceback (most recent call last):
  File "/Users/samedwardes/projects/personal/spacypdfreader/test.py", line 11, in <module>
    doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp, page_numbers=[0, 1])
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/samedwardes/projects/personal/spacypdfreader/spacypdfreader/spacypdfreader.py", line 158, in pdf_reader
    text = pdf_parser(pdf_path=pdf_path, page_number=page_num, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/samedwardes/projects/personal/spacypdfreader/spacypdfreader/parsers/pdfminer.py", line 60, in parser
    text = extract_text(pdf_path, page_numbers=[page_number], **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: pdfminer.high_level.extract_text() got multiple values for keyword argument 'page_numbers'
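This does look like a bug. The issue is that I already set the value of page_numbers myself inside the pdfminer parser (the extract_text call shown in the traceback above), so a page_numbers value passed by the caller through **kwargs collides with it.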
I think the behaviour to only parse 1 page at a time is required to keep the multiprocessing simple. However, I can see if there is a way to have support for only extracting certain pages.
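To illustrate the collision without needing pdfminer installed, here is a minimal sketch. The names `extract_text_stub`, `parse_page`, `parse_page_safe`, and `collision_message` are hypothetical stand-ins, not functions from spacypdfreader or pdfminer; only the shape of the failing call is taken from the traceback. The guard in `parse_page_safe` is one possible way to avoid the error, not necessarily how the library handles it.

```python
def extract_text_stub(pdf_path, page_numbers=None, **kwargs):
    """Hypothetical stand-in for pdfminer.high_level.extract_text."""
    return f"{pdf_path}: pages {page_numbers}"


def parse_page(pdf_path, page_number, **kwargs):
    """Mirrors the failing call: page_numbers is set here and may also arrive in kwargs."""
    return extract_text_stub(pdf_path, page_numbers=[page_number], **kwargs)


def parse_page_safe(pdf_path, page_number, **kwargs):
    """One possible guard: drop a caller-supplied page_numbers before forwarding kwargs."""
    kwargs.pop("page_numbers", None)
    return extract_text_stub(pdf_path, page_numbers=[page_number], **kwargs)


def collision_message(pdf_path):
    """Return the TypeError message raised when page_numbers is supplied twice."""
    try:
        parse_page(pdf_path, 0, page_numbers=[0, 1])
    except TypeError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(collision_message("example.pdf"))
    print(parse_page_safe("example.pdf", 0, page_numbers=[0, 1]))
```

Because the parser handles one page per call, the caller's page selection has to be applied one level up, when deciding which pages to loop over, rather than being forwarded to the parser.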
Hi @DoubleCortado - I have released a new version (0.3.1) that supports a new parameter called page_range. Could you update to 0.3.1 and give it a try?
import spacy
from spacypdfreader import pdf_reader
from spacypdfreader.parsers import pytesseract
nlp = spacy.load("en_core_web_sm")
doc = pdf_reader(
    "tests/data/test_pdf_01.pdf",
    nlp,
    page_range=(2, 3)
)
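If you were previously thinking in terms of pdfminer-style zero-based page_numbers, the rough sketch below shows how a range could map onto them. It assumes page_range is 1-based and inclusive at both ends, which is an assumption on my part for this example; `page_range_to_page_numbers` is a hypothetical helper and not part of spacypdfreader.

```python
def page_range_to_page_numbers(page_range):
    """Convert a (start, end) range to a list of zero-based page numbers.

    Assumption: the range is 1-based and inclusive at both ends.
    """
    start, end = page_range
    if start < 1 or end < start:
        raise ValueError(f"invalid page range: {page_range!r}")
    # Shift down by one so that page 1 becomes index 0.
    return list(range(start - 1, end))


if __name__ == "__main__":
    print(page_range_to_page_numbers((2, 3)))
```

Under that assumption, the first two pages requested in the original question would correspond to page_range=(1, 2).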
Hello, thank you for the update.
I am not sure why I could use the previous version of the package with Python 3.12, but now when I try to update the package to 0.3.1 I get this error:
ERROR: Ignored the following versions that require a different python version: 0.3.0 Requires-Python >=3.8,<3.12; 0.3.1 Requires-Python >=3.8,<3.12
ERROR: Could not find a version that satisfies the requirement spacypdfreader==0.3.1 (from versions: 0.1.0, 0.1.1, 0.2.0, 0.2.1)
ERROR: No matching distribution found for spacypdfreader==0.3.1
Right now I only test against 3.8 to 3.11: https://github.com/SamEdwardes/spacypdfreader/blob/main/.github/workflows/pytest.yml
This is a good callout, though: Python 3.12 should work as well. I can fix this in a future release. I added this issue to track it: #21
For now, can you use an older version of Python?
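For reference, the pip error comes from the Requires-Python constraint >=3.8,<3.12 quoted in the message. The small sketch below checks an interpreter version against that window; `satisfies_requires_python` is a hypothetical helper written for this example, and pip's real resolution logic is more general than this.

```python
import sys


def satisfies_requires_python(version, minimum=(3, 8), upper_exclusive=(3, 12)):
    """Check a (major, minor, ...) version against >=minimum,<upper_exclusive."""
    major_minor = tuple(version[:2])
    return minimum <= major_minor < upper_exclusive


if __name__ == "__main__":
    # False on Python 3.12 and later, which is why pip skips 0.3.0 and 0.3.1 there.
    print(satisfies_requires_python(sys.version_info))
```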