[BUG] Dependency mismatch `PyPDF2` | llama_index ≥ 0.6.8 PDFReader dependency has changed to pypdf instead of PyPDF2
JoNilsson opened this issue · 3 comments
Describe the bug
ValueError: Could not load webpage
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/llama_index/readers/file/docs_reader.py", line 21, in load_data
    import pypdf
ModuleNotFoundError: No module named 'pypdf'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/models/index_model.py", line 563, in set_file_index
    index = await self.loop.run_in_executor(
  File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/site-packages/models/index_model.py", line 362, in index_file
    document = SimpleDirectoryReader(input_files=[file_path]).load_data()
  File "/usr/local/lib/python3.10/site-packages/llama_index/readers/file/base.py", line 192, in load_data
    docs = reader.load_data(input_file, extra_info=metadata)
  File "/usr/local/lib/python3.10/site-packages/llama_index/readers/file/docs_reader.py", line 23, in load_data
    raise ImportError(
ImportError: pypdf is required to read PDF files: `pip install pypdf`
See this issue for more context: https://github.com/jerryjliu/llama_index/issues/3735
`PyPDF2` is deprecated. See here: https://pypi.org/project/PyPDF2/
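A quick way to confirm which PDF backend is actually importable in the deployed environment is sketched below. The helper name `installed_pdf_backends` is hypothetical and not part of llama_index or this project; it only uses the standard library to check whether `pypdf` and `PyPDF2` can be found.

```python
import importlib.util


def installed_pdf_backends(candidates=("pypdf", "PyPDF2")):
    """Return the names from `candidates` that are importable here.

    Hypothetical diagnostic helper, not part of llama_index.
    """
    return [
        name
        for name in candidates
        if importlib.util.find_spec(name) is not None
    ]


found = installed_pdf_backends()
if "pypdf" not in found:
    # Matches the hint in the ImportError above: `pip install pypdf`
    print("pypdf is missing; importable PDF backends:", found)
else:
    print("pypdf is importable; importable PDF backends:", found)
```

If only `PyPDF2` shows up, the environment was likely built against the older dependency and needs `pypdf` installed.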
To Reproduce
Try to index a PDF. Indexing fails with the traceback above.
Expected behavior
PDFs are indexable.
#333 reverts the `PyPDF2` dependence, but it seems to have introduced new issues related to a method mismatch in lla
The error below was generated after pulling the merged changes from b0c4e26 and attempting a PDF index again.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/models/index_model.py", line 563, in set_file_index
    index = await self.loop.run_in_executor(
  File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/site-packages/models/index_model.py", line 362, in index_file
    document = SimpleDirectoryReader(input_files=[file_path]).load_data()
  File "/usr/local/lib/python3.10/site-packages/llama_index/readers/file/base.py", line 192, in load_data
    docs = reader.load_data(input_file, extra_info=metadata)
  File "/usr/local/lib/python3.10/site-packages/llama_index/readers/file/docs_reader.py", line 38, in load_data
    page_label = pdf.page_labels[page]
AttributeError: 'PdfReader' object has no attribute 'page_labels'
The offending line can be found here.
edit: issue raised in `llama_index`: https://github.com/jerryjliu/llama_index/issues/6649
I don't have this issue; `pypdf.PdfReader` contains the `page_labels` attribute, as seen here:
https://github.com/py-pdf/pypdf/blob/main/pypdf/_reader.py#L1087-L1095
Which pdf are you using?
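One way to check whether the `PdfReader` class in a given environment exposes `page_labels` is sketched below. The function name `describe_reader` is hypothetical; the sketch assumes only that the module, if installed, exposes a `PdfReader` class, and it reports `None` when the module cannot be imported.

```python
import importlib


def describe_reader(module_name):
    """Summarize the PdfReader class exposed by `module_name`.

    Hypothetical diagnostic helper. Returns None if the module is not
    importable in the current environment.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    reader_cls = getattr(module, "PdfReader", None)
    return {
        "module": module_name,
        "version": getattr(module, "__version__", "unknown"),
        "has_pdf_reader": reader_cls is not None,
        "has_page_labels": hasattr(reader_cls, "page_labels"),
    }


for name in ("pypdf", "PyPDF2"):
    print(name, describe_reader(name))
```

Running this inside the deployed container shows whether the reader being loaded is the one that carries the attribute.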
> I don't have this issue; `pypdf.PdfReader` contains the `page_labels` attribute, as seen here: https://github.com/py-pdf/pypdf/blob/main/pypdf/_reader.py#L1087-L1095
I caught a red herring.
I'm not sure what I did wrong, but this morning I pulled the repo into a fresh directory, rebuilt, and redeployed, and the issue seems resolved. Unfortunately, indexing this PDF is still erroring, but the error is now unrelated to `pypdf`.