[BUG] Dependency mismatch `PyPDF2` | llama_index ≥ 0.6.8 PDFReader dependency has changed to pypdf instead of PyPDF2

Question

JoNilsson opened this issue a year ago · 3 comments

Describe the bug

ValueError: Could not load webpage
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/llama_index/readers/file/docs_reader.py", line 21, in load_data
    import pypdf
ModuleNotFoundError: No module named 'pypdf'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/models/index_model.py", line 563, in set_file_index
    index = await self.loop.run_in_executor(
  File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/site-packages/models/index_model.py", line 362, in index_file
    document = SimpleDirectoryReader(input_files=[file_path]).load_data()
  File "/usr/local/lib/python3.10/site-packages/llama_index/readers/file/base.py", line 192, in load_data
    docs = reader.load_data(input_file, extra_info=metadata)
  File "/usr/local/lib/python3.10/site-packages/llama_index/readers/file/docs_reader.py", line 23, in load_data
    raise ImportError(
ImportError: pypdf is required to read PDF files: `pip install pypdf`

See this issue for more context.
https://github.com/jerryjliu/llama_index/issues/3735

PyPDF2 is deprecated.

See here: https://pypi.org/project/PyPDF2/
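
For anyone triaging this, a quick way to see which of the two packages is actually present in the running environment is sketched below. It uses only the standard library; the function name `installed_pdf_packages` is made up for illustration and is not part of llama_index or pypdf.

```python
from importlib import metadata


def installed_pdf_packages(candidates=("pypdf", "PyPDF2")):
    """Map each candidate distribution name to its installed version, or None.

    Hypothetical helper for diagnosis only; not part of any library.
    """
    found = {}
    for name in candidates:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            found[name] = None
    return found


if __name__ == "__main__":
    for name, version in installed_pdf_packages().items():
        print(f"{name}: {version or 'not installed'}")
```

If `pypdf` shows as not installed while `PyPDF2` has a version, that would be consistent with the `ModuleNotFoundError` in the traceback above.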

To Reproduce

Try to index a PDF. Indexing fails with the ImportError shown above.

Expected behavior

PDFs are indexable.

Answer 1 · 2023-06-29T22:40:02.000Z

#333 reverts the PyPDF2 dependency, but it seems to have introduced new issues related to a method mismatch in lla
The error below was generated after pulling the merged changes from b0c4e26 and attempting a PDF index again.

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/models/index_model.py", line 563, in set_file_index
    index = await self.loop.run_in_executor(
  File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/site-packages/models/index_model.py", line 362, in index_file
    document = SimpleDirectoryReader(input_files=[file_path]).load_data()
  File "/usr/local/lib/python3.10/site-packages/llama_index/readers/file/base.py", line 192, in load_data
    docs = reader.load_data(input_file, extra_info=metadata)
  File "/usr/local/lib/python3.10/site-packages/llama_index/readers/file/docs_reader.py", line 38, in load_data
    page_label = pdf.page_labels[page]
AttributeError: 'PdfReader' object has no attribute 'page_labels'

The offending line can be found here.

https://github.com/jerryjliu/llama_index/blob/d394ffd5b57b976192f002f52fc9315401b4aa09/llama_index/readers/file/docs_reader.py#L38

Edit: issue raised in llama_index:
https://github.com/jerryjliu/llama_index/issues/6649

Answer 2 · 2023-06-30T05:19:52.000Z

I don't have this issue. pypdf.PdfReader has the page_labels attribute, as seen here:
https://github.com/py-pdf/pypdf/blob/main/pypdf/_reader.py#L1087-L1095

Which pdf are you using?
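
One way to check this in the affected environment, as a rough sketch: look up the class in whatever module actually gets imported and test for the attribute. This assumes `page_labels` is defined on the class itself, as the linked source suggests; the helper name `reader_has_attribute` is hypothetical.

```python
import importlib


def reader_has_attribute(module_name="pypdf", class_name="PdfReader",
                         attribute="page_labels"):
    """Return True/False if the class has the attribute, or None if the
    module cannot be imported at all.

    Hypothetical diagnostic helper. It only sees attributes defined on the
    class (methods, properties), not ones set per instance in __init__.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None  # module is not installed in this environment
    cls = getattr(module, class_name, None)
    if cls is None:
        return False  # module exists but has no such class
    return hasattr(cls, attribute)


if __name__ == "__main__":
    for mod in ("pypdf", "PyPDF2"):
        print(mod, "->", reader_has_attribute(mod))
```

A `False` or `None` for `pypdf` here would point at an old or missing install rather than at the PDF file itself.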

Answer 3 · 2023-06-30T15:44:04.000Z

> I don't have this issue, pypdf.PdfReader contains the page_labels attribute as seen here. https://github.com/py-pdf/pypdf/blob/main/pypdf/_reader.py#L1087-L1095

I was chasing a red herring.
I'm not sure what I did wrong, but this morning I pulled the repo into a fresh directory, rebuilt, and redeployed, and the issue seems resolved. Unfortunately, indexing this PDF still fails, but the error is now unrelated to pypdf.