Future-House/paper-qa

[zotero] issue with item.num_pages

andreifoldes opened this issue · 1 comment

Hello,

Every now and again I get the following error message during my ingestion process. Maybe something is wrong with the PDF?

i=0
for item in zotero.iterate(start=129,limit=900):
    i+=1
    print("Adding", item.title, i)
    if item.num_pages > 30:
        continue  # skip long papers
    docs.add(item.pdf, docname=item.key)

output:

Traceback (most recent call last):

  Cell In[52], line 1
    for item in zotero.iterate(start=129,limit=900):

  File ~/anaconda3/envs/paperqa/lib/python3.11/site-packages/paperqa/contrib/zotero.py:257 in iterate
    num_pages=count_pdf_pages(pdf),

  File ~/anaconda3/envs/paperqa/lib/python3.11/site-packages/paperqa/utils.py:66 in count_pdf_pages
    num_pages = len(pdf_reader.pages)

  File ~/anaconda3/envs/paperqa/lib/python3.11/site-packages/pypdf/_page.py:2435 in __len__
    return self.length_function()

  File ~/anaconda3/envs/paperqa/lib/python3.11/site-packages/pypdf/_reader.py:456 in _get_num_pages
    self._flatten()

  File ~/anaconda3/envs/paperqa/lib/python3.11/site-packages/pypdf/_reader.py:1213 in _flatten
    catalog = self.trailer[TK.ROOT].get_object()

  File ~/anaconda3/envs/paperqa/lib/python3.11/site-packages/pypdf/generic/_data_structures.py:309 in __getitem__
    return dict.__getitem__(self, key).get_object()

KeyError: '/Root'

This looks to be a problem with https://github.com/py-pdf/pypdf. We just released version 5 of paper-qa, which rewrites a lot of stuff and updates our dependencies. We no longer depend on pypdf; we use pymupdf instead.

As this issue is no longer relevant in the latest paper-qa, I am going to close it out. If the problem persists, please open a new issue using paper-qa>=5.
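
For anyone who still hits this on an older install, here is a minimal sketch of a defensive page counter. It is not part of paper-qa: `safe_page_count` and `pypdf_counter` are hypothetical helper names, and the sketch assumes you already have the path to each PDF. Note that the traceback above shows the KeyError being raised inside zotero.iterate itself, so a try/except in the loop body would not catch it; this helper is only useful for pre-checking files yourself.

```python
from typing import Callable, Optional


def safe_page_count(path: str, counter: Callable[[str], int]) -> Optional[int]:
    """Return counter(path), or None if the PDF cannot be parsed.

    Hypothetical helper, not a paper-qa or pypdf API.
    """
    try:
        return counter(path)
    except Exception as exc:  # e.g. KeyError: '/Root' from a damaged trailer
        print(f"Skipping {path}: {type(exc).__name__}: {exc}")
        return None


def pypdf_counter(path: str) -> int:
    """Count pages with pypdf, the same way the traceback above does."""
    # Imported lazily so safe_page_count works without pypdf installed.
    from pypdf import PdfReader

    return len(PdfReader(path).pages)
```

A call such as `safe_page_count("paper.pdf", pypdf_counter)` would then return None for an unreadable file instead of raising, so the file can be skipped or inspected by hand.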