SamEdwardes/spacypdfreader

Loss of token/document tensor at least with PDFMiner

omarbenhamid opened this issue · 2 comments

Hello,

Thank you for this useful library!

The issue

I ran into a problem with the following code:

import spacy
from spacypdfreader import pdf_reader

nlp = spacy.load("fr_core_news_sm")
doc = pdf_reader('9.PADD_SCOT RM.pdf', nlp)
doc.tensor

I get an empty tensor.

Whereas:

import spacy
from pdfminer import high_level

nlp = spacy.load("fr_dep_news_trf")
path = '9.PADD_SCOT RM.pdf'
doc = nlp(high_level.extract_text(path))
doc.tensor

Returns the right tensor.

Reason

The issue seems to come from the fact that pdf_reader processes each page as a separate document and then merges them with Doc.from_docs. It turns out that Doc.from_docs does not preserve Doc.tensor (but it is not found).

Hi omarbenhamid - thank you for creating this issue and looking into the problem. I have never encountered this use case, but your explanation makes sense.

The reason each page is processed as a document is so that spacypdfreader can create the page attributes:

  • token._.page_number
  • doc._.page_range
  • doc._.first_page
  • doc._.last_page
  • doc._.pdf_file_name
  • doc._.page(int)

In your use case - do you still require the page number attributes? I think there are a few options:

  1. Update spacypdfreader so that it re-runs at least some of the NLP pipeline after using Doc.from_docs, so that the doc object has a tensor, but without overwriting the page number attributes. (I am not sure yet how to actually do this, but I imagine it can be done.)
  2. Add a parameter to spacypdfreader.pdf_reader that skips adding the page number attributes and instead runs the NLP pipeline on the entire text at once. This would give a result similar to your example above.

Please let me know if you have any other ideas or suggestions.

Hello SamEdwardes,
I opened a discussion with the people at Explosion about the behaviour of Doc.from_docs. They are considering whether to fix it in spaCy directly.

The discussion is here: explosion/spaCy#10597

Let's wait and see if they come up with a solution.

I worked around the issue on my side by using PDFMiner directly, although that means I lose the page information.
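One possible way to keep page information while still running the pipeline once over the whole text (in the spirit of option 2 above) is to record the character offset at which each page starts and look tokens up against those offsets afterwards. The following is only a minimal sketch, not how spacypdfreader works: the helper names `join_pages` and `page_of_offset` and the `PAGE_SEPARATOR` value are hypothetical, and it assumes the text of each page has already been extracted separately.

```python
from bisect import bisect_right
from itertools import accumulate

# Assumption: pages are joined with a blank line; any separator works
# as long as the same one is used when computing the offsets.
PAGE_SEPARATOR = "\n\n"


def join_pages(page_texts, separator=PAGE_SEPARATOR):
    """Join per-page texts and return (full_text, page_starts).

    page_starts[i] is the character offset in full_text at which
    page i + 1 begins.
    """
    lengths = [len(text) + len(separator) for text in page_texts]
    # Running totals give the start of the *next* page, so drop the last one.
    page_starts = [0] + list(accumulate(lengths))[:-1]
    return separator.join(page_texts), page_starts


def page_of_offset(char_offset, page_starts):
    """Return the 1-based page number containing the given character offset.

    Separator characters are attributed to the page that precedes them.
    """
    return bisect_right(page_starts, char_offset)


full_text, starts = join_pages(["Page one.", "Second page here."])
print(page_of_offset(full_text.index("Second"), starts))  # prints 2
```

With spaCy, `full_text` would be passed to `nlp` once, so `doc.tensor` comes from a single pipeline run, and each token's page could then be looked up with `page_of_offset(token.idx, starts)`, since `token.idx` is the token's character offset in the document text.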