explosion/spaCy

Deserialization is inconsistent for empty documents

zifeishan opened this issue · 2 comments

Issue body

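import spacy.en
from spacy.tokens.doc import Doc

nlp = spacy.en.English()
# Process an empty string with the tagger and parser enabled.
doc = nlp('', tag=True, parse=True)
# Round-trip the empty document through serialization.
data = doc.to_bytes()  # `data` avoids shadowing the builtin `bytes`
doc2 = Doc(nlp.vocab)
doc2.from_bytes(data)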

Result:

>>> doc.is_parsed
True
>>> doc2.is_parsed
False
>>> [_ for _ in doc.sents]
[]
>>> [_ for _ in doc2.sents]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
  File "spacy/tokens/doc.pyx", line 395, in __get__ (spacy/tokens/doc.cpp:9506)
ValueError: sentence boundary detection requires the dependency parse, which requires data to be installed. If you haven't done so, run:
python -m spacy.en.download all
to install the data

Your Environment

  • Operating System: Linux
  • Python Version Used: 3.5
  • spaCy Version Used: latest pip release
  • Environment Information:

Thanks! I have mixed feelings about my solution to this. I'm now treating empty docs as parsed and tagged, because there's no information for a tagger or parser to add.
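
For readers who want to see the idea in isolation, here is a minimal sketch in plain Python. It is an illustration only, not spaCy's actual code: `infer_is_parsed` and its `dep_labels` argument are hypothetical names, and the sketch assumes that the parsed flag of a loaded document is otherwise inferred from whether any token carries a dependency label.

```python
from typing import Sequence


def infer_is_parsed(dep_labels: Sequence[str]) -> bool:
    """Guess whether a deserialized document carries a dependency parse.

    Hypothetical helper, not part of spaCy. `dep_labels` holds one
    dependency label per token; an empty string means "no label".
    """
    if len(dep_labels) == 0:
        # An empty document has nothing for a parser to add,
        # so it is treated as already parsed.
        return True
    # Assumption: otherwise the document counts as parsed
    # if at least one token has a dependency label.
    return any(label != "" for label in dep_labels)


print(infer_is_parsed([]))                  # empty document
print(infer_is_parsed(["nsubj", "ROOT"]))   # tokens with labels
print(infer_is_parsed(["", ""]))            # tokens without labels
```

Under this rule the empty document reports the same parsed state before and after a round trip, so iterating over its sentences would yield an empty list instead of raising.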
