Deserialization is inconsistent for empty documents
zifeishan opened this issue · 2 comments
zifeishan commented
Issue body
import spacy.en
from spacy.tokens.doc import Doc
nlp = spacy.en.English()
doc = nlp('', tag=True, parse=True)
bytes = doc.to_bytes()
doc2 = Doc(nlp.vocab)
doc2.from_bytes(bytes)
Result:
>>> doc.is_parsed
True
>>> doc2.is_parsed
False
>>> [_ for _ in doc.sents]
[]
>>> [_ for _ in doc2.sents]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 1, in <listcomp>
File "spacy/tokens/doc.pyx", line 395, in __get__ (spacy/tokens/doc.cpp:9506)
ValueError: sentence boundary detection requires the dependency parse, which requires data to be installed. If you haven't done so, run:
python -m spacy.en.download all
to install the data
Your Environment
- Operating System: Linux
- Python Version Used: 3.5
- spaCy Version Used: latest pip release
- Environment Information:
honnibal commented
Thanks! Mixed feelings about my solution to this. I'm now considering empty docs to be parsed and tagged, because there's no information for a tagger or parser to add.
lock commented
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.