explosion/spaCy

KeyError when serializing a doc object after adding a new entity label

emsrc opened this issue · 3 comments

emsrc commented

I'm trying to add new entity labels and then set entity spans that use them. However, this results in a KeyError from doc.to_bytes(). Minimal code example below:

# python3 + spacy 0.101.0

import spacy

nlp = spacy.load('en')
doc = nlp('This is a sentence about pasta.')

# Register the new label and look up its ID in the string store.
label = 'Food'
nlp.entity.add_label(label)
label_id = nlp.vocab.strings[label]
print(label_id)

# Token span 5..6 covers 'pasta'.
doc.ents = [(label_id, 5, 6)]
print(doc.ents)

byte_string = doc.to_bytes()

Output:

6832
(pasta,)
Traceback (most recent call last):
  File "/Users/work/Projects/ScienceIE/scienceie17/exps/crf0/minimal.py", line 18, in <module>
    byte_string = doc.to_bytes()
  File "spacy/tokens/doc.pyx", line 418, in spacy.tokens.doc.Doc.to_bytes (spacy/tokens/doc.cpp:10687)
  File "spacy/serialize/packer.pyx", line 110, in spacy.serialize.packer.Packer.pack (spacy/serialize/packer.cpp:5687)
  File "spacy/serialize/huffman.pyx", line 61, in spacy.serialize.huffman.HuffmanCodec.encode (spacy/serialize/huffman.cpp:2535)
KeyError: 6832

Added a fix for this, but the situation's pretty messy. The serializer expects a list of attribute frequencies, so that it can build a Huffman tree. So it wants to know what entity labels are available, and how common they are. Once the Huffman trees are built, they can't be modified without changing the encoding.

The result is that if you serialize some documents, add an entity label, and then serialize some more, the two sets of documents won't be consistently encoded. So uh...don't do that :p.
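For intuition, here's a toy sketch of why a fixed Huffman table fails on a symbol it has never seen. This is not spaCy's actual code; the build_codes helper and the label frequencies are made up for illustration:

import heapq
import itertools

def build_codes(freqs):
    # Build a Huffman code table from a symbol -> frequency mapping.
    tiebreak = itertools.count()  # so heapq never has to compare the dicts
    heap = [(f, next(tiebreak), {sym: ''}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: '0' + c for s, c in left.items()}
        merged.update({s: '1' + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

# The table is built once, from the frequencies known at that moment.
codes = build_codes({'PERSON': 120, 'ORG': 80, 'GPE': 60})

def encode(labels):
    return ''.join(codes[l] for l in labels)

print(encode(['PERSON', 'ORG']))  # fine
print(encode(['Food']))           # KeyError: 'Food' -- no code was ever assigned

A label added after the table is built has no code assigned, so the lookup fails with exactly the KeyError in the traceback above.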

I suggest adding your custom entity labels as soon as possible after loading the pipeline. That's probably the best way to work around the brittleness here until the underlying design improves. The serializer is probably rather over-engineered.
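Something like this, reusing the 0.101.0-style API from the example above (on a build that includes the fix):

# python3 + spacy 0.101.0
import spacy

nlp = spacy.load('en')

# Register all custom labels right after loading, before creating or
# serializing any Doc objects, so the frequency table already knows
# about them when the Huffman trees get built.
nlp.entity.add_label('Food')

doc = nlp('This is a sentence about pasta.')
doc.ents = [(nlp.vocab.strings['Food'], 5, 6)]

byte_string = doc.to_bytes()  # should no longer raise KeyError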

emsrc commented

Got it. Thanks!

lock commented

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.