Error comes with some short phrases
Benja1972 opened this issue · 3 comments
Benja1972 commented
I get a strange error with this small example:
import spacy
nlp = spacy.blank('en')
nlp.add_pipe('dbpedia_spotlight', config={'confidence': 0.4})
txt = 'one must keep the working memory footprint'
doc = nlp(txt)
The error I get is as follows:
ValueError: [E1010] Unable to set entity information for token 5 which is included in more than one span in entities, blocked, missing or outside.
MartinoMensio commented
Hi @Benja1972 ,
Thanks for opening the issue. What is happening in your case is that, for this input, DBpedia Spotlight finds two entities that overlap, i.e. they share a word:
- working memory: http://dbpedia.org/resource/Memory_footprint
- memory footprint: http://dbpedia.org/resource/Memory_footprint
spaCy, on the other hand, does not allow overlapping spans to be set in doc.ents
and therefore throws this error.
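To make the overlap concrete, here is a minimal plain-Python sketch (no spaCy needed). It assumes whitespace tokenization and token offsets that I worked out by hand for this sentence; the helper name shared_tokens is just for illustration and is not part of spaCy or of this library.

```python
# Tokens of the example sentence, split on whitespace (assumed to match
# what the blank English pipeline produces for this input).
tokens = 'one must keep the working memory footprint'.split()

# Hand-derived (start, end) token offsets, end exclusive, for the two
# entities reported by DBpedia Spotlight.
spans = {
    'working memory': (4, 6),
    'memory footprint': (5, 7),
}


def shared_tokens(a, b):
    """Return the sorted token indices covered by both (start, end) spans."""
    return sorted(set(range(*a)) & set(range(*b)))


shared = shared_tokens(spans['working memory'], spans['memory footprint'])
print(shared, [tokens[i] for i in shared])
```

The only shared index is the one for "memory", which lines up with the "token 5" mentioned in the error message.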
What you can do in this case is turn off this library's default option overwrite_ents, which prevents the exception from being raised.
The outputs of DBpedia Spotlight will then only be saved in doc.spans['dbpedia_spotlight'] (or in another span group, which can be customised by passing the config argument span_group when initialising the pipeline stage).
import spacy
txt = 'one must keep the working memory footprint'
nlp = spacy.blank('en')
# disable overwriting the doc entities, so results only go to doc.spans
nlp.add_pipe('dbpedia_spotlight', config={'confidence': 0.4, 'overwrite_ents': False})
doc = nlp(txt)
print(doc.spans['dbpedia_spotlight'])
# optionally, you can also specify the name of the span group to be used
nlp = spacy.blank('en')
nlp.add_pipe('dbpedia_spotlight', config={'confidence': 0.4, 'overwrite_ents': False, 'span_group': 'foo'})
doc = nlp(txt)
print(doc.spans['foo'])
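If you still need non-overlapping entities afterwards (for example to fill doc.ents yourself), one option is to reduce the span group to a non-overlapping subset first. Below is a rough plain-Python sketch of a longest-span-first filter, with spans represented as hypothetical (start, end) token-index tuples. The function name filter_overlaps is made up for this example. spaCy provides spacy.util.filter_spans, which applies a similar rule to real Span objects, so in practice you would probably use that instead.

```python
def filter_overlaps(spans):
    """Keep a non-overlapping subset of (start, end) spans, end exclusive.

    Longer spans win; on equal length, the span that starts earlier wins.
    """
    # Longest first; ties broken in favour of the smaller start index.
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept = []
    seen = set()  # token indices already claimed by a kept span
    for start, end in ordered:
        if not any(i in seen for i in range(start, end)):
            kept.append((start, end))
            seen.update(range(start, end))
    # Return the survivors in document order.
    return sorted(kept)


# The two overlapping candidates from the example sentence
# (offsets assumed, derived by hand from whitespace tokens).
candidates = [(4, 6), (5, 7)]
print(filter_overlaps(candidates))
```

With the two candidates above, both spans have the same length, so the earlier one ("working memory") is kept and "memory footprint" is dropped. Whether that is the entity you want depends on your use case, which is why keeping everything in doc.spans is the safer default.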
Let me know if this works for you!
Best,
Martino
Benja1972 commented
Hi @MartinoMensio ,
Thank you for the clarification. I will try this approach.
Best regards
Sergei
MartinoMensio commented
@Benja1972 this is now also solved by #8 without requiring extra configuration.
Best,
Martino