explosion/sense2vec

Is there any way to use "doc.spans" in 01_parse.py?

nonstoprunning opened this issue · 0 comments

Hi,
I am trying to build a sense2vec model with new data. I have made a few changes to 01_parse.py.
First, I removed the default ner pipe that comes with "en_core_web_lg".
Then I added a new Language.component in which I identify Spans associated with new entities (new labels) in a doc.
Sometimes I would like to assign the same Span[x, y] to more than one entity label, but I cannot, because doc.ents does not allow overlapping spans.
My question:
I have read about the new changes in spaCy v3.1. Is there a way to use "doc.spans" (or something similar) in 01_parse.py so that spaCy's internal algorithms take overlapping Spans into account?

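from spacy.language import Language
from spacy.tokens import Span

# `matcher` is a spaCy matcher (e.g. Matcher or PhraseMatcher) created elsewhere in the script.
@Language.component("name_comp")
def my_component(doc):
    matches = matcher(doc)
    seen_tokens = set()
    new_entities = []
    entities = doc.ents
    for match_id, start, end in matches:
        # `end` is exclusive, so the last token of the match is end - 1
        if start not in seen_tokens and end - 1 not in seen_tokens:
            new_entities.append(Span(doc, start, end, label=match_id))
            # drop existing entities that overlap the new match
            entities = [
                e for e in entities if not (e.start < end and e.end > start)
            ]
            seen_tokens.update(range(start, end))
    # doc.ents only accepts non-overlapping spans
    doc.ents = tuple(entities) + tuple(new_entities)
    return doc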
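
To make the question concrete, here is a minimal sketch of the behaviour I am after. It only shows that a span group in doc.spans can hold two Spans over the same tokens with different labels, while doc.ents rejects them. The blank English pipeline, the example sentence, the group key "custom" and the labels LABEL_A and LABEL_B are all made up for illustration. It says nothing about whether 01_parse.py or sense2vec reads doc.spans, which is exactly what I would like to know.

```python
import spacy
from spacy.tokens import Span

# Blank pipeline: tokenizer only, no trained model needed (illustrative setup).
nlp = spacy.blank("en")
doc = nlp("New York is a large city")

# Two spans over the same tokens (0..2) with different, made-up labels.
span_a = Span(doc, 0, 2, label="LABEL_A")
span_b = Span(doc, 0, 2, label="LABEL_B")

# A span group in doc.spans can hold overlapping (even identical) spans.
# "custom" is an arbitrary key chosen for this sketch.
doc.spans["custom"] = [span_a, span_b]
print([(s.text, s.label_) for s in doc.spans["custom"]])

# doc.ents, by contrast, requires non-overlapping spans and raises otherwise.
try:
    doc.ents = [span_a, span_b]
    ents_accept_overlap = True
except ValueError:
    ents_accept_overlap = False
print("doc.ents accepts overlap:", ents_accept_overlap)
```

In my component this would mean writing the matches to doc.spans["custom"] instead of doc.ents, but then I do not know how 01_parse.py would pick them up.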

Thanks in advance,
Paula