spaCy v2.0 extension and pipeline component for adding Named Entities metadata to Doc objects. Detects Named Entities using dictionaries. The extension sets the custom Doc, Token and Span attributes ._.is_entity, ._.entity_type, ._.has_entities and ._.entities.
Named Entities are matched using the Python module flashtext, which looks up keywords in the data provided by different dictionaries.
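To make the dictionary lookup idea concrete, here is a minimal, standard-library sketch of longest-match keyword extraction over a word-level trie. It is a simplified stand-in written for illustration, not flashtext's actual implementation or API; the names `build_trie` and `find_keywords` are hypothetical.

```python
import re

# Sentinel key marking the end of a complete keyword inside the trie.
# It is not a string, so it can never collide with a word from the text.
_END = object()


def build_trie(keywords):
    """Build a word-level trie from an iterable of keyword phrases."""
    trie = {}
    for keyword in keywords:
        node = trie
        for word in keyword.lower().split():
            node = node.setdefault(word, {})
        node[_END] = keyword
    return trie


def find_keywords(text, trie):
    """Return (keyword, start_word, end_word) for the longest match at each position."""
    words = re.findall(r"\w+", text.lower())
    matches = []
    i = 0
    while i < len(words):
        node = trie
        longest = None
        j = i
        # Walk the trie as far as the text allows, remembering the last full keyword.
        while j < len(words) and words[j] in node:
            node = node[words[j]]
            j += 1
            if _END in node:
                longest = (node[_END], i, j)
        if longest:
            matches.append(longest)
            i = longest[2]  # continue after the matched phrase
        else:
            i += 1
    return matches


trie = build_trie(["python", "product manager", "java platform"])
matches = find_keywords("I am a product manager for a java and python.", trie)
print(matches)
```

In this sketch "java" alone is not reported, because only the full phrase "java platform" is in the dictionary.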
spacy-lookup requires spacy v2.0.16 or higher.
First, you need to download a language model.
Import the component and initialise it with the shared nlp object (i.e. an instance of Language), which is used to initialise flashtext with the shared vocab, and create the match patterns. Then add the component anywhere in your pipeline.
import spacy
from spacy_lookup import Entity

# Load a model and register the dictionary-based entity component last.
nlp = spacy.load('en')
entity = Entity(keywords_list=['python', 'product manager', 'java platform'])
nlp.add_pipe(entity, last=True)

doc = nlp(u"I am a product manager for a java and python.")
assert doc._.has_entities
assert not doc[0]._.is_entity
# Matched spans are merged, so "product manager" is the single token doc[3].
assert doc[3]._.entity_desc == 'product manager'
assert doc[3]._.is_entity
print([(token.text, token._.canonical) for token in doc if token._.is_entity])
spacy-lookup only cares about the token text, so you can use it on a blank Language instance (it should work for all available languages!), or in a pipeline with a loaded model. If you're loading a model and your pipeline includes a tagger, parser and entity recognizer, make sure to add the entity component as last=True, so the spans are merged at the end of the pipeline.
The extension sets attributes on the Doc, Span and Token. You can change the attribute names on initialisation of the extension. For more details on custom components and attributes, see the processing pipelines documentation.
Token._.is_entity | | |
Token._.entity_type | | |
Doc._.has_entities | | |
Doc._.entities | | |
Span._.has_entities | | |
Span._.entities | | |
On initialisation of Entity, you can define the following settings:
nlp | Language | The shared nlp object. Used to initialise the matcher with the shared Vocab, and create Doc match patterns. |
attrs | tuple | Attributes to set on the ._ property. Defaults to ('has_entities', 'is_entity', 'entity_type', 'entity'). |
keywords_list | list | |
keywords_dict | dict | |
keywords_file | string | |
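The three keyword settings are not described above, so the following is only a sketch of how such sources could be combined into one lookup table. It assumes they follow flashtext's conventions: a plain list of keywords, a dict mapping a canonical name to a list of variants, and a file with one `keyword=>canonical name` entry per line. The helper `normalise_keywords` is hypothetical and not part of spacy-lookup.

```python
import os
import tempfile


def normalise_keywords(keywords_list=None, keywords_dict=None, keywords_file=None):
    """Merge the three keyword sources into one {variant: canonical name} mapping."""
    lookup = {}
    # A plain list: each keyword is its own canonical name.
    for keyword in keywords_list or []:
        lookup[keyword.lower()] = keyword
    # A dict (assumed layout): canonical name -> list of variants.
    for canonical, variants in (keywords_dict or {}).items():
        for variant in variants:
            lookup[variant.lower()] = canonical
    # A file (assumed layout): "variant=>canonical" or a bare keyword per line.
    if keywords_file:
        with open(keywords_file, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                variant, sep, canonical = line.partition("=>")
                variant = variant.strip()
                lookup[variant.lower()] = canonical.strip() if sep else variant
    return lookup


# Write a small example file so the sketch runs on its own.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("java_2e=>java\nproduct management=>product manager\nscala\n")

lookup = normalise_keywords(
    keywords_list=["python"],
    keywords_dict={"java": ["java platform", "jvm"]},
    keywords_file=tmp.name,
)
os.remove(tmp.name)
print(lookup)
```

A mapping of this shape would let several surface forms resolve to one canonical name, similar in spirit to the `token._.canonical` value printed in the usage example.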