Primary language: Python. License: BSD 2-Clause "Simplified" (BSD-2-Clause).

scrapy-corenlp

A Scrapy spider middleware that performs Named Entity Recognition (NER) on scraped item fields with Stanford CoreNLP.

Settings

| Option | Value | Example value |
|---|---|---|
| STANFORD_NER_ENABLED | Boolean | True |
| STANFORD_NER_CLASSIFIER | Absolute path to CRFClassifier | '/home/jithesh/stanford-ner-2015-12-09/classifiers/english.muc.7class.distsim.crf.ser.gz' |
| STANFORD_NER_JAR | Absolute path to stanford-ner.jar file | '/home/jithesh/stanford-ner-2015-12-09/stanford-ner.jar' |
| STANFORD_NER_FIELD_TO_PROCESS | A field or list of Item text fields to use for classification | ['title', 'description'] |
| STANFORD_NER_FIELD_OUTPUT | Scrapy item field to update the result with | 'result' |

In your settings.py file, add the settings described above and add the CoreNLP middleware to SPIDER_MIDDLEWARES, for example:

SPIDER_MIDDLEWARES = {
    'scrapy_corenlp.middlewares.CoreNLP': 543,
}
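
Putting the options and the middleware entry together, a complete settings.py fragment might look like the sketch below. The paths are the example values from the Settings table and will differ on your machine; the field names 'title' and 'description' assume your items define those fields.

```python
# settings.py (sketch; adjust the paths to your Stanford NER installation)

# Turn the NER middleware on.
STANFORD_NER_ENABLED = True

# Absolute paths to the classifier model and the Stanford NER jar.
STANFORD_NER_CLASSIFIER = '/home/jithesh/stanford-ner-2015-12-09/classifiers/english.muc.7class.distsim.crf.ser.gz'
STANFORD_NER_JAR = '/home/jithesh/stanford-ner-2015-12-09/stanford-ner.jar'

# Item fields whose text is classified, and the field that receives the result.
STANFORD_NER_FIELD_TO_PROCESS = ['title', 'description']
STANFORD_NER_FIELD_OUTPUT = 'result'

# Register the middleware; 543 is the order value used in the example above.
SPIDER_MIDDLEWARES = {
    'scrapy_corenlp.middlewares.CoreNLP': 543,
}
```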

With STANFORD_NER_FIELD_OUTPUT set to 'result', an example of the output field after entity recognition is:

{"result": {
    "DATE": ["1963", "2009", "1979", "1663", "1982"],
    "ORGANIZATION": ["Royal Society", "US National Academy of Science", "University of California", "Home Home About Stephen The Computer Stephen", "the University of Cambridge", "Sally Tsui Wong-Avery Director of Research", "Theoretical Physics", "Leiden University", "Baby Universe", "Department of Applied Mathematics", "Cambridge Lectures Publications Books Images Films", "Briefer History of Time", "ESA", "NASA", "Brief History of Time", "CBE", "Caius College", "The Universe"],
    "PERSON": ["P. Oesch", "Einstein", "D. Magee", "Stephen Hawking", "George", "Annie", "Isaac Newton", "G. Illingworth", "Dennis Stanton Avery", "R. Bouwens"],
    "LOCATION": ["London", "Santa Cruz", "Einstein", "Cambridge", "Gonville"]
}}
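
The output maps each entity label to a list of strings, and the same string can appear under more than one label (in the example, "Einstein" is listed as both PERSON and LOCATION). The sketch below shows one way downstream code, such as an item pipeline, might read the field and flag such overlaps. The helper names `entities_for` and `ambiguous_entities` are hypothetical and not part of scrapy-corenlp; the code only assumes the item behaves like a dict and that the output field is named 'result'.

```python
from collections import defaultdict


def entities_for(item, label, output_field='result'):
    """Return the entities recognised under `label`, or an empty list."""
    return list(item.get(output_field, {}).get(label, []))


def ambiguous_entities(item, output_field='result'):
    """Map each entity listed under more than one label to its sorted labels."""
    labels_by_entity = defaultdict(set)
    for label, entities in item.get(output_field, {}).items():
        for entity in entities:
            labels_by_entity[entity].add(label)
    return {
        entity: sorted(labels)
        for entity, labels in labels_by_entity.items()
        if len(labels) > 1
    }


# Abridged from the example output above.
item = {
    'result': {
        'DATE': ['1963', '2009'],
        'ORGANIZATION': ['Royal Society', 'NASA'],
        'PERSON': ['Einstein', 'Stephen Hawking', 'Isaac Newton'],
        'LOCATION': ['London', 'Einstein', 'Cambridge'],
    }
}

print(entities_for(item, 'PERSON'))   # ['Einstein', 'Stephen Hawking', 'Isaac Newton']
print(ambiguous_entities(item))       # {'Einstein': ['LOCATION', 'PERSON']}
```

A check like this can be a starting point for post-filtering, since classifier output on scraped page text tends to include navigation fragments and mislabelled names.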