HERCULES-EXTRACTION is a command line interface (CLI) program that extracts named entities from a text using different methods of translation, extraction, coreference resolution, and export. Some CLI samples are available here.
The components of this program can also be used directly in a Jupyter notebook, without going through the CLI. An example is available here.
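As a toy illustration of how the four component types (translation, extraction, coreference resolution, export) compose into a pipeline, the sketch below chains minimal stand-in functions. Every name in it is invented for this sketch and does not match the real HERCULES-EXTRACTION API; see the Components section for the actual classes.

```python
# Toy illustration of the HERCULES-EXTRACTION pipeline shape.
# All names below are invented stand-ins, not the real API.

def toy_translate(text, source="fr", target="en"):
    # A real translator would call e.g. googletrans; here we pass through.
    return text

def toy_extract(text):
    # A real extractor calls an NER API; here we grab capitalized words.
    return [w.strip(".,") for w in text.split() if w[:1].isupper()]

def toy_resolve(entities):
    # A real resolver bundles coreferent mentions; here we deduplicate.
    seen, bundled = set(), []
    for e in entities:
        if e.lower() not in seen:
            seen.add(e.lower())
            bundled.append(e)
    return bundled

def toy_export(entities):
    # A real exporter emits RDF; here we emit one line per entity.
    return "\n".join(f"<entity> {e}" for e in entities)

text = toy_translate("Marie Curie was born in Warsaw. Marie moved to Paris.")
entities = toy_resolve(toy_extract(text))
print(toy_export(entities))
```

The real components follow the same flow: optional translation into the extraction language, entity extraction, optional coreference resolution, then export.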
To set up the program, install the requirements by running:
pip install -r requirements.txt -f https://download.pytorch.org/whl/torch_stable.html
Different components have different requirements; see the Components section for additional details.
The CLI program can be used as shown here:
hercules-extraction.py [-h] [--file FILE] [--config CONFIG] [--out OUT] [text]
positional arguments:
text text to process (optional if `--file` is specified)
optional arguments:
-h, --help show this help message and exit
--file FILE text file to process (optional if `text` is specified)
--config CONFIG config file
--out OUT out file (exporting to stdout if not specified)
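The argument surface above can be sketched with Python's `argparse`; the option names mirror the help text, though the real script may build its parser differently.

```python
import argparse

# Sketch of the CLI surface described above. The flag names come from the
# help text; the actual hercules-extraction.py implementation may differ.
def build_parser():
    parser = argparse.ArgumentParser(prog="hercules-extraction.py")
    parser.add_argument("text", nargs="?",
                        help="text to process (optional if --file is specified)")
    parser.add_argument("--file", help="text file to process (optional if text is specified)")
    parser.add_argument("--config", help="config file")
    parser.add_argument("--out", help="out file (exporting to stdout if not specified)")
    return parser

args = build_parser().parse_args(["--config", "config.yml", "Some text to process"])
print(args.text, args.config, args.out)  # -> Some text to process config.yml None
```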
Note: the config file format is described here.
The configuration file selects the components that HERCULES-EXTRACTION will use. It must follow this pattern:
translation:
implementation: translator-implementation
language:
text: text-language
extraction: extraction-language
extraction:
implementation: extractor-implementation
coreference:
implementation: coreference-resolver-implementation
export:
implementation: exporter-implementation
language: exportation-language
namespaces:
entity: http://www.hercules.gouv.qc.ca/instances/entities/
ontology: http://www.cidoc-crm.org/cidoc-crm/
Where `translation` and `coreference` are optional.
All the implementation requirements and details can be found in the Components section.
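As an illustration of the required and optional sections, the check below validates a parsed configuration. The key names come from the template above; the validation code itself is a sketch, not part of HERCULES-EXTRACTION, and the dict stands in for the result of loading the YAML file (normally via `yaml.safe_load`).

```python
# Illustrative validation of the configuration structure described above.
# This dict mirrors the default configuration once parsed from YAML.
config = {
    "extraction": {"implementation": "google"},
    "export": {
        "implementation": "cidoccrm",
        "language": "turtle",
        "namespaces": {
            "entity": "http://www.hercules.gouv.qc.ca/instances/entities/",
            "ontology": "http://www.cidoc-crm.org/cidoc-crm/",
        },
    },
}

def validate(config):
    # translation and coreference are optional; extraction and export are not.
    for required in ("extraction", "export"):
        if required not in config:
            raise ValueError(f"missing required section: {required}")
        if "implementation" not in config[required]:
            raise ValueError(f"{required} needs an 'implementation' key")

validate(config)  # no exception raised: this config is structurally valid
```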
If no configuration file is specified, the default one used by the program is the following:
translation:
implementation: googletrans
language:
text: fr
extraction: en
extraction:
implementation: google
export:
implementation: cidoccrm
language: turtle
namespaces:
entity: http://www.hercules.gouv.qc.ca/instances/entities/
ontology: http://www.cidoc-crm.org/cidoc-crm/
There are four main component types in HERCULES-EXTRACTION:
This component type allows translating a text from the `text-language` to the `extraction-language` and translating it back.
Any component of this type must inherit from Translator.
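The inheritance contract might look like the sketch below; the constructor and method names are assumptions for illustration, not the real `Translator` signature.

```python
from abc import ABC, abstractmethod

# Hypothetical shape of the Translator contract (names assumed, not the
# actual HERCULES-EXTRACTION base class).
class Translator(ABC):
    def __init__(self, text_language, extraction_language):
        self.text_language = text_language
        self.extraction_language = extraction_language

    @abstractmethod
    def translate(self, text):
        """Translate from text_language to extraction_language."""

    @abstractmethod
    def translate_back(self, text):
        """Translate from extraction_language back to text_language."""

class IdentityTranslator(Translator):
    # Trivial subclass that returns its input unchanged; useful for testing
    # the pipeline without calling an external translation service.
    def translate(self, text):
        return text

    def translate_back(self, text):
        return text

t = IdentityTranslator("fr", "en")
print(t.translate("Bonjour"))  # -> Bonjour
```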
This translator uses the Azure Text API. To use this component, use the following configuration:
translation:
implementation: azure
This component requires setting the AZURE_TOKEN environment variable to an Azure Text API key.
This translator uses the Google Translation Cloud API. To use this component, use the following configuration:
translation:
implementation: google
This component requires setting the GOOGLE_APPLICATION_CREDENTIALS environment variable to a Google service account JSON keyfile.
This translator uses the Google Translation website. To use this component, use the following configuration:
translation:
implementation: googletrans
This translator uses the MyMemory API. To use this component, use the following configuration:
translation:
implementation: mymemory
This component requires setting the MYMEMORY_TOKEN environment variable to a MyMemory API key.
This component type allows extracting named entities from a text.
Any component of this type must inherit from EntityExtractor.
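As a toy stand-in (not the real `EntityExtractor` API, whose signature is an assumption here), a subclass could look like this:

```python
import re
from abc import ABC, abstractmethod

# Hypothetical EntityExtractor contract (the real base class may differ).
class EntityExtractor(ABC):
    @abstractmethod
    def extract(self, text):
        """Return a list of named entities found in text."""

class NaiveExtractor(EntityExtractor):
    # Toy implementation: runs of capitalized words count as entities.
    # Real implementations call an NER service (Dandelion, Google NL API).
    PATTERN = re.compile(r"\b(?:[A-Z][a-z]+)(?:\s[A-Z][a-z]+)*\b")

    def extract(self, text):
        return self.PATTERN.findall(text)

print(NaiveExtractor().extract("Ada Lovelace met Charles Babbage in London."))
# -> ['Ada Lovelace', 'Charles Babbage', 'London']
```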
This entity extractor uses the Dandelion API. To use this component, use the following configuration:
extraction:
implementation: dandelion
This component requires setting the DANDELION_TOKEN environment variable to a Dandelion API key.
This entity extractor uses the Google Natural Language API. To use this component, use the following configuration:
extraction:
implementation: google
This component requires setting the GOOGLE_APPLICATION_CREDENTIALS environment variable to a Google service account JSON keyfile.
Currently this component cannot be used with a Coreference Resolver.
This component type allows bundling coreferent named entity mentions in a text together.
Any component of this type must inherit from CoreferenceResolver.
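To show what "bundling" means, here is a toy heuristic resolver; the `CoreferenceResolver` method name is an assumption, and real resolvers use linguistic models such as Stanford CoreNLP's coref annotator rather than string matching.

```python
from abc import ABC, abstractmethod

# Hypothetical CoreferenceResolver contract (the real base class may differ).
class CoreferenceResolver(ABC):
    @abstractmethod
    def resolve(self, text, mentions):
        """Group mentions of the same entity into bundles."""

class NaiveResolver(CoreferenceResolver):
    # Toy heuristic: a mention corefers with an earlier one when either is
    # a substring of the other (e.g. "Curie" -> "Marie Curie"). This is
    # only an illustration of the bundling output shape.
    def resolve(self, text, mentions):
        bundles = []
        for m in mentions:
            for bundle in bundles:
                if any(m in other or other in m for other in bundle):
                    bundle.append(m)
                    break
            else:
                bundles.append([m])
        return bundles

print(NaiveResolver().resolve("", ["Marie Curie", "Curie", "Paris"]))
# -> [['Marie Curie', 'Curie'], ['Paris']]
```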
This coreference resolver uses a local instance of the Stanford CoreNLP server. To use this component, use the following configuration:
coreference:
implementation: stanford
Prerequisites:
- Java 1.8+
This component requires you to:
- Download Stanford CoreNLP and the models for the language you wish to use.
- Put the model jars in the distribution folder.
- Set the CORENLP_HOME environment variable to the path of the Stanford CoreNLP distribution. Example: CORENLP_HOME=/path/to/stanford-corenlp-full-2018-10-05.
If you want to run this coreference resolver multiple times, consider starting the server beforehand so that it does not have to be restarted at every step. To do so, use the following commands:
cd /path/to/stanford-corenlp-full-2018-10-05
java -mx16g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
To use an already started Stanford CoreNLP server, you must specify the following configuration for the coreference resolver in the config file:
coreference:
implementation: stanford
stanford:
start_server: False
endpoint: "server endpoint"
Where "server endpoint" could be, for example, "http://localhost:9000".
The program cannot start a server at runtime. Please follow "Use an already started Stanford CoreNLP server".
This component type allows exporting named entities in different RDF formats such as Turtle, XML, N3, etc.
Any component of this type must inherit from Exporter.
This exporter is specifically crafted to work with the CIDOC CRM ontology.
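To give a feel for the output, here is a hand-built sketch of the kind of Turtle an exporter using the default namespaces might emit. The triple structure is illustrative and not taken from the actual cidoccrm exporter; E21 Person and P1 is identified by are real CIDOC CRM terms, used here as example choices (strictly, P1's range is an appellation rather than a plain literal).

```python
# Illustrative Turtle serialization using the default namespaces from the
# configuration. The real cidoccrm exporter's triples may differ.
ENTITY_NS = "http://www.hercules.gouv.qc.ca/instances/entities/"
ONTOLOGY_NS = "http://www.cidoc-crm.org/cidoc-crm/"

def to_turtle(entities):
    lines = [
        f"@prefix entity: <{ENTITY_NS}> .",
        f"@prefix crm: <{ONTOLOGY_NS}> .",
        "",
    ]
    for name in entities:
        local = name.replace(" ", "_")
        lines.append(
            f'entity:{local} a crm:E21_Person ; '
            f'crm:P1_is_identified_by "{name}" .'
        )
    return "\n".join(lines)

print(to_turtle(["Marie Curie"]))
```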