In natural language processing, algorithms often require additional linguistic features (syntactic and semantic), such as part-of-speech, named entity, and dependency tags; information that is not readily available in most datasets. StAn provides a convenient way to quickly annotate an existing dataset with additional linguistic features computed by Stanford CoreNLP.
StAn either uses a local CoreNLP installation or an exisiting CoreNLP Server. To use a local installation, download and unpack the latest version from the Stanford CoreNLP website.
TBD
Clone the repository and run:
pip install [--editable] .
For example, the following command annotates the SemEval 2010 Task 8 relation extraction dataset with POS, NER, and dependency information and saves it in JSONL format.
stan \
--input-dir $INPUT_PATH/SemEval2010_task8_all_data/ \
--output-dir $OUTPUT_PATH/ \
--corenlp $PATH_TO_CORENLP_JAR_OR_SERVER_URL \
--input-format semeval2010task8 \
--output-format jsonl \
--shuffle \
--validation-size 0.1 \
--n-jobs 4
input-dir
: the directory containing the dataset or dataset files. StAn expects a specific structure for common datasets (e.g. SemEval 2010 Task 8). The format of the input is specified byinput-format
.output-dir
: the directory to store the annotated dataset. The format in which to save the dataset is specified byoutput-format
.corenlp
: the path to the directory containing the CoreNLP jar file or a url pointing to an exisiting CoreNLP server.input-format
: the format of the input dataset, can be one of "semeval2010task8", "json" or "jsonl".output-format
: the format of the output dataset, can be one of "tacred", "json", "jsonl".shuffle
: whether to shuffle the training dataset before splitting into train and validation (only if validation size > 0).validation-size
: if > 0, use avalidation-size
fraction of the training dataset for validation.n-jobs
: the number of threads to use for concurrent requests to CoreNLP.
Explain how to run the automated tests for this system
pytest -v tests/
mypy stan --ignore-missing-imports
- Stanford CoreNLP
- stanford-corenlp - Python wrapper for Stanford CoreNLP
- Christoph Alt
See also the list of contributors who participated in this project.
This project is licensed under the MIT License - see the LICENSE file for details