The repo contains:
- code for creating an adjusted token-level version of the Szeged Uncertainty Corpus (Szarvas et al. 2012) [1];
- code for training and evaluating a CRF classifier, similar to the one trained by Szarvas et al. (2012).
For details, please refer to the wiki.
It is recommended to create a conda environment using the environment.yml file. This is done by running the command:
conda env create -f environment.yml
If you prefer to use pip, you can find the names and versions of the required packages in environment.yml.
The adjusted version of the Szeged Uncertainty Corpus is available for download as a pickled pandas DataFrame (szeged_fixed.pkl, 172 MB). For more information, refer to the 'Data' wiki page.
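As a minimal sketch, the downloaded file could be loaded with pandas as shown here. The local path and the helper name load_corpus are assumptions made for illustration; the DataFrame's columns are described on the 'Data' wiki page and are not assumed here.

```python
from pathlib import Path

import pandas as pd


def load_corpus(path="szeged_fixed.pkl"):
    """Load the pickled corpus DataFrame from a local path (assumed location)."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"Corpus file not found: {path}")
    # Unpickling can execute arbitrary code, so only load files from a trusted source.
    return pd.read_pickle(path)
```

After loading, inspecting corpus.shape and corpus.columns is a quick way to check the layout before relying on any particular column.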
NOTE: My HEDGEhog repository contains a transformer-based model that performs the same multi-class classification task with better performance. The CRF model in this repo was used as a baseline to evaluate HEDGEhog.
If you want to run the CRF model on your own data, use the predict.py script.
If you want to train your own CRF model, you can use the notebook train_multiclass_crf.ipynb as an example.
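For orientation, token-level CRF classifiers are typically trained on one feature dictionary per token. The sketch below only illustrates that input shape. The feature set, the function names and the example sentence are hypothetical and do not reproduce the features used in the notebook or by Szarvas et al. (2012).

```python
def token_features(tokens, i):
    """Build an illustrative (hypothetical) feature dict for the token at index i."""
    word = tokens[i]
    features = {
        "bias": 1.0,
        "lower": word.lower(),
        "suffix3": word[-3:],
        "is_title": word.istitle(),
        "is_digit": word.isdigit(),
    }
    # Context features: the neighbouring tokens, or sentence-boundary markers.
    if i > 0:
        features["prev_lower"] = tokens[i - 1].lower()
    else:
        features["BOS"] = True
    if i < len(tokens) - 1:
        features["next_lower"] = tokens[i + 1].lower()
    else:
        features["EOS"] = True
    return features


def sentence_features(tokens):
    """Return one feature dict per token of a tokenized sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]


# Example with a made-up sentence.
example = sentence_features(["This", "may", "suggest", "growth"])
```

A list of such per-sentence feature lists, paired with one label per token, is the input format accepted by CRF toolkits such as sklearn-crfsuite. Whether the notebook uses that library is an assumption to check against the notebook itself.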
[1] Szarvas, G., Vincze, V., Farkas, R., Móra, G., & Gurevych, I. (2012). Cross-genre and cross-domain detection of semantic uncertainty. Computational Linguistics, 38(2), 335-367.