An implementation of the Stanford Multi-Pass Sieve Coreference System.
Get the repository and install the required packages:
$ git clone https://github.com/andreasvc/dutchcoref.git
$ cd dutchcoref
$ pip3 install -r requirements.txt
Unless you are working on an already parsed corpus, you will want to install the Alpino parser.
To get parse tree visualizations in the HTML output, install https://github.com/andreasvc/disco-dop/
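For example, one way to install disco-dop from source (a sketch, not authoritative instructions; building it requires a C compiler and Cython, see the disco-dop README):
$ git clone https://github.com/andreasvc/disco-dop.git
$ cd disco-dop && pip3 install .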
Clone the groref repository under the same parent folder as this repository:
~/code/dutchcoref $ cd ..
~/code $ git clone https://bitbucket.org/robvanderg/groref.git
Download the data from http://www.lsi.upc.edu/~esapena/downloads/index.php?id=1
Apply the Unicode fix in fixsemeval2010.sh.
The directory data/semeval2010NLdevparses
contains Alpino parses for the
Dutch development set of this task.
See coref.py --help
for command line options.
For best results, use the Alpino tokenizer with numbered paragraphs.
Preprocess text such that two linebreaks \n\n
indicate a paragraph break:
$ cat example.txt
'Ik ben de directeur van Fecalo, van hierachter,' zei hij. 'Mag
ik u iets vragen?'
Ik vroeg hem binnen te komen.
Tokenize:
$ $ALPINO_HOME/Tokenization/paragraph_per_line example.txt \
| $ALPINO_HOME/Tokenization/add_key | $ALPINO_HOME/Tokenization/tokenize.sh \
| $ALPINO_HOME/Tokenization/number_sents >example.tok
$ cat example.tok
1-1|' Ik ben de directeur van Fecalo , van hierachter , ' zei hij .
1-2|' Mag ik u iets vragen ? '
2-1|Ik vroeg hem binnen te komen .
Parse and perform coreference resolution:
$ mkdir example
$ cat example.tok | Alpino number_analyses=1 end_hook=xml -flag treebank example -parse
[...]
$ python3 coref.py --fmt=booknlp example/
#begin document (example);
example 1-1 0 ' ' LET() 5 punct - 14 - B -
example 1-1 1 Ik ik VNW(pers,pron,nomin,vol,1,ev) 5 nsubj - 14 - I (0)
example 1-1 2 ben zijn WW(pv,tgw,ev) 5 cop - 14 - I -
example 1-1 3 de de LID(bep,stan,rest) 5 det - 14 - I (0
example 1-1 4 directeur directeur N(soort,ev,basis,zijd,stan) 0 root - 14 - I 0
example 1-1 5 van van VZ(init) 7 case - 14 - I 0
example 1-1 6 Fecalo Fecalo N(eigen,ev,basis,zijd,stan) 5 nmod ORG 14 - I 0)|(1)
example 1-1 7 , , LET() 5 punct - 14 - I -
example 1-1 8 van van VZ(init) 10 case - 14 - I -
example 1-1 9 hierachter hierachter BW() 5 nmod - 14 - I -
example 1-1 10 , , LET() 5 punct - 14 - I -
example 1-1 11 ' ' LET() 5 punct - 14 - I -
example 1-1 12 zei zeggen WW(pv,verl,ev) 5 parataxis - - - O -
example 1-1 13 hij hij VNW(pers,pron,nomin,vol,3,ev,masc) 13 nsubj - - - O (0)
example 1-1 14 . . LET() 5 punct - - - O -
example 1-2 0 ' ' LET() 6 punct - 14 - B -
example 1-2 1 Mag mogen WW(pv,tgw,ev) 6 aux - 14 - I -
example 1-2 2 ik ik VNW(pers,pron,nomin,vol,1,ev) 6 nsubj - 14 - I (0)
example 1-2 3 u u VNW(pers,pron,nomin,vol,2b,getal) 6 iobj - 14 - I (5)
example 1-2 4 iets iets VNW(onbep,pron,stan,vol,3o,ev) 6 obj - 14 - I -
example 1-2 5 vragen vragen WW(inf,vrij,zonder) 0 root - 14 - I -
example 1-2 6 ? ? LET() 6 punct - 14 - I -
example 1-2 7 ' ' LET() 6 punct - 14 - I -
example 2-1 0 Ik ik VNW(pers,pron,nomin,vol,1,ev) 2 nsubj - - - O (6)
example 2-1 1 vroeg vragen WW(pv,verl,ev) 0 root - - - O -
example 2-1 2 hem hem VNW(pers,pron,obl,vol,3,ev,masc) 2 iobj - - - O (0)
example 2-1 3 binnen binnen VZ(fin) 6 compound:prt - - - O -
example 2-1 4 te te VZ(init) 6 mark - - - O -
example 2-1 5 komen binnen_komen WW(inf,vrij,zonder) 2 xcomp - - - O -
example 2-1 6 . . LET() 2 punct - - - O -
#end document
For debugging purposes, enable verbose output with the --verbose option.
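For example, using the example/ directory of parses created above:
$ python3 coref.py --verbose example/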
The base system is purely rule-based, but there are optional neural modules that can be enabled. The modules are:
mentionspanclassifier.py
mentionfeatureclassifier.py
pronounresolution.py
qaclassifier.py
These modules can be trained (run the above scripts without arguments to get help), or you can use the trained models made available on the releases tab. The modules need to be enabled from the command line:
$ pip3 install -r requirements-neural.txt
$ wget https://github.com/andreasvc/dutchcoref/releases/download/v0.1/models.zip
$ unzip models.zip
$ python3 coref.py --neural=span,feat,pron mydocument/ >output.conll
The quote attribution classifier is enabled with --neural=quote
and the models are available at https://github.com/frenkvdberg/dutchqa
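For example, to enable it together with the other neural modules (a sketch; it assumes the dutchqa models have been downloaded and unpacked like the models above, and that quote can be combined with the other values in the comma-separated list):
$ python3 coref.py --neural=span,feat,pron,quote mydocument/ >output.conll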
The web demo accepts short pieces of text, takes care of parsing, and presents
a visualization of coreference results. Requires a running instance of
alpiner.
Run it with: python3 web.py
Get the evaluation tool: https://github.com/ns-moosavi/coval
$ python3 coref.py mydocument/ >output.conll
$ python3 ../coval/scorer.py mydocument.conll output.conll
recall precision F1
mentions 90.52 81.43 85.73
muc 79.44 74.43 76.85
bcub 51.72 55.65 53.61
ceafe 66.64 46.58 54.83
lea 49.48 52.74 51.05
CoNLL score: 61.77
IMPORTANT: by default the output will follow the dutchcoref annotation guidelines. To get output following the Corea/SoNaR annotation guidelines:
$ python3 coref.py --excludelinks=reflexives --exclude=relpronouns,relpronounsplit mydocument/ >output.conll
UPDATE: use https://github.com/andreasvc/berkeley-coreference-analyser
The following creates lots of output; scroll to the end for the error analysis. Mention boundaries and links are printed in green if they are correct, yellow if in gold but missing from output, and red if in output but not in gold.
$ ls mydocument/
1.xml 2.xml [...]
$ python3 coref.py mydocument/ --gold=mydocument.conll --verbose | less -R
Alternatively, use the HTML visualization and view the results in your favorite browser:
$ python3 coref.py mydocument/ --gold=mydocument.conll --verbose --fmt=html >output.html
See https://andreasvc.github.io/voskuil.html for an example of the HTML visualization.
By default, output is written to standard out in CoNLL2012 format.
With --fmt=booknlp
the output contains the following columns:
- Document label
- Sentence ID
- Token number within sentence
- Token
- Lemma
- Rich POS tag (including morphological features)
- UD parent token (ID as in column 3)
- UD dependency label
- Named entity class (PER, ORG, LOC, ...)
- Speaker ID (if a speaker is found, every token in a direct speech utterance is assigned the speaker ID; the ID is the cluster ID of the speaker)
- Same as above, but for the addressee
- Whether token is part of direct speech (B, I) or not (O)
- Coreference cluster in CoNLL notation
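To illustrate the format, here is a minimal Python sketch (not part of this repository) that reads such output and groups the tokens of direct speech by their speaker cluster ID; it assumes whitespace-separated columns in exactly the order listed above, and output.booknlp is a hypothetical file produced with --fmt=booknlp:

from collections import defaultdict

COLUMNS = ['doc', 'sent', 'tokno', 'token', 'lemma', 'pos', 'head',
           'deprel', 'ne', 'speaker', 'addressee', 'speech', 'coref']

def read_booknlp(path):
    # Yield one dict per token line, skipping the #begin/#end document markers.
    with open(path, encoding='utf8') as inp:
        for line in inp:
            if line.strip() and not line.startswith('#'):
                yield dict(zip(COLUMNS, line.split()))

def speech_by_speaker(path):
    # Collect direct speech tokens (speech column B or I) per speaker cluster ID.
    result = defaultdict(list)
    for token in read_booknlp(path):
        if token['speech'] in ('B', 'I'):
            result[token['speaker']].append(token['token'])
    return result

if __name__ == '__main__':
    for speaker, tokens in speech_by_speaker('output.booknlp').items():
        print(speaker, ' '.join(tokens))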
For the UD conversion, you need alud.
Make sure to set a $GOPATH
and add $GOPATH/bin
to your $PATH
, e.g.:
export GOPATH=$HOME/.local/go
export PATH=$GOPATH/bin/:$HOME/.local/bin:$PATH
Use the option --outputprefix
to dump information
on clusters, mentions, links and quotations:
$ python3 coref.py mydocument/ --fmt=booknlp --outputprefix=output
This creates the files output.{mentions,clusters,links,quotes}.tsv
(tabular format),
output.conll
(format specified by --fmt
), and output.icarus
(ICARUS allocation format).
Make sure you don't overwrite the gold standard conll file!
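As a further illustration, here is a minimal Python sketch (not part of this repository) that peeks at the tabular files written by --outputprefix; it only assumes they are tab-separated and does not rely on any particular column layout:

import csv

# Hypothetical prefix "output", matching the --outputprefix example above.
for name in ('mentions', 'clusters', 'links', 'quotes'):
    path = 'output.%s.tsv' % name
    with open(path, encoding='utf8') as inp:
        rows = list(csv.reader(inp, delimiter='\t'))
    # Report the file name, the number of rows, and the first row
    # (presumably a header), without interpreting the columns.
    print(path, len(rows), 'rows; first row:', rows[0] if rows else '(empty)')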
- Preprocess, tokenize and parse a text with Alpino to get a directory of parse trees in XML files.
- Run coreference resolution on the parse trees:
python3 coref.py path/to/parses/text/ > text.conll
(Forward slashes are required, also on Windows.)
- Get the latest stable release of CorefAnnotator. Run it with e.g.,
java -jar CorefAnnotator-1.9.2-full.jar
- Import the .conll file (CoNLL 2012 button under "Import from other formats").
- Read the annotation guidelines in this repository.
- Correct the annotation; save regularly (in the .xmi format used by CorefAnnotator)
- When done, export to CoNLL 2012 format
- The CoNLL 2012 file exported by CorefAnnotator does not contain POS tags and parse trees;
to add those, run
addparsebits.py alpino text.conll path/to/parses/text/
If you use this code for research, please cite the following paper:
@article{vancranenburgh2019coref,
author={van Cranenburgh, Andreas},
title={A {Dutch} coreference resolution system with an evaluation on literary fiction},
journal={Computational Linguistics in the Netherlands Journal},
year={2019}, volume={9}, pages={27--54},
url={https://clinjournal.org/clinj/article/view/91},
}
If you use the neural modules, cite this paper:
@inproceedings{vancranenburgh2021hybrid,
author={van Cranenburgh, Andreas and Ploeger, Esther and van den Berg, Frank and Th{\"u}ss, Remi},
title={A Hybrid Rule-Based and Neural Coreference Resolution System with an Evaluation on {D}utch Literature},
year={2021}, booktitle={Proceedings of CRAC}, pages={47--56},
url={https://aclanthology.org/2021.crac-1.5},
}
This code base is a Dutch implementation of the Stanford Multi-Pass Sieve Coreference System for English:
Heeyoung Lee, Angel Chang, Yves Peirsman, Nathanael Chambers, Mihai Surdeanu, and Dan Jurafsky. Deterministic coreference resolution based on entity-centric, precision-ranked rules. Computational Linguistics, 39 (4):885–916, 2013. http://aclweb.org/anthology/J13-4004.pdf
See also these previous implementations https://bitbucket.org/robvanderg/groref and https://github.com/antske/coref_draft
The number & gender data is derived from:
Shane Bergsma and Dekang Lin (2006). Bootstrapping Path-Based Pronoun Resolution, In Proceedings of COLING/ACL. https://www.aclweb.org/anthology/P11-1079 Data: https://cemantix.org/conll/2012/data.html
The Dutch first names dataset Top_eerste_voornamen_NL_2010.csv
is based on:
De Nederlandse Voornamenbank (The Dutch First name bank) by Meertens Instituut KNAW. http://www.meertens.knaw.nl/nvb
More recent versions of this dataset have been made available, but to ensure reproducibility, the 2010 version is kept in this repository.