Official implementation for our EMNLP 2023 paper *Seq2seq is All You Need For Coreference Resolution*.
Set up the environment:

```bash
conda create -n seq2seq_coref python==3.8
conda activate seq2seq_coref
pip install -r requirements.txt
```
Obtain the raw data:

- Follow http://conll.cemantix.org/2012/data.html and https://catalog.ldc.upenn.edu/LDC2013T19 to obtain the OntoNotes files `{train,dev,test}.english.v4_gold_conll`.
- Download the PreCo dataset from the official PreCo website or from a community copy.
- Follow https://github.com/dbamman/lrec2020-coref to obtain the 10-fold LitBank raw data.
Preprocess the raw data:

```bash
python ./preprocess_scripts/preprocess_data.py \
    --dataset_name [ontonotes, preco, litbank, corefpt] \
    --input_dir [your raw data directory for dataset_name] \
    --output_dir [your processed data directory for dataset_name] \
    --language english \
    --seg_lens 4096,2048 \
    --num_cross_val_splits [10 for litbank, 1 for others]
```
For partial linearization with sentence markers, run `preprocess_data_mark_sentence.py` instead, with the same arguments as above.
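For example, a concrete LitBank invocation might look like the following; the directory paths are illustrative placeholders:

```bash
# Example: preprocess the 10-fold LitBank splits
# (./data/litbank/{raw,processed} are placeholder paths)
python ./preprocess_scripts/preprocess_data.py \
    --dataset_name litbank \
    --input_dir ./data/litbank/raw \
    --output_dir ./data/litbank/processed \
    --language english \
    --seg_lens 4096,2048 \
    --num_cross_val_splits 10
```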
Train/evaluate the model on each dataset:

```bash
bash run_scripts/train.sh \
    [input data directory] \
    [model_name_or_path] \
    [model save directory] \
    [predict save directory] \
    [logging directory] \
    [seq2seq type: (action, short_seq, full_seq, tagging, input_feed)] \
    [action type: (integer, non_integer)] \
    [learning rate: (5e-4, 5e-5, 3e-5, 2e-5)] \
    [num epochs: (100, 10)] \
    [maximum target length: (4096, 2560, 6170)] \
    [minimum num mentions per cluster: (2, 1)] \
    [eval every n steps: (800, 3200, 100)] \
    [save every n steps: (800, 15200, 100)] \
    [log every n steps: (100, 10)] \
    [eval delay: eval after n steps (30000, 1500)] \
    [eval batch size: (1, 2)]
```
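For illustration, a copy-action run on OntoNotes with `bigscience/T0_3B` as the base model could look like the following sketch; the directory paths are placeholders, and the hyperparameter values are one plausible combination drawn from the ranges above (check `train.sh` for the recommended settings):

```bash
# Illustrative train.sh invocation (placeholder paths; values from the ranges above)
bash run_scripts/train.sh \
    ./data/ontonotes/processed \
    bigscience/T0_3B \
    ./checkpoints/ontonotes_action \
    ./predictions/ontonotes_action \
    ./logs/ontonotes_action \
    action \
    integer \
    5e-5 \
    100 \
    4096 \
    2 \
    800 \
    800 \
    100 \
    30000 \
    1
```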
Train/evaluate the joint model on the union of the three datasets:

```bash
bash run_scripts/train_joint.sh \
    [OntoNotes data directory] \
    [PreCo data directory] \
    [LitBank data directory] \
    [model_name_or_path] \
    [model save directory] \
    [predict save directory] \
    [logging directory] \
    [seq2seq type: (action, short_seq, full_seq, tagging, input_feed)] \
    [action type: (integer, non_integer)] \
    [learning rate: (5e-4, 5e-5, 3e-5, 2e-5)]
```
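An analogous sketch for joint training, again with placeholder directories:

```bash
# Illustrative train_joint.sh invocation (placeholder paths)
bash run_scripts/train_joint.sh \
    ./data/ontonotes/processed \
    ./data/preco/processed \
    ./data/litbank/processed \
    bigscience/T0_3B \
    ./checkpoints/joint_action \
    ./predictions/joint_action \
    ./logs/joint_action \
    action \
    integer \
    5e-5
```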
- Check `train.sh` and `train_joint.sh` for recommended hyperparameters and the meaning of the argument flags.
- If you want to train the partial linearization model with sentence markers, set `--mark_sentence True` in `train.sh` and `train_joint.sh`.
- If you want to run evaluation without training, disable training by setting `--do_train False` in `train.sh` and `train_joint.sh`, and provide the trained model checkpoint path as `[model_name_or_path]`.
We've uploaded model checkpoints to the Hugging Face Hub. You can download the following model checkpoints:
- T0-3b copy action full linearization model on OntoNotes checkpoint.
- T0-3b token action full linearization model on OntoNotes checkpoint.
- T0-3b partial linearization with sentence marker model on OntoNotes checkpoint.
- T0-3b integer-free copy action full linearization model on OntoNotes checkpoint.
- T0-3b integer-free add mention end copy action full linearization model on OntoNotes checkpoint.
- T0-3b copy action full linearization model jointly trained on the union of the OntoNotes, PreCo, and LitBank datasets checkpoint.
- T0pp copy action full linearization model on OntoNotes checkpoint.
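If these checkpoints follow the standard Hub repository layout, one way to download one is a git-LFS clone; the repository ID below is a hypothetical placeholder, so substitute the actual checkpoint linked above:

```bash
# Sketch: download a checkpoint from the Hugging Face Hub
# ("your-org/seq2seq-coref-t0-3b" is a hypothetical placeholder repo ID)
git lfs install
git clone https://huggingface.co/your-org/seq2seq-coref-t0-3b
```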