Based on https://github.com/ufal/crac2022-corpipe
- Fetch data:

```sh
cd data
chmod +x get.sh
./get.sh
```
- If you wish to reduce the number of languages, either edit the `get.sh` file accordingly, or delete/move folders (see the sketch below).
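For example, a small helper to set aside unwanted corpora (a minimal sketch; the folder names and the `_unused` destination are illustrative, adapt them to your layout):

```python
import shutil
from pathlib import Path

# Keep only selected treebanks; move the rest into data/_unused.
# The selection below is just an example (the Germanic group).
keep = {"de_parcorfull", "de_potsdamcc", "en_gum", "en_parcorfull",
        "no_bokmaalnarc", "no_nynorsknarc"}
data = Path("data")
unused = data / "_unused"
unused.mkdir(exist_ok=True)
for folder in data.iterdir():
    if folder.is_dir() and folder != unused and folder.name not in keep:
        shutil.move(str(folder), str(unused / folder.name))
```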
- Convert all data to jsonlines (if needed):

```sh
cd data_handling
chmod +x corefud_convert.sh
./corefud_convert.sh
```
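After conversion, each corpus should be available as jsonlines (one JSON object per line). A minimal sketch of inspecting a converted file; the path and schema here are assumptions, since the exact filenames produced by `corefud_convert.sh` depend on your data folder:

```python
import json

# Hypothetical path; adjust to whatever corefud_convert.sh produced.
path = "data/no_bokmaalnarc/train.jsonl"

with open(path, encoding="utf-8") as f:
    for line in f:
        doc = json.loads(line)      # one document per line
        print(sorted(doc.keys()))   # inspect the converted fields
        break
```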
- Train the model in `src/models/simple-corpipe`. Example:

```sh
cd src/models/simple-corpipe
python train.py --langs germanic
# or with a specific language, e.g. no_bokmaalnarc:
python train.py --langs no_bokmaalnarc
```

This will run the "germanic" languages found in the following definitions:
```python
romance_langs = "ca_ancora es_ancora fr_democrat".split()
germanic_langs = "de_parcorfull de_potsdamcc en_gum en_parcorfull no_bokmaalnarc no_nynorsknarc".split()
slavic_baltic_langs = "cs_pcedt cs_pdt pl_pcc lt_lcc ru_rucor".split()
urgic_turkic_langs = "hu_korkor hu_szegedkoref tr_itcc".split()
langs_dict = {
    "romance": romance_langs,
    "germanic": germanic_langs,
    "slavic": slavic_baltic_langs,
    "urgic": urgic_turkic_langs,
    "all": langs,
}
```

Omitting any args will default to "all", which requires all languages in the `data` folder.
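A group name passed via `--langs` presumably resolves through `langs_dict` (where `langs`, the concatenation of all groups, is defined elsewhere in `train.py`). A minimal sketch of that lookup; the helper name is hypothetical, not part of the actual code:

```python
def resolve_langs(names: list[str]) -> list[str]:
    # Expand group names ("germanic", "romance", ...) via langs_dict;
    # anything else is assumed to be a single treebank id.
    resolved = []
    for name in names:
        resolved.extend(langs_dict.get(name, [name]))
    return resolved

resolve_langs(["germanic"])        # all six Germanic treebanks
resolve_langs(["no_bokmaalnarc"])  # just one treebank
```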
| Argument | Default | Type | Description |
|---|---|---|---|
| --langs | [] | List[str] | Languages or language groups to train on. |
| --batch_size | 16 | int | Batch size. |
| --bert | xlm-roberta-base | str | BERT model. |
| --debug | False | bool | Debug mode. |
| --epochs | 10 | int | Number of epochs. |
| --exp | run | str | Experiment name. |
| --label_smoothing | 0.0 | float | Label smoothing. |
| --learning_rate | 2e-5 | float | Learning rate. |
| --learning_rate_decay | False | bool | Decay the learning rate. |
| --max_links | None | int | Max antecedent links to train on. |
| --right | 50 | int | Reserved space for right context, if any. |
| --seed | 42 | int | Segment size reserved for right context. Random seed. |
| --segment | 512 | int | Segment size. |
| --train | [] | List[str] | Additional train data. |
| --warmup | 0.1 | float | Warmup ratio. |
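For reference, an `argparse` sketch mirroring the table above; this is an assumption about the interface of `train.py`, reconstructed from the table, not its actual implementation:

```python
import argparse

# Illustrative parser matching the argument table; defaults come from the table.
parser = argparse.ArgumentParser()
parser.add_argument("--langs", nargs="*", default=[], help="Languages to train on.")
parser.add_argument("--batch_size", type=int, default=16, help="Batch size.")
parser.add_argument("--bert", type=str, default="xlm-roberta-base", help="BERT model.")
parser.add_argument("--debug", action="store_true", help="Debug mode.")
parser.add_argument("--epochs", type=int, default=10, help="Number of epochs.")
parser.add_argument("--exp", type=str, default="run", help="Experiment name.")
parser.add_argument("--label_smoothing", type=float, default=0.0, help="Label smoothing.")
parser.add_argument("--learning_rate", type=float, default=2e-5, help="Learning rate.")
parser.add_argument("--learning_rate_decay", action="store_true", help="Decay the learning rate.")
parser.add_argument("--max_links", type=int, default=None, help="Max antecedent links to train on.")
parser.add_argument("--right", type=int, default=50, help="Reserved space for right context.")
parser.add_argument("--seed", type=int, default=42, help="Random seed.")
parser.add_argument("--segment", type=int, default=512, help="Segment size.")
parser.add_argument("--train", nargs="*", default=[], help="Additional train data.")
parser.add_argument("--warmup", type=float, default=0.1, help="Warmup ratio.")
args = parser.parse_args()
```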