Based on https://github.com/ufal/crac2022-corpipe
- Fetch data:

```sh
cd data
chmod +x get.sh
./get.sh
```
- If you wish to reduce the number of languages, either edit the `get.sh` file accordingly, or delete/move folders (see the sketch below).
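For example, a small helper to set aside unwanted corpora (a minimal sketch; the folder names and the `_unused` destination are illustrative, adapt them to your layout):

```python
import shutil
from pathlib import Path

# Keep only selected treebanks; move the rest into data/_unused.
# The selection below is just an example (the Germanic group).
keep = {"de_parcorfull", "de_potsdamcc", "en_gum", "en_parcorfull",
        "no_bokmaalnarc", "no_nynorsknarc"}
data = Path("data")
unused = data / "_unused"
unused.mkdir(exist_ok=True)
for folder in data.iterdir():
    if folder.is_dir() and folder != unused and folder.name not in keep:
        shutil.move(str(folder), str(unused / folder.name))
```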
- Convert all data to jsonlines (if needed):

```sh
cd data_handling
chmod +x corefud_convert.sh
./corefud_convert.sh
```
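After conversion, each corpus should be available as jsonlines (one JSON object per line). A minimal sketch of inspecting a converted file; the path and schema here are assumptions, since the exact filenames produced by `corefud_convert.sh` depend on your data folder:

```python
import json

# Hypothetical path; adjust to whatever corefud_convert.sh produced.
path = "data/no_bokmaalnarc/train.jsonl"

with open(path, encoding="utf-8") as f:
    for line in f:
        doc = json.loads(line)      # one document per line
        print(sorted(doc.keys()))   # inspect the converted fields
        break
```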
- Train the model in `src/models/simple-corpipe`. Example:

```sh
cd src/models/simple-corpipe
python train.py --langs germanic
# or with a specific language, e.g. no_bokmaalnarc:
python train.py --langs no_bokmaalnarc
```

This will run the "germanic" languages found in the following definitions:
```python
romance_langs = "ca_ancora es_ancora fr_democrat".split()
germanic_langs = "de_parcorfull de_potsdamcc en_gum en_parcorfull no_bokmaalnarc no_nynorsknarc".split()
slavic_baltic_langs = "cs_pcedt cs_pdt pl_pcc lt_lcc ru_rucor".split()
urgic_turkic_langs = "hu_korkor hu_szegedkoref tr_itcc".split()
langs_dict = {
    "romance": romance_langs,
    "germanic": germanic_langs,
    "slavic": slavic_baltic_langs,
    "urgic": urgic_turkic_langs,
    "all": langs,
}
```

Omitting any args will default to "all", which requires all languages in the `data` folder.
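A group name passed via `--langs` presumably resolves through `langs_dict` (where `langs`, the concatenation of all groups, is defined elsewhere in `train.py`). A minimal sketch of that lookup; the helper name is hypothetical, not part of the actual code:

```python
def resolve_langs(names: list[str]) -> list[str]:
    # Expand group names ("germanic", "romance", ...) via langs_dict;
    # anything else is assumed to be a single treebank id.
    resolved = []
    for name in names:
        resolved.extend(langs_dict.get(name, [name]))
    return resolved

resolve_langs(["germanic"])        # all six Germanic treebanks
resolve_langs(["no_bokmaalnarc"])  # just one treebank
```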
| Argument | Default | Type | Description |
|---|---|---|---|
| --langs | [] | List[str] | Languages or language groups to train on. |
| --batch_size | 16 | int | Batch size. |
| --bert | xlm-roberta-base | str | BERT model. |
| --debug | False | bool | Debug mode. |
| --epochs | 10 | int | Number of epochs. |
| --exp | run | str | Experiment name. |
| --label_smoothing | 0.0 | float | Label smoothing. |
| --learning_rate | 2e-5 | float | Learning rate. |
| --learning_rate_decay | False | bool | Decay the learning rate. |
| --max_links | None | int | Max antecedent links to train on. |
| --right | 50 | int | Reserved space for right context, if any. |
| --seed | 42 | int | Segment size reserved for right context. Random seed. |
| --segment | 512 | int | Segment size. |
| --train | [] | List[str] | Additional train data. |
| --warmup | 0.1 | float | Warmup ratio. |
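For reference, an `argparse` sketch mirroring the table above; this is an assumption about the interface of `train.py`, reconstructed from the table, not its actual implementation:

```python
import argparse

# Illustrative parser matching the argument table; defaults come from the table.
parser = argparse.ArgumentParser()
parser.add_argument("--langs", nargs="*", default=[], help="Languages to train on.")
parser.add_argument("--batch_size", type=int, default=16, help="Batch size.")
parser.add_argument("--bert", type=str, default="xlm-roberta-base", help="BERT model.")
parser.add_argument("--debug", action="store_true", help="Debug mode.")
parser.add_argument("--epochs", type=int, default=10, help="Number of epochs.")
parser.add_argument("--exp", type=str, default="run", help="Experiment name.")
parser.add_argument("--label_smoothing", type=float, default=0.0, help="Label smoothing.")
parser.add_argument("--learning_rate", type=float, default=2e-5, help="Learning rate.")
parser.add_argument("--learning_rate_decay", action="store_true", help="Decay the learning rate.")
parser.add_argument("--max_links", type=int, default=None, help="Max antecedent links to train on.")
parser.add_argument("--right", type=int, default=50, help="Reserved space for right context.")
parser.add_argument("--seed", type=int, default=42, help="Random seed.")
parser.add_argument("--segment", type=int, default=512, help="Segment size.")
parser.add_argument("--train", nargs="*", default=[], help="Additional train data.")
parser.add_argument("--warmup", type=float, default=0.1, help="Warmup ratio.")
args = parser.parse_args()
```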