This code accompanies the paper "Disambiguatory signals are stronger in word initial positions" published in EACL 2021.
To install dependencies run:
$ conda env create -f environment.yml
Then install the appropriate version of PyTorch (the commented command below installs the CPU-only build):
$ conda install -y pytorch torchvision cudatoolkit=10.1 -c pytorch
$ # conda install pytorch torchvision cpuonly -c pytorch
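If you are unsure which of the two PyTorch commands applies to your machine, a quick check (a small helper of ours, not part of this repo) is:

```python
import importlib.util

def torch_backend():
    # Reports which PyTorch build, if any, is usable on this machine:
    # "missing" if torch is not installed, otherwise "cuda" or "cpu".
    if importlib.util.find_spec("torch") is None:
        return "missing"
    import torch
    return "cuda" if torch.cuda.is_available() else "cpu"

print(torch_backend())
```

If it prints "cuda", use the first command; if "cpu" (or "missing" on a machine without a GPU), use the CPU-only one.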
CELEX data can be obtained at https://catalog.ldc.upenn.edu/LDC96L14/. You can process it with the command:
$ make LANGUAGE=<language> DATASET=celex
Languages: eng, deu, nld.
NorthEuraLex data already comes with this repo. To preprocess it, run:
$ make LANGUAGE=<language> DATASET=northeuralex
with any language in NorthEuraLex, e.g. por.
To get the tokenized Wikipedia data, use the code in the Wikipedia Tokenizer repository.
You can train your models using random search with the command:
$ make LANGUAGE=<language> DATASET=<dataset>
There are three datasets available in this repository: celex, northeuralex, and wiki.
To train the model on all languages from one of the datasets, run:
$ python src/h02_learn/train_all.py --dataset <dataset> --data-path data/<dataset>/
The model names used in this repository differ from those in the paper. They are: norm (Forward); rev (Backward); cloze (Cloze); unigram (Unigram); and position-nn (Position-specific).
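For intuition about what the forward (norm) and backward (rev) models measure, here is a toy sketch, not the repo's neural models, that computes per-position surprisal under a character bigram model fit left-to-right and right-to-left (the lexicon and function names are illustrative):

```python
import math
from collections import Counter

# Toy lexicon; "#" marks word boundaries. Purely illustrative data.
WORDS = ["cat", "car", "can", "bat"]

def make_surprisal_fn(words):
    # Fit a character bigram model and return a per-position surprisal function.
    pairs, singles = Counter(), Counter()
    for w in words:
        padded = "#" + w + "#"
        for a, b in zip(padded, padded[1:]):
            pairs[(a, b)] += 1
            singles[a] += 1
    def surprisal(word):
        padded = "#" + word + "#"
        # -log2 p(next | prev) per position; last entry is the end-of-word symbol.
        return [-math.log2(pairs[(a, b)] / singles[a])
                for a, b in zip(padded, padded[1:])]
    return surprisal

forward = make_surprisal_fn(WORDS)                      # analogous to norm
backward = make_surprisal_fn([w[::-1] for w in WORDS])  # analogous to rev

fwd = forward("cat")
bwd = backward("cat"[::-1])[::-1]  # re-aligned to left-to-right positions
print(fwd)
print(bwd)
```

In this toy lexicon the forward model's surprisal is concentrated early in the word (distinguishing cat/car/can), while the backward model's is concentrated late, which is the kind of asymmetry the paper studies.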
To produce the first-page plot (forward and backward surprisal plots), use the command:
$ make plot_first_page
To get and print the p-values used in the statistical significance tests, run:
$ make p_value MODEL=<model> DATASET=<dataset>
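As background, a paired permutation test over per-word scores can be sketched as follows (a generic sketch; the paper's exact test and this function name are not taken from the repo):

```python
import random

def paired_permutation_test(xs, ys, n_perm=10_000, seed=0):
    # Two-sided paired permutation test on the mean difference:
    # randomly flip the sign of each paired difference and count how often
    # the permuted mean is at least as extreme as the observed one.
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(xs, ys)]
    observed = abs(sum(diffs)) / len(diffs)
    hits = 0
    for _ in range(n_perm):
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total) / len(diffs) >= observed:
            hits += 1
    return hits / n_perm
```

Here xs and ys would be, e.g., per-word surprisals from two models being compared.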
The command to plot Figures 2 and 3 is:
$ make plot_bin
The plots will be created in the results/ folder. Finally, to print Tables 2 and 3, run:
$ make print_eow
$ make print_diffs
If this code or the paper was useful to you, consider citing it:
@inproceedings{pimentel-etal-2021-disambiguatory,
    title = "Disambiguatory signals are stronger in word initial positions",
    author = "Pimentel, Tiago and
      Cotterell, Ryan and
      Roark, Brian",
    booktitle = "Proceedings of the 16th Conference of the {E}uropean Chapter of the Association for Computational Linguistics: Volume 1, Long Papers",
    year = "2021",
    publisher = "Association for Computational Linguistics",
}
To ask questions or report problems, please open an issue.