understand_xlm-r

Create Conda environment

Run the following commands:
conda env create --file xlm-r.yml
conda activate xlm-r

For PyTorch with CPU only, run:
pip install transformers[torch]

Otherwise, for GPU support, first install PyTorch with the desired CUDA version, then run:
pip install transformers
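
To verify the setup, a quick sanity check can be run in Python (illustrative, not part of the repository):

import torch
from transformers import AutoModel

print("CUDA available:", torch.cuda.is_available())
model = AutoModel.from_pretrained("xlm-roberta-base")  # downloads weights on first use
print("Layers:", model.config.num_hidden_layers)       # 12 for the base model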

Download Universal Dependencies datasets

  1. Download the zip file from https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3424 and unzip it into a folder called UD2.7.
  2. Run unzip_ud.py

Note: in our experiments, we assume that the UD folders are in the parent folder of the code folder.
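
For reference, the assumed layout is roughly as follows (the code folder name is illustrative):

parent_folder/
    UD2.7/              (unzipped Universal Dependencies treebanks)
    understand_xlm-r/   (code folder; scripts are run from here)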

Scripts for data preparation

The script xlm_roberta.py takes a PUD dataset file (or a file in the same format) and creates a data structure containing hidden states and attention weights for all tokens in all sentences in the file. This script generates a pkl file that is used as input by the POS classifier scripts (diagnostic_classifiers_pos.py, pos_statistics.py) and by the attention weights heatmap script (gen_heatmap.py).
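
A minimal sketch of this kind of extraction with the transformers API (illustrative; the actual script's interface and pkl schema may differ):

import pickle
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")
model.eval()

inputs = tokenizer("The quick brown fox jumps.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True, output_attentions=True)

# hidden_states: 13 tensors (embeddings + 12 layers), each (1, seq_len, 768);
# attentions: 12 tensors, each (1, 12 heads, seq_len, seq_len)
data = {
    "tokens": tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()),
    "hidden_states": [h.squeeze(0) for h in out.hidden_states],
    "attentions": [a.squeeze(0) for a in out.attentions],
}
with open("example.pkl", "wb") as f:
    pickle.dump(data, f)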

The script xlm_roberta_multilingual_dicts.py takes a file with bilingual lexicons (two columns of parallel words in two languages, one word per line, tab-separated, with the language names in the top row) and creates a data structure, saved as a pkl file, containing hidden states and attention weights for each token. The argument 'individual' controls whether each word is regarded as an individual sentence (for experiment 4 part 2) or all words in one language are regarded as a single sentence (for experiment 4 part 3).
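
For example, a bilingual lexicon file would look like this (tab-separated columns; the words shown are illustrative):

english	italian
dog	cane
house	casa
water	acqua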

Experiment 1 - POS tags

diagnostic_classifiers_pos.py trains the diagnostic classifiers for POS tag prediction for a specific language, for all 12 layers of the XLM-R base model. It takes as input a language and a type of POS tag (UPOS or XPOS), and must be run after xlm_roberta.py for the same language. The script creates 6 pkl files: ground truth, raw predictions, and accuracy, macro F1, micro F1, and averaged F1 scores by layer.
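
In essence, a diagnostic (probing) classifier is a simple classifier trained on frozen hidden states. A minimal per-layer sketch with placeholder data (the real script trains on token vectors loaded from the pkl file):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_tokens, hidden_size, n_layers, n_tags = 500, 768, 12, 17
X_layers = rng.standard_normal((n_layers, n_tokens, hidden_size))  # placeholder states
y = rng.integers(0, n_tags, size=n_tokens)                         # placeholder UPOS ids

for layer, X in enumerate(X_layers):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    print(f"layer {layer}: acc={accuracy_score(y_te, pred):.3f} "
          f"macro-F1={f1_score(y_te, pred, average='macro'):.3f}")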

pos_statistics.py generates pkl files with POS tag counts for the languages of interest.

gen_pos_tables.py takes the predictions and tag counts generated by the two scripts above, and creates a csv file with F1 scores per POS tag, as well as POS tag counts.

gen_pos_plots.py plots the F1 scores by layer for the languages of interest.

Experiments 2 and 3 - attention weight heatmaps

gen_heatmap.py generates heatmaps of attention weights for a specific word, given a sentence number and a word number. It must be run after xlm_roberta.py.
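
The underlying plot is essentially one row of an attention matrix. A minimal sketch, assuming the pkl schema from the extraction sketch above (layer, head, and word index are illustrative):

import pickle
import matplotlib.pyplot as plt

with open("example.pkl", "rb") as f:
    data = pickle.load(f)

layer, head, word_idx = 5, 3, 2
tokens = data["tokens"]
row = data["attentions"][layer][head][word_idx].numpy()  # attention from one word

plt.imshow(row[None, :], aspect="auto", cmap="viridis")
plt.xticks(range(len(tokens)), tokens, rotation=90)
plt.yticks([0], [tokens[word_idx]])
plt.colorbar(label="attention weight")
plt.tight_layout()
plt.show()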

For these experiments, we created files with modified sentences: en_it_pud-ud-test.conllu and eh_pud-ud-test.conllu. For compatibility with our code, we put these files in the English-PUD folder of the UD treebank folder, but we also include them here.

Experiment 4 - word alignments

gen_tsne.py creates t-SNE plots for parts 2 and 3. It must be run after xlm_roberta_multilingual_dicts.py.

gen_tsne_from_sentences.py creates t-SNE plots for part 1 and must be run after xlm_roberta.py, as it uses the same pkl file as the other experiments. It assumes the first sentence from the pkl file is used. It takes as input two languages, as well as pairs of indices into the same sentence in the two languages; these pairs of indices identify equivalent words in the two translated sentences.
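
Conceptually, the plot projects paired token vectors into 2D and connects aligned words. A minimal sketch with placeholder vectors (the real script loads token hidden states from the pkl file):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
vecs_a = rng.standard_normal((20, 768))  # placeholder vectors, language A
vecs_b = rng.standard_normal((20, 768))  # placeholder vectors, language B

emb = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(
    np.vstack([vecs_a, vecs_b]))

n = len(vecs_a)
plt.scatter(emb[:n, 0], emb[:n, 1], label="language A")
plt.scatter(emb[n:, 0], emb[n:, 1], label="language B")
for i in range(n):  # connect each pair of aligned words
    plt.plot([emb[i, 0], emb[n + i, 0]], [emb[i, 1], emb[n + i, 1]],
             color="gray", linewidth=0.5)
plt.legend()
plt.show()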