Archived as the course has concluded.
This repository contains the code for an NLP project focussing on morphology, specifically plural inflection in German and Turkish. The project was made as part of a course at the University of Groningen.
We use a ByT5 character-level model taken from Hugging Face to gauge its inflectional capabilities.
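
As background for the character-level analyses: ByT5 works on UTF-8 bytes rather than on a learned subword vocabulary, so non-ASCII letters such as German ä or Turkish ö take up more than one input position. The sketch below reproduces that mapping in plain Python without loading the Hugging Face tokenizer. The function names `byt5_input_ids` and `bytes_per_character` are hypothetical and not part of this repository; the offset of 3 reflects the three special tokens (pad, end-of-sequence, unknown) that precede the byte values in the ByT5 vocabulary.

```python
# ByT5 reserves ids 0, 1 and 2 for <pad>, </s> and <unk>;
# the 256 possible byte values follow directly after them.
SPECIAL_TOKEN_OFFSET = 3
EOS_ID = 1


def byt5_input_ids(text, add_eos=True):
    """Map a string to ByT5-style input ids (UTF-8 byte value + 3)."""
    ids = [byte + SPECIAL_TOKEN_OFFSET for byte in text.encode("utf-8")]
    if add_eos:
        ids.append(EOS_ID)
    return ids


def bytes_per_character(text):
    """Pair each character with the number of UTF-8 bytes it occupies."""
    return [(char, len(char.encode("utf-8"))) for char in text]


if __name__ == "__main__":
    # Illustrative words only; they are not taken from the project data.
    for word in ["Haus", "Häuser", "göz", "gözler"]:
        print(word, byt5_input_ids(word), bytes_per_character(word))
```

One consequence is that any per-token score from the model refers to a byte, so scores for multi-byte letters may need to be combined before they can be read as per-character importance.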
The necessary requirements.txt can be found in the root folder.
For both languages (separately) we do the following:
- Compare a pre-trained ByT5 model finetuned on language data with a ByT5 trained on language data from scratch.
  - Compare the learning curves of the models
  - Look into the types of errors
- Analyse the finetuned ByT5 models using feature attribution methods to see whether there are any patterns regarding the importance of input characters for the output characters.
  - For feature attribution we make use of the Inseq library.
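
The model comparison and the look into error types can be illustrated with a small scoring sketch. It assumes that predicted and gold plural forms are available as two parallel lists of strings; how the notebooks actually load and score their outputs is not shown here. The function names `exact_match_accuracy` and `character_edits` are hypothetical, and the example words are made up for illustration.

```python
from difflib import SequenceMatcher


def exact_match_accuracy(predictions, golds):
    """Fraction of predictions that are identical to the gold form."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    correct = sum(pred == gold for pred, gold in zip(predictions, golds))
    return correct / len(golds)


def character_edits(prediction, gold):
    """List the character-level edits that turn the prediction into the gold form.

    Each edit is a (tag, predicted_span, gold_span) tuple, where tag is one of
    "replace", "delete" or "insert" as reported by difflib.
    """
    matcher = SequenceMatcher(a=prediction, b=gold, autojunk=False)
    return [
        (tag, prediction[i1:i2], gold[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    # Made-up predictions and gold forms, not results from the project.
    predictions = ["Häuser", "Hunden", "Hauser"]
    golds = ["Häuser", "Hunde", "Häuser"]
    print(exact_match_accuracy(predictions, golds))
    for pred, gold in zip(predictions, golds):
        print(pred, gold, character_edits(pred, gold))
```

Grouping the returned edits by tag and by position in the word (for example stem versus suffix) is one possible way to arrive at coarse error types, such as a missing umlaut or a wrong ending.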
./
├── byt5.ipynb
├── byt5_learning_curves_finetuning.ipynb
├── byt5_learning_curves_scratch.ipynb
├── byt5_model.py
├── create_plots.ipynb
├── data/
│ ├── deu.dev
│ ├── deu.gold
│ ├── deu.test
│ ├── deu_100.train
│ ├── deu_200.train
│ ├── deu_300.train
│ ├── deu_400.train
│ ├── deu_500.train
│ ├── deu_600.train
│ ├── tur.dev
│ ├── tur.gold
│ ├── tur.test
│ ├── tur_large.train
│ └── tur_small.train
├── error_analysis.ipynb
├── generated_words/
│ ├── generated_words_ger.csv
│ └── generated_words_tur.csv
├── inseq_german.ipynb
├── inseq_turkish.ipynb
├── plot_utils.py
├── README.md
└── requirements.txt
The repository consists of a root directory with a data/ directory. Within the root we find the fundamental $CODE.py files, which hold the larger codebase used within the $ANALYSIS.ipynb files. The $ANALYSIS.ipynb files contain the base analyses for (1) model comparison and generation (byt5_*), (2) learning curve comparison plots, (3) error analysis (error_analysis), and (4) Inseq analyses (inseq_*).
The data/ directory contains the datasets that are used for training (data/*.train) and generation (data/*.gold) for the analyses.
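
The file names in data/ follow a visible pattern: a three-letter language code, an optional subset suffix such as `_100` or `_large`, and a split extension. The sketch below parses that pattern and collects the training files per language, which could be a starting point for iterating over training set sizes. The pattern is inferred from the directory listing only, and the names `DATA_FILE_PATTERN`, `parse_data_filename` and `group_training_files` are hypothetical, not part of the repository.

```python
import re
from collections import defaultdict

# Inferred naming convention: <lang>[_<subset>].<split>, e.g. deu_100.train or tur.gold
DATA_FILE_PATTERN = re.compile(
    r"^(?P<lang>[a-z]{3})(?:_(?P<subset>\w+))?\.(?P<split>train|dev|test|gold)$"
)


def parse_data_filename(name):
    """Split a data file name into (language, subset, split); subset may be None."""
    match = DATA_FILE_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected data file name: {name!r}")
    return match.group("lang"), match.group("subset"), match.group("split")


def group_training_files(names):
    """Map each language code to its training files, keeping the input order."""
    grouped = defaultdict(list)
    for name in names:
        lang, subset, split = parse_data_filename(name)
        if split == "train":
            grouped[lang].append((subset, name))
    return dict(grouped)


if __name__ == "__main__":
    names = ["deu.dev", "deu.gold", "deu_100.train", "deu_200.train",
             "tur.test", "tur_large.train", "tur_small.train"]
    print(group_training_files(names))
```

In practice the list of names could come from something like `sorted(p.name for p in pathlib.Path("data").iterdir())`, assuming the script is run from the repository root.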
The package dependencies can be found in requirements.txt.