This repository contains the data and code needed to reproduce all the experiments presented in:
- Interpreting Neural Language Models for Linguistic Complexity Assessment, Gabriele Sarti. Data Science and Scientific Computing MSc Thesis, University of Trieste, 2020. [Gitbook] [Slides (Long)] [Slides (Short)]
- UmBERTo-MTSA @ AcCompl-It: Improving Complexity and Acceptability Prediction with Multi-task Learning on Self-Supervised Annotations, Gabriele Sarti. Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020). [ArXiv] [CEUR] [Video]
- That Looks Hard: Characterizing Linguistic Complexity in Humans and Language Models, Gabriele Sarti, Dominique Brunato and Felice Dell'Orletta. Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics at NAACL 2021. [ACL Anthology]
If you find these resources useful for your research, please consider citing one or more of the following works:
```bibtex
@mastersthesis{sarti-2020-interpreting,
  author = {Sarti, Gabriele},
  institution = {University of Trieste},
  school = {University of Trieste},
  title = {Interpreting Neural Language Models for Linguistic Complexity Assessment},
  year = {2020}
}

@inproceedings{sarti-2020-umbertomtsa,
  author = {Sarti, Gabriele},
  title = {{UmBERTo-MTSA @ AcCompl-It}: Improving Complexity and Acceptability Prediction with Multi-task Learning on Self-Supervised Annotations},
  booktitle = {Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020)},
  editor = {Basile, Valerio and Croce, Danilo and Di Maro, Maria and Passaro, Lucia C.},
  publisher = {CEUR.org},
  year = {2020},
  address = {Online}
}

@inproceedings{sarti-etal-2021-looks,
  title = "That Looks Hard: Characterizing Linguistic Complexity in Humans and Language Models",
  author = "Sarti, Gabriele and Brunato, Dominique and Dell'Orletta, Felice",
  booktitle = "Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics",
  month = jun,
  year = "2021",
  address = "Mexico City, Mexico",
  publisher = "Association for Computational Linguistics",
  url = "TBD",
  doi = "TBD",
  pages = "TBD",
}
```
Prerequisites
- Python >= 3.6 is required to run the scripts provided in this repository. Torch should be installed using the wheels available on the PyTorch website that are compatible with your CUDA version.
- For CUDA 10 and Python 3.6, we used the wheel `torch-1.3.0-cp36-cp36m-linux_x86_64.whl` (an example install command follows this list).
- Python >= 3.7 is required to run SyntaxGym-related scripts.
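As an illustration of the wheel-based installation mentioned above, the command below installs a CUDA 10.0 build of Torch 1.3.0 from the PyTorch wheel index. The version tag and index URL follow the standard PyTorch installation instructions rather than anything in this repository, so adjust them to your CUDA toolkit and Python version:

```bash
# Illustrative only: install a CUDA 10.0 build of torch 1.3.0 from the PyTorch wheel index.
# Change the +cuXXX tag and version to match your environment.
pip install torch==1.3.0+cu100 -f https://download.pytorch.org/whl/torch_stable.html
```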
Main dependencies
- torch == 1.6.0
- farm == 0.5.0
- transformers == 3.3.1
- syntaxgym
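As a rough sketch, the pinned main dependencies above can also be installed manually with pip; the setup script described in the next section remains the supported path and may pull additional requirements not listed here:

```bash
# Manual install of the pinned main dependencies (sketch only; scripts/setup.sh is the supported path).
pip install torch==1.6.0 farm==0.5.0 transformers==3.3.1 syntaxgym
```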
Setup procedure
```bash
python3 -m venv env
source env/bin/activate
pip install --upgrade pip
./scripts/setup.sh
```

Run `scripts/setup.sh` from the main project folder. This will install the dependencies, download the data and create the repository structure. If you do not want to download the ZuCo MAT files (30GB), edit `setup.sh` and set `DOWNLOAD_ZUCO_MAT_FILES=false`.
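For example, the ZuCo flag can be switched without opening an editor. The sed pattern below assumes the variable is assigned as `DOWNLOAD_ZUCO_MAT_FILES=true` by default, which may not match the actual contents of `setup.sh`:

```bash
# Hypothetical one-liner to disable the 30GB ZuCo MAT download before running setup.sh.
sed -i 's/DOWNLOAD_ZUCO_MAT_FILES=true/DOWNLOAD_ZUCO_MAT_FILES=false/' scripts/setup.sh
```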
You need to manually download the original perceived complexity dataset presented in Brunato et al. 2018 from the ItaliaNLP Lab website and place it in the `data/complexity` folder. The AcCompl-It campaign data and the Dundee corpus cannot be redistributed due to copyright restrictions.
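As a sketch, the manually downloaded complexity corpus simply needs to end up under `data/complexity`; the file name below is purely hypothetical:

```bash
# Hypothetical example: move the manually downloaded complexity corpus into place.
mkdir -p data/complexity
mv ~/Downloads/perceived_complexity_corpus.csv data/complexity/
```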
After all datasets are in their respective folders, run `python scripts/preprocess.py --all` from the main project folder to preprocess them. Refer to the Getting Started section for further steps.
Repository structure
- `data` contains the subfolders for all data used throughout the study:
  - `complexity`: the Perceived Complexity corpus by Brunato et al. 2018.
  - `eyetracking`: eye-tracking corpora (Dundee, GECO, ZuCo 1 & 2).
  - `eval`: the SST dataset used for representational similarity evaluation.
  - `garden_paths`: three test suites taken from the SyntaxGym benchmark.
  - `readability`: OneStopEnglish corpus paragraphs by reading level.
  - `preprocessed`: the preprocessed versions of each corpus produced by `scripts/preprocess.py`.
- `src/lingcomp` is the library built for this work, composed of:
  - `data_utils`: eye-tracking processors and utilities.
  - `farm`: a custom extension of the FARM library adding token-level regression, better multi-task learning for NLMs, and the GPT-2 model.
  - `similarity`: methods used for representational similarity evaluation.
  - `syntaxgym`: methods used to perform evaluation over SyntaxGym test suites.
- `scripts`: used to carry out the analysis and modeling experiments (example invocations follow this list):
  - `shortcuts`: in development; scripts calling other scripts multiple times to provide a quick interface.
  - `analyze_linguistic_features`: produces a report containing correlations across various complexity metrics and linguistic features.
  - `compute_sentence_baselines`: computes sentence-level average, binned average and SVM baselines for complexity scores using cross-validation.
  - `compute_similarity`: evaluates the representational similarity of embeddings produced by neural language models using different methods.
  - `evaluate_garden_paths`: allows using custom metrics (surprisal, gaze metric predictions) to estimate the presence of atypical constructions over SyntaxGym test suites.
  - `finetune_sentence_level`: trains NLMs on sentence-level regression or classification tasks in single- or multi-task settings.
  - `finetune_token_regression`: trains NLMs on token-level regression in single- or multi-task settings.
  - `get_surprisals`: computes surprisal scores produced by NLMs for sentences.
  - `preprocess`: performs initial preprocessing and train/test splitting.
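The invocations below only illustrate how the scripts listed above are typically launched from the project root; the `--help` flag is assumed to be exposed by their argument parsers, and any other option should be checked against the scripts themselves:

```bash
# Inspect the available options of two of the scripts listed above (flags are assumptions).
python scripts/get_surprisals.py --help
python scripts/finetune_sentence_level.py --help
```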
Preprocessing
```bash
# Generate sentence-level dataset for eyetracking
python scripts/preprocess.py \
    --all \
    --do_features \
    --eyetracking_mode sentence \
    --do_train_test_split
```
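As a purely illustrative variant, and assuming the remaining flags are optional, the same script can be run with only a subset of the flags shown above, relying on its defaults for feature extraction and eye-tracking granularity:

```bash
# Minimal variant: preprocess all datasets and produce the train/test split only.
python scripts/preprocess.py --all --do_train_test_split
```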
If you have any questions, feel free to contact me via email (gabriele.sarti996@gmail.com) or open a GitHub issue in this repository!