
Interpreting Models of Linguistic Complexity

This repository contains the data and code needed to reproduce all the experiments from the following works:

Interpreting Neural Language Models for Linguistic Complexity Assessment, Gabriele Sarti, Data Science and Scientific Computing MSc Thesis, University of Trieste, 2020 [Gitbook] [Slides (Long)] [Slides (Short)]

UmBERTo-MTSA @ AcCompl-It: Improving Complexity and Acceptability Prediction with Multi-task Learning on Self-Supervised Annotations, Gabriele Sarti, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020) [ArXiv] [CEUR] [Video]

That Looks Hard: Characterizing Linguistic Complexity in Humans and Language Models, Gabriele Sarti, Dominique Brunato and Felice Dell'Orletta, Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics at NAACL 2021 [ACL Anthology]

If you find these resources useful for your research, please consider citing one or more of the following works:

@mastersthesis{sarti-2020-interpreting,
    author = {Sarti, Gabriele},
    school = {University of Trieste},
    title = {Interpreting Neural Language Models for Linguistic Complexity Assessment},
    year = {2020}
}

@inproceedings{sarti-2020-umbertomtsa,
    author = {Sarti, Gabriele},
    title = {{UmBERTo-MTSA @ AcCompl-It}: Improving Complexity and Acceptability Prediction with Multi-task Learning on Self-Supervised Annotations},
    booktitle = {Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020)},
    editor = {Basile, Valerio and Croce, Danilo and Di Maro, Maria and Passaro, Lucia C.},
    publisher = {CEUR.org},
    year = {2020},
    address = {Online}
}

@inproceedings{sarti-etal-2021-looks,
    title = "That Looks Hard: Characterizing Linguistic Complexity in Humans and Language Models",
    author = "Sarti, Gabriele and
    Brunato, Dominique and
    Dell'Orletta, Felice",
    booktitle = "Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics",
    month = jun,
    year = "2021",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "TBD",
    doi = "TBD",
    pages = "TBD",
}

Overview

This repository studies how neural language models (NLMs) represent and predict linguistic complexity. It brings together perceived-complexity judgments, eye-tracking corpora, and readability data, and probes NLMs through fine-tuning on complexity-related tasks, representational similarity analysis, surprisal estimation, and garden-path test suites from SyntaxGym.

Installation

Prerequisites

  • Python >= 3.6 is required to run the scripts provided in this repository. Torch should be installed using the wheels available on the PyTorch website that match your CUDA version.

  • For CUDA 10 and Python 3.6, we used the wheel torch-1.3.0-cp36-cp36m-linux_x86_64.whl (see the example after this list).

  • Python >= 3.7 is required to run SyntaxGym-related scripts.
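
For example, the CUDA 10 / Python 3.6 setup named above could be installed from PyTorch's stable wheel index; the index URL below follows standard PyTorch practice and is an assumption, not part of this repository's instructions:

# Install the torch 1.3.0 wheel matching the one named above; check the
# PyTorch website if your CUDA version differs.
pip install torch==1.3.0 -f https://download.pytorch.org/whl/torch_stable.html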

Main dependencies

  • torch == 1.6.0
  • farm == 0.5.0
  • transformers == 3.3.1
  • syntaxgym
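
These can also be pinned by hand, assuming the packages resolve from PyPI under these names; scripts/setup.sh (next section) is the supported way to install everything:

# Manual pin of the main dependencies listed above
pip install torch==1.6.0 farm==0.5.0 transformers==3.3.1 syntaxgym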

Setup procedure

python3 -m venv env
source env/bin/activate
pip install --upgrade pip
./scripts/setup.sh

Run scripts/setup.sh from the main project folder. This will install dependencies, download the data and create the repository structure. If you do not want to download the ZuCo MAT files (30GB), edit setup.sh and set DOWNLOAD_ZUCO_MAT_FILES=false.
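
One way to flip the flag in place before running setup, assuming the variable defaults to true (verify this in setup.sh):

# Disable the 30GB ZuCo MAT download (assumes the default value is "true")
sed -i 's/DOWNLOAD_ZUCO_MAT_FILES=true/DOWNLOAD_ZUCO_MAT_FILES=false/' scripts/setup.sh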

You need to manually download the original perceived complexity dataset presented in Brunato et al. 2018 from the ItaliaNLP Lab website and place it in the data/complexity folder.
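
For example, after downloading the corpus (the file name below is a hypothetical placeholder for whatever the ItaliaNLP Lab download provides):

# Hypothetical file name; use the actual name of the downloaded corpus.
mv ~/Downloads/perceived_complexity_corpus.csv data/complexity/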

The AcCompl-IT campaign data and the Dundee corpus cannot be redistributed due to copyright restrictions.

After all datasets are in their respective folders, run python scripts/preprocess.py --all from the main project folder to preprocess the datasets. Refer to the Getting Started section for further steps.

Code Overview

Repository structure

  • data contains the subfolders for all data used throughout the study:

    • complexity: the Perceived Complexity corpus by Brunato et al. 2018.
    • eyetracking: Eye-tracking corpora (Dundee, GECO, ZuCo 1 & 2).
    • eval: SST dataset used for representational similarity evaluation.
    • garden_paths: three test suites taken from the SyntaxGym benchmark.
    • readability: OneStopEnglish corpus paragraphs by reading level.
    • preprocessed: The preprocessed versions of each corpus produced by scripts/preprocess.py.
  • src/lingcomp is the library developed for this work, comprising:

    • data_utils: Eye-tracking processors and utils.
    • farm: Custom extension of the FARM library adding token-level regression, improved multi-task learning for NLMs, and the GPT-2 model.
    • similarity: Methods used for representational similarity evaluation.
    • syntaxgym: Methods used to perform evaluation over SyntaxGym test suites.
  • scripts: Used to carry out the analysis and modeling experiments (a usage sketch follows this list):

    • shortcuts: (in development) scripts that call the other scripts multiple times to provide a quick interface.
    • analyze_linguistic_features: Produces a report containing correlations across various complexity metrics and linguistic features.
    • compute_sentence_baselines: Computes sentence-level avg., binned avg. and SVM baselines for complexity scores using cross-validation.
    • compute_similarity: Evaluates the representational similarity of embeddings produced by neural language models using different methods.
    • evaluate_garden_paths: Uses custom metrics (surprisal, predicted gaze metrics) to estimate the presence of atypical constructions in SyntaxGym test suites.
    • finetune_sentence_level: Trains NLMs on sentence-level regression or classification tasks in single- or multi-task settings.
    • finetune_token_regression: Trains NLMs on token-level regression in single- or multi-task settings.
    • get_surprisals: Computes the surprisal scores produced by NLMs for input sentences.
    • preprocess: Performs initial preprocessing and train/test splitting.
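
All scripts are meant to be run from the project root. Assuming they expose a standard argparse interface (an assumption, not stated in this README), their options can be inspected directly:

# Print the supported arguments of any script, e.g.:
python scripts/preprocess.py --help
python scripts/finetune_sentence_level.py --help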

Getting Started

Preprocessing

# Generate sentence-level dataset for eyetracking
python scripts/preprocess.py \
    --all \
    --do_features \
    --eyetracking_mode sentence \
    --do_train_test_split

⚠️ TODO: Examples for all experiments ⚠️
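
Until the full examples are added, the sketches below suggest plausible invocations for the other experiments. The script names come from the Code Overview above, but every flag shown is a hypothetical placeholder to be checked against each script's actual arguments:

# Hypothetical invocations; all flags are assumptions, not documented options.
python scripts/compute_sentence_baselines.py --data_dir data/preprocessed
python scripts/compute_similarity.py --model gpt2
python scripts/evaluate_garden_paths.py --suites data/garden_paths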

Contacts

If you have any questions, feel free to contact me by email (gabriele.sarti996@gmail.com) or to open a GitHub issue in this repository!