How to Probe Sentence Embeddings in Low-Resource Languages: On Structural Design Choices for Probing Task Evaluation
Code and data for our CoNLL 2020 publication: "How to Probe Sentence Embeddings in Low-Resource Languages: On Structural Design Choices for Probing Task Evaluation"
```
@inproceedings{eger-etal-2020-probe,
    title = "How to Probe Sentence Embeddings in Low-Resource Languages: On Structural Design Choices for Probing Task Evaluation",
    author = "Eger, Steffen and
      Daxenberger, Johannes and
      Gurevych, Iryna",
    booktitle = "Proceedings of the 24th Conference on Computational Natural Language Learning",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.conll-1.8",
    pages = "108--118",
}
```
Our implementation builds on SentEval to train and evaluate classifiers on a given sentence embedding. The following features were added:
- Change the size of a dataset while maintaining its class balance
- Change the balance between classes in a given dataset
- Use the Random Forest and Naive Bayes classifiers from scikit-learn
- Automatically tune hyperparameters for MLP and Random Forest
- Train various sentence embeddings
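The first two features above (shrinking a dataset while preserving its class distribution, and changing the class balance) can be pictured as stratified subsampling. The following is a minimal sketch of that idea, not the repository's actual code; the function name and signature are hypothetical:

```python
import random
from collections import defaultdict

def subsample_balanced(samples, labels, fraction, seed=0):
    """Keep `fraction` of each class so the overall class
    proportions are preserved (hypothetical illustration)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    out_x, out_y = [], []
    for y, xs in by_class.items():
        rng.shuffle(xs)
        k = max(1, int(len(xs) * fraction))  # same share of every class
        out_x.extend(xs[:k])
        out_y.extend([y] * k)
    return out_x, out_y
```

Changing the balance (e.g. to 1:5) works the same way, except that each class gets its own target fraction.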
We also added English datasets, as well as datasets in the following languages:
- Turkish (tr)
- Russian (ru)
- Georgian (ka)
The following probing and downstream tasks were added to SentEval:
Task | Type | Description | Example | Command Line Argument |
---|---|---|---|---|
Voice | Probing | Whether sent. contains a passive construct | He likes cats ⟶ False | Voice |
Subject Verb Agreement | Probing | Whether subject and verb agree | They works together ⟶ Disagree | SubjVerbAgreement |
Subject Verb Distance | Probing | Distance between subject and verb | The delivery was very late ⟶ 1 | SubjVerbDistance |
Argumentation Mining | Downstream | Whether sent. supports or opposes a given topic | (abortion, Abortion is basically murder!) ⟶ opposing | MArgMin |
Sentiment Analysis | Downstream | Positive, negative or neutral sentiment | Never fails to disappoint ⟶ NEG | MSenti |
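To make the Subject Verb Distance example above concrete: under the assumption that the label is simply the number of token positions separating subject and verb (the repository derives these positions from parses), the computation is:

```python
def subj_verb_distance(tokens, subj_idx, verb_idx):
    """Token distance between subject and verb, assuming both
    positions are already known (illustrative sketch only)."""
    return abs(verb_idx - subj_idx)

tokens = ["The", "delivery", "was", "very", "late"]
# subject "delivery" at index 1, verb "was" at index 2 -> distance 1
```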
To set up a virtual environment and install the dependencies:

```
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
Our pretrained multilingual Infersent, Quickthought and RandomLSTM embeddings (checkpoints) can be downloaded from here and placed in `sentence_embeddings/embedder_data`.
To download further embeddings:

```
cd sentence-embeddings
./download_requirements.sh
cd senteval/data/downstream
./get_transfer_data.bash
```
The purpose of `__main__.py` is to generate sentence embeddings and automatically execute our modified version of SentEval on them.
To run an experiment, specify a list of sentence embeddings and tasks, an output file and a classifier. You can specify additional parameters to modify the dataset before running experiments.
Example:

```
python . -s avg -t WordContent -f results.json --mlp
```
The parameter `-s` specifies the sentence embedding, `-t` a list of tasks, and `-f` the result file; `--mlp` sets the classifier to a multilayer perceptron. For an exhaustive list of command line parameters, run `python . --help`.
If the result file already exists and contains valid JSON, the new results will be merged with the existing results. The results are written in the following JSON format:

```
{
    "Experiment 1 parameters": {
        "Sentence embedding 1": {
            "Task 1": {
                ...
            },
            "Task 2": {
                ...
            }
        }
    }
}
```
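The merging behavior described above amounts to a recursive dictionary merge: nested keys from the new run are added under the existing experiment parameters, and leaves for the same task are overwritten. A sketch of that behavior (a hypothetical helper, not the repository's exact code):

```python
def merge_results(old, new):
    """Recursively merge `new` into `old`; on conflicts,
    non-dict values from `new` win (illustrative sketch)."""
    for key, value in new.items():
        if key in old and isinstance(old[key], dict) and isinstance(value, dict):
            merge_results(old[key], value)
        else:
            old[key] = value
    return old

# Usage idea: load results.json with json.load, merge the new run's
# dict into it with merge_results, then write it back with json.dump.
```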
The file `log.txt` contains all parameters, results, and the logs from SentEval.
The following table lists the values we assigned to the command line parameters in our experiments. The corresponding shell commands are given below.
Parameter | Values | Meaning |
---|---|---|
sentence_embeddings | avg; pmean; randomLSTM; infersent; quickthought; LASER; averageMultilingualBERT | Average Pooling; Power Means; Random LSTM; Infersent; QuickThoughts; LASER; mBERT with average pooling |
ntrain | 0.1; 0.5; 1; 100000; 30000; 20000; 10000; 5000; 2000 | 10%; 50%; 100% of the training data, or an absolute number of samples |
lang | en; ru; tr; ka | English; Russian; Turkish; Georgian |
balance | 1 5; 1 10 | 1:5; 1:10 ratio between the class sizes |
```
export embeddings="avg pmean randomLSTM infersent quickthought LASER averageMultilingualBERT"
```
English Probing: for each `$classifier` in `mlp`, `log_reg`, `random_forest`, `naive_bayes` and for each `$ntrain` in `100000`, `30000`, `20000`, `10000`, `5000`, `2000`:

```
python . --lang en -s $embeddings --$classifier --ntrain $ntrain -t Length WordContent Depth TopConstituents BigramShift SubjNumber SubjVerbAgreement SubjVerbDistance Voice -f results.json
```
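The sweep over classifiers and training-set sizes can also be scripted. The sketch below only builds the 24 command lines (4 classifiers × 6 sizes) from the values above and prints them; passing each one to `subprocess.run(cmd.split())` would execute the runs:

```python
from itertools import product

# Values taken from the experiment description above.
classifiers = ["mlp", "log_reg", "random_forest", "naive_bayes"]
ntrains = [100000, 30000, 20000, 10000, 5000, 2000]
tasks = ("Length WordContent Depth TopConstituents BigramShift "
         "SubjNumber SubjVerbAgreement SubjVerbDistance Voice")

def sweep_commands():
    """One command line per (classifier, ntrain) pair."""
    return [
        f"python . --lang en -s avg --{clf} --ntrain {n} -t {tasks} -f results.json"
        for clf, n in product(classifiers, ntrains)
    ]

for cmd in sweep_commands():
    print(cmd)
```

Here `-s avg` stands in for the full `$embeddings` list exported above.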
Multilingual Probing: for each `$lang` in `tr`, `ru`, `ka` (please note that not all probing tasks are available for all languages):

```
python . --lang $lang --log_reg --ntrain 10000 -t Length WordContent Depth TopConstituents BigramShift SubjNumber SubjVerbAgreement SubjVerbDistance Voice -f results.json
```
Multilingual Downstream: for each `$lang` in `tr`, `ru`, `ka`:

```
python . --log_reg --lang $lang -t MArgMin MSenti MTREC
```

and for English:

```
python . --log_reg --lang en -t MArgMin MSenti TREC
```