SentEval: evaluation toolkit for sentence embeddings

SentEval is a library for evaluating the quality of sentence embeddings. We assess their generalization power by using them as features on a broad and diverse set of "transfer" tasks. SentEval currently includes 17 downstream tasks. We also include a suite of 10 probing tasks which evaluate what linguistic properties are encoded in sentence embeddings. Our goal is to ease the study and the development of general-purpose fixed-size sentence representations.

(04/22) SentEval new tasks: Added probing tasks for evaluating what linguistic properties are encoded in sentence embeddings

(10/04) SentEval example scripts for three sentence encoders: SkipThought-LN/GenSen/Google-USE

Dependencies

This code is written in Python. The dependencies are:

pip install --editable sentence-transformers/

Transfer tasks

Downstream tasks

SentEval allows you to evaluate your sentence embeddings as features for the following downstream tasks:

Task	Type	#train	#test	needs_train	set_classifier
MR	movie review	11k	11k	1	1
CR	product review	4k	4k	1	1
SUBJ	subjectivity status	10k	10k	1	1
MPQA	opinion-polarity	11k	11k	1	1
SST	binary sentiment analysis	67k	1.8k	1	1
SST	fine-grained sentiment analysis	8.5k	2.2k	1	1
TREC	question-type classification	6k	0.5k	1	1
SICK-E	natural language inference	4.5k	4.9k	1	1
SNLI	natural language inference	550k	9.8k	1	1
MRPC	paraphrase detection	4.1k	1.7k	1	1
STS 2012	semantic textual similarity	N/A	3.1k	0	0
STS 2013	semantic textual similarity	N/A	1.5k	0	0
STS 2014	semantic textual similarity	N/A	3.7k	0	0
STS 2015	semantic textual similarity	N/A	8.5k	0	0
STS 2016	semantic textual similarity	N/A	9.2k	0	0
STS B	semantic textual similarity	5.7k	1.4k	1	0
SICK-R	semantic textual similarity	4.5k	4.9k	1	0
COCO	image-caption retrieval	567k	5*1k	1	0

where needs_train means a model with parameters is learned on top of the sentence embeddings, and set_classifier means you can define the parameters of the classifier in the case of a classification task (see below).

Note: COCO comes with ResNet-101 2048d image embeddings. More details on the tasks.

Probing tasks

SentEval also includes a series of probing tasks to evaluate what linguistic properties are encoded in your sentence embeddings:

Task	Type	#train	#test	needs_train	set_classifier
SentLen	Length prediction	100k	10k	1	1
WC	Word Content analysis	100k	10k	1	1
TreeDepth	Tree depth prediction	100k	10k	1	1
TopConst	Top Constituents prediction	100k	10k	1	1
BShift	Word order analysis	100k	10k	1	1
Tense	Verb tense prediction	100k	10k	1	1
SubjNum	Subject number prediction	100k	10k	1	1
ObjNum	Object number prediction	100k	10k	1	1
SOMO	Semantic odd man out	100k	10k	1	1
CoordInv	Coordination Inversion	100k	10k	1	1

Download datasets

To get all the transfer task datasets, run (in data/downstream/):

./get_transfer_data.bash

This will automatically download and preprocess the downstream datasets, and store them in data/downstream (warning: macOS users may have to use p7zip instead of unzip). The probing tasks are already in data/probing by default.

How to use SentEval: examples

examples/bow.py

In examples/bow.py, we evaluate the quality of the average of word embeddings.

To download GloVe and state-of-the-art fastText embeddings:

curl -Lo glove.840B.300d.zip http://nlp.stanford.edu/data/glove.840B.300d.zip
curl -Lo crawl-300d-2M.vec.zip https://s3-us-west-1.amazonaws.com/fasttext-vectors/crawl-300d-2M.vec.zip

To reproduce the results for bag-of-vectors, run (in examples/):

python bow.py

As required by SentEval, this script implements two functions: prepare (optional) and batcher (required) that turn text sentences into sentence embeddings. Then SentEval takes care of the evaluation on the transfer tasks using the embeddings as features.

examples/infersent.py

To get the InferSent model and reproduce our results, download our best models and run infersent.py (in examples/):

curl -Lo examples/infersent1.pkl https://dl.fbaipublicfiles.com/senteval/infersent/infersent1.pkl
curl -Lo examples/infersent2.pkl https://dl.fbaipublicfiles.com/senteval/infersent/infersent2.pkl

examples/skipthought.py - examples/gensen.py - examples/googleuse.py

We also provide example scripts for three other encoders:

SkipThought with Layer-Normalization in Theano
GenSen encoder in Pytorch
Google encoder in TensorFlow

Note that for SkipThought and GenSen, you need to follow the setup steps in the associated GitHub repositories. The Google encoder script should work as-is.

How to use SentEval

To evaluate your sentence embeddings, SentEval requires that you implement two functions:

prepare (sees the whole dataset of each task and can thus construct the word vocabulary, the dictionary of word vectors etc)
batcher (transforms a batch of text sentences into sentence embeddings)

1.) prepare(params, samples) (optional)

batcher only sees one batch at a time while the samples argument of prepare contains all the sentences of a task.

prepare(params, samples)

params: senteval parameters.
samples: list of all sentences from the transfer task.
output: No output. Arguments stored in "params" can further be used by batcher.

Example: in bow.py, prepare is used to build the vocabulary of words and construct the params.word_vec dictionary of word vectors.
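As an illustration of the prepare contract described above, here is a minimal sketch. It assumes each sample arrives as a list of tokens and uses a SimpleNamespace as a stand-in for the SentEval params object. The random vectors and the 4-dimensional wvec_dim are placeholders for real pretrained embeddings, not what bow.py actually loads.

```python
from types import SimpleNamespace

import numpy as np


def prepare(params, samples):
    """Build a vocabulary and a toy word-vector lookup from all task sentences."""
    # Assumption: each sample is a tokenized sentence (a list of words).
    vocab = sorted({word for sentence in samples for word in sentence})
    # Placeholder vectors: a real encoder would load pretrained embeddings here.
    rng = np.random.RandomState(0)
    params.wvec_dim = 4
    params.word_vec = {word: rng.randn(params.wvec_dim) for word in vocab}
    # Nothing is returned: batcher later reads what was stored on params.


# Stand-in for the params object that SentEval would pass in.
params = SimpleNamespace()
prepare(params, [["a", "good", "movie"], ["a", "bad", "movie"]])
print(sorted(params.word_vec))  # ['a', 'bad', 'good', 'movie']
```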

2.) batcher(params, batch)

batcher(params, batch)

params: senteval parameters.
batch: numpy array of text sentences (of size params.batch_size)
output: numpy array of sentence embeddings (of size params.batch_size)

Example: in bow.py, batcher is used to compute the mean of the word vectors for each sentence in the batch using params.word_vec. Use your own encoder in that function to encode sentences.
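The averaging strategy described above can be sketched as follows. This is not the code from bow.py: the params object is a hypothetical SimpleNamespace, the two-dimensional vectors are made up for the example, and the zero-vector fallback for sentences with no known word is an assumption of this sketch.

```python
from types import SimpleNamespace

import numpy as np


def batcher(params, batch):
    """Return one embedding per sentence: the mean of its known word vectors."""
    embeddings = []
    for sentence in batch:
        vectors = [params.word_vec[w] for w in sentence if w in params.word_vec]
        if not vectors:
            # No known word: use a zero vector so every sentence still gets a row.
            vectors = [np.zeros(params.wvec_dim)]
        embeddings.append(np.mean(vectors, axis=0))
    # Shape: (number of sentences in the batch, embedding dimension).
    return np.vstack(embeddings)


# Toy word vectors, for illustration only.
params = SimpleNamespace(
    wvec_dim=2,
    word_vec={"good": np.array([1.0, 0.0]), "movie": np.array([0.0, 1.0])},
)
out = batcher(params, [["good", "movie"], ["unknown"]])
print(out.shape)  # (2, 2)
```

To evaluate a different encoder, only the body of the loop needs to change, as long as the function still returns one row per input sentence.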

3.) evaluation on transfer tasks

After implementing the batcher and prepare functions for your own sentence encoder, import senteval and set its parameters to perform the actual evaluation:

import senteval
params = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 10}

(optional) set the parameters of the classifier (when applicable):

params['classifier'] = {'nhid': 0, 'optim': 'adam', 'batch_size': 64,
                                 'tenacity': 5, 'epoch_size': 4}

You can choose nhid=0 (Logistic Regression) or nhid>0 (MLP) and define the parameters for training.

Create an instance of the class SE:

se = senteval.engine.SE(params, batcher, prepare)

Define the set of transfer tasks and run the evaluation:

transfer_tasks = ['MR', 'SICKEntailment', 'STS14', 'STSBenchmark']
results = se.eval(transfer_tasks)

The current list of available tasks is:

['CR', 'MR', 'MPQA', 'SUBJ', 'SST2', 'SST5', 'TREC', 'MRPC', 'SNLI',
'SICKEntailment', 'SICKRelatedness', 'STSBenchmark', 'ImageCaptionRetrieval',
'STS12', 'STS13', 'STS14', 'STS15', 'STS16',
'Length', 'WordContent', 'Depth', 'TopConstituents','BigramShift', 'Tense',
'SubjNumber', 'ObjNumber', 'OddManOut', 'CoordinationInversion']

SentEval parameters

Global parameters of SentEval:

# senteval parameters
task_path                   # path to SentEval datasets (required)
seed                        # seed
usepytorch                  # use cuda-pytorch (else scikit-learn) where possible
kfold                       # k-fold validation for MR/CR/SUBJ/MPQA.

Parameters of the classifier:

nhid:                       # number of hidden units (0: Logistic Regression, >0: MLP); Default nonlinearity: Tanh
optim:                      # optimizer ("sgd,lr=0.1", "adam", "rmsprop" ..)
tenacity:                   # how many times dev acc does not increase before training stops
epoch_size:                 # each epoch corresponds to epoch_size passes on the train set
max_epoch:                  # max number of epochs
dropout:                    # dropout for MLP
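To make the role of tenacity and max_epoch concrete, here is a simplified early-stopping sketch. It is a common patience-style variant written for illustration: the function name train_with_tenacity is hypothetical, the dev accuracies are simulated rather than produced by training, and the exact counting used inside SentEval may differ.

```python
def train_with_tenacity(dev_accuracies, tenacity=5, max_epoch=200):
    """Simulate early stopping; return (best dev accuracy, epochs run)."""
    best_acc = -1.0
    stalls = 0  # consecutive epochs without improvement on the dev set
    epochs_run = 0
    for acc in dev_accuracies[:max_epoch]:
        epochs_run += 1
        if acc > best_acc:
            best_acc = acc
            stalls = 0
        else:
            stalls += 1
            if stalls >= tenacity:
                break  # dev accuracy has stalled for `tenacity` epochs
    return best_acc, epochs_run


# Simulated dev accuracies; the final 0.9 is never reached with tenacity=3.
accs = [0.60, 0.70, 0.69, 0.68, 0.71, 0.70, 0.70, 0.70, 0.90]
print(train_with_tenacity(accs, tenacity=3))  # (0.71, 8)
```

A lower tenacity stops training sooner, which is why the prototyping config reduces it.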

Note that to get a proxy of the results while dramatically reducing computation time, we suggest the prototyping config:

params = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 5}
params['classifier'] = {'nhid': 0, 'optim': 'rmsprop', 'batch_size': 128,
                                 'tenacity': 3, 'epoch_size': 2}

which results in a 5 times speedup for classification tasks.

To produce results that are comparable to the literature, use the default config:

params = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 10}
params['classifier'] = {'nhid': 0, 'optim': 'adam', 'batch_size': 64,
                                 'tenacity': 5, 'epoch_size': 4}

which takes longer but will produce better and comparable results.
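One way to switch between the two configurations is a small helper like the sketch below. The function make_params and its prototyping flag are hypothetical conveniences, not part of SentEval; the values are copied from the two configs above.

```python
def make_params(task_path, prototyping=False):
    """Return a SentEval params dict for the prototyping or the default config."""
    if prototyping:
        # Faster, approximate results.
        params = {'task_path': task_path, 'usepytorch': True, 'kfold': 5}
        params['classifier'] = {'nhid': 0, 'optim': 'rmsprop', 'batch_size': 128,
                                'tenacity': 3, 'epoch_size': 2}
    else:
        # Slower, but comparable to results in the literature.
        params = {'task_path': task_path, 'usepytorch': True, 'kfold': 10}
        params['classifier'] = {'nhid': 0, 'optim': 'adam', 'batch_size': 64,
                                'tenacity': 5, 'epoch_size': 4}
    return params


# 'data' is a placeholder path to the SentEval datasets.
fast = make_params('data', prototyping=True)
full = make_params('data')
print(fast['kfold'], full['kfold'])  # 5 10
```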

For probing tasks, we used an MLP with a Sigmoid nonlinearity and tuned nhid (in [50, 100, 200]) and dropout (in [0.0, 0.1, 0.2]) on the dev set.

References

Please consider citing [1] if you use this code to evaluate sentence embedding methods.

SentEval: An Evaluation Toolkit for Universal Sentence Representations

[1] A. Conneau, D. Kiela, SentEval: An Evaluation Toolkit for Universal Sentence Representations

@article{conneau2018senteval,
  title={SentEval: An Evaluation Toolkit for Universal Sentence Representations},
  author={Conneau, Alexis and Kiela, Douwe},
  journal={arXiv preprint arXiv:1803.05449},
  year={2018}
}

Contact: aconneau@fb.com, dkiela@fb.com
