SentEval: evaluation toolkit for sentence embeddings
SentEval is a library for evaluating the quality of sentence embeddings. We assess their generalization power by using them as features on a broad and diverse set of "transfer" tasks. SentEval currently includes 17 downstream tasks. We also include a suite of 10 probing tasks which evaluate what linguistic properties are encoded in sentence embeddings. Our goal is to ease the study and the development of general-purpose fixed-size sentence representations.
(04/22) SentEval new tasks: Added probing tasks for evaluating what linguistic properties are encoded in sentence embeddings
(10/04) SentEval example scripts for three sentence encoders: SkipThought-LN/GenSen/Google-USE
Dependencies
This code is written in python. The dependencies are:
- Python 2/3 with NumPy/SciPy
- Pytorch>=0.4
- scikit-learn>=0.18.0
Transfer tasks
Downstream tasks
SentEval allows you to evaluate your sentence embeddings as features for the following downstream tasks:
Task | Type | #train | #test | needs_train | set_classifier |
---|---|---|---|---|---|
MR | movie review | 11k | 11k | 1 | 1 |
CR | product review | 4k | 4k | 1 | 1 |
SUBJ | subjectivity status | 10k | 10k | 1 | 1 |
MPQA | opinion-polarity | 11k | 11k | 1 | 1 |
SST | binary sentiment analysis | 67k | 1.8k | 1 | 1 |
SST | fine-grained sentiment analysis | 8.5k | 2.2k | 1 | 1 |
TREC | question-type classification | 6k | 0.5k | 1 | 1 |
SICK-E | natural language inference | 4.5k | 4.9k | 1 | 1 |
SNLI | natural language inference | 550k | 9.8k | 1 | 1 |
MRPC | paraphrase detection | 4.1k | 1.7k | 1 | 1 |
STS 2012 | semantic textual similarity | N/A | 3.1k | 0 | 0 |
STS 2013 | semantic textual similarity | N/A | 1.5k | 0 | 0 |
STS 2014 | semantic textual similarity | N/A | 3.7k | 0 | 0 |
STS 2015 | semantic textual similarity | N/A | 8.5k | 0 | 0 |
STS 2016 | semantic textual similarity | N/A | 9.2k | 0 | 0 |
STS B | semantic textual similarity | 5.7k | 1.4k | 1 | 0 |
SICK-R | semantic textual similarity | 4.5k | 4.9k | 1 | 0 |
COCO | image-caption retrieval | 567k | 5*1k | 1 | 0 |
where needs_train means a model with parameters is learned on top of the sentence embeddings, and set_classifier means you can define the parameters of the classifier in the case of a classification task (see below).
Note: COCO comes with ResNet-101 2048d image embeddings. More details on the tasks.
Probing tasks
SentEval also includes a series of probing tasks to evaluate what linguistic properties are encoded in your sentence embeddings:
Task | Type | #train | #test | needs_train | set_classifier |
---|---|---|---|---|---|
SentLen | Length prediction | 100k | 10k | 1 | 1 |
WC | Word Content analysis | 100k | 10k | 1 | 1 |
TreeDepth | Tree depth prediction | 100k | 10k | 1 | 1 |
TopConst | Top Constituents prediction | 100k | 10k | 1 | 1 |
BShift | Word order analysis | 100k | 10k | 1 | 1 |
Tense | Verb tense prediction | 100k | 10k | 1 | 1 |
SubjNum | Subject number prediction | 100k | 10k | 1 | 1 |
ObjNum | Object number prediction | 100k | 10k | 1 | 1 |
SOMO | Semantic odd man out | 100k | 10k | 1 | 1 |
CoordInv | Coordination Inversion | 100k | 10k | 1 | 1 |
Download datasets
To get all the transfer tasks datasets, run (in data/downstream/):
./get_transfer_data.bash
This will automatically download and preprocess the downstream datasets, and store them in data/downstream (warning: for MacOS users, you may have to use p7zip instead of unzip). The probing tasks are already in data/probing by default.
How to use SentEval: examples
examples/bow.py
In examples/bow.py, we evaluate the quality of the average of word embeddings.
To download state-of-the-art fastText embeddings:
curl -Lo glove.840B.300d.zip http://nlp.stanford.edu/data/glove.840B.300d.zip
curl -Lo crawl-300d-2M.vec.zip https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip
To reproduce the results for bag-of-vectors, run (in examples/):
python bow.py
As required by SentEval, this script implements two functions: prepare (optional) and batcher (required) that turn text sentences into sentence embeddings. Then SentEval takes care of the evaluation on the transfer tasks using the embeddings as features.
examples/infersent.py
To get the InferSent model and reproduce our results, download our best models and run infersent.py (in examples/):
curl -Lo examples/infersent1.pkl https://dl.fbaipublicfiles.com/senteval/infersent/infersent1.pkl
curl -Lo examples/infersent2.pkl https://dl.fbaipublicfiles.com/senteval/infersent/infersent2.pkl
examples/skipthought.py - examples/gensen.py - examples/googleuse.py
We also provide example scripts for three other encoders:
- SkipThought with Layer-Normalization in Theano
- GenSen encoder in Pytorch
- Google encoder in TensorFlow
Note that for SkipThought and GenSen, following the steps of the associated githubs is necessary. The Google encoder script should work as-is.
How to use SentEval
To evaluate your sentence embeddings, SentEval requires that you implement two functions:
- prepare (sees the whole dataset of each task and can thus construct the word vocabulary, the dictionary of word vectors etc)
- batcher (transforms a batch of text sentences into sentence embeddings)
1.) prepare(params, samples) (optional)
batcher only sees one batch at a time while the samples argument of prepare contains all the sentences of a task.
prepare(params, samples)
- params: senteval parameters.
- samples: list of all sentences from the tranfer task.
- output: No output. Arguments stored in "params" can further be used by batcher.
Example: in bow.py, prepare is is used to build the vocabulary of words and construct the "params.word_vect* dictionary of word vectors.
2.) batcher(params, batch)
batcher(params, batch)
- params: senteval parameters.
- batch: numpy array of text sentences (of size params.batch_size)
- output: numpy array of sentence embeddings (of size params.batch_size)
Example: in bow.py, batcher is used to compute the mean of the word vectors for each sentence in the batch using params.word_vec. Use your own encoder in that function to encode sentences.
3.) evaluation on transfer tasks
After having implemented the batch and prepare function for your own sentence encoder,
- to perform the actual evaluation, first import senteval and set its parameters:
import senteval
params = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 10}
- (optional) set the parameters of the classifier (when applicable):
params['classifier'] = {'nhid': 0, 'optim': 'adam', 'batch_size': 64,
'tenacity': 5, 'epoch_size': 4}
You can choose nhid=0 (Logistic Regression) or nhid>0 (MLP) and define the parameters for training.
- Create an instance of the class SE:
se = senteval.engine.SE(params, batcher, prepare)
- define the set of transfer tasks and run the evaluation:
transfer_tasks = ['MR', 'SICKEntailment', 'STS14', 'STSBenchmark']
results = se.eval(transfer_tasks)
The current list of available tasks is:
['CR', 'MR', 'MPQA', 'SUBJ', 'SST2', 'SST5', 'TREC', 'MRPC', 'SNLI',
'SICKEntailment', 'SICKRelatedness', 'STSBenchmark', 'ImageCaptionRetrieval',
'STS12', 'STS13', 'STS14', 'STS15', 'STS16',
'Length', 'WordContent', 'Depth', 'TopConstituents','BigramShift', 'Tense',
'SubjNumber', 'ObjNumber', 'OddManOut', 'CoordinationInversion']
SentEval parameters
Global parameters of SentEval:
# senteval parameters
task_path # path to SentEval datasets (required)
seed # seed
usepytorch # use cuda-pytorch (else scikit-learn) where possible
kfold # k-fold validation for MR/CR/SUB/MPQA.
Parameters of the classifier:
nhid: # number of hidden units (0: Logistic Regression, >0: MLP); Default nonlinearity: Tanh
optim: # optimizer ("sgd,lr=0.1", "adam", "rmsprop" ..)
tenacity: # how many times dev acc does not increase before training stops
epoch_size: # each epoch corresponds to epoch_size pass on the train set
max_epoch: # max number of epoches
dropout: # dropout for MLP
Note that to get a proxy of the results while dramatically reducing computation time, we suggest the prototyping config:
params = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 5}
params['classifier'] = {'nhid': 0, 'optim': 'rmsprop', 'batch_size': 128,
'tenacity': 3, 'epoch_size': 2}
which will results in a 5 times speedup for classification tasks.
To produce results that are comparable to the literature, use the default config:
params = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 10}
params['classifier'] = {'nhid': 0, 'optim': 'adam', 'batch_size': 64,
'tenacity': 5, 'epoch_size': 4}
which takes longer but will produce better and comparable results.
For probing tasks, we used an MLP with a Sigmoid nonlinearity and and tuned the nhid (in [50, 100, 200]) and dropout (in [0.0, 0.1, 0.2]) on the dev set.
References
Please considering citing [1] if using this code for evaluating sentence embedding methods.
SentEval: An Evaluation Toolkit for Universal Sentence Representations
[1] A. Conneau, D. Kiela, SentEval: An Evaluation Toolkit for Universal Sentence Representations
@article{conneau2018senteval,
title={SentEval: An Evaluation Toolkit for Universal Sentence Representations},
author={Conneau, Alexis and Kiela, Douwe},
journal={arXiv preprint arXiv:1803.05449},
year={2018}
}
Contact: aconneau@fb.com, dkiela@fb.com
Related work
- J. R Kiros, Y. Zhu, R. Salakhutdinov, R. S. Zemel, A. Torralba, R. Urtasun, S. Fidler - SkipThought Vectors, NIPS 2015
- S. Arora, Y. Liang, T. Ma - A Simple but Tough-to-Beat Baseline for Sentence Embeddings, ICLR 2017
- Y. Adi, E. Kermany, Y. Belinkov, O. Lavi, Y. Goldberg - Fine-grained analysis of sentence embeddings using auxiliary prediction tasks, ICLR 2017
- A. Conneau, D. Kiela, L. Barrault, H. Schwenk, A. Bordes - Supervised Learning of Universal Sentence Representations from Natural Language Inference Data, EMNLP 2017
- S. Subramanian, A. Trischler, Y. Bengio, C. J Pal - Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning, ICLR 2018
- A. Nie, E. D. Bennett, N. D. Goodman - DisSent: Sentence Representation Learning from Explicit Discourse Relations, 2018
- D. Cer, Y. Yang, S. Kong, N. Hua, N. Limtiaco, R. St. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, Y. Sung, B. Strope, R. Kurzweil - Universal Sentence Encoder, 2018
- A. Conneau, G. Kruszewski, G. Lample, L. Barrault, M. Baroni - What you can cram into a single vector: Probing sentence embeddings for linguistic properties, ACL 2018