This is the repository for the TweetEval benchmark (Findings of EMNLP 2020). TweetEval consists of seven heterogenous tasks in Twitter, all framed as multi-class tweet classification. All tasks have been unified into the same benchmark, with each dataset presented in the same format and with fixed training, validation and test splits.
These are the seven datasets of TweetEval, with its corresponding labels (more details about the format in the datasets directory):
-
Emotion Recognition: SemEval 2018 (Emotion Recognition) - 4 labels:
anger
,joy
,sadness
,optimism
-
Emoji Prediction, SemEval 2018 (Emoji Prediction) - 20 labels: ❤️, 😍, 😂
...
🌲, 📷, 😜 -
Irony Detection, SemEval 2018 (Irony Detection) - 2 labels:
irony
,not irony
-
Hate Speech Detection, SemEval 2019 (Hateval) - 2 labels:
hateful
,not hateful
-
Offensive Language Identification, SemEval 2019 (OffensEval) - 2 labels:
offensive
,not offensive
-
Sentiment Analysis*, SemEval 2017 (Sentiment Analysis in Twitter) - 3 labels:
positive
,neutral
,negative
-
Stance Detection*, SemEval 2016 (Detecting Stance in Tweets) - 3 labels:
favour
,neutral
,against
Note 1*: For stance there are five different target topics (Abortion, Atheism, Climate change, Feminism and Hillary Clinton), each of which contains its own training, validation and test data.
Note 2*: The sentiment dataset has been updated as of 17 December 2020. The update has been minimal and it was intended to fix a small number of sentences that were cropped.
Model | Emoji | Emotion | Hate | Irony | Offensive | Sentiment | Stance | ALL(TE) | Reference |
---|---|---|---|---|---|---|---|---|---|
BERTweet | 33.4 | 79.3 | 56.4 | 82.1 | 79.5 | 73.4 | 71.2 | 67.9 | BERTweet |
RoBERTa-Retrained | 31.4 | 78.5 | 52.3 | 61.7 | 80.5 | 72.8 | 69.3 | 65.2 | TweetEval |
RoBERTa-Base | 30.9 | 76.1 | 46.6 | 59.7 | 79.5 | 71.3 | 68 | 61.3 | TweetEval |
RoBERTa-Twitter | 29.3 | 72.0 | 49.9 | 65.4 | 77.1 | 69.1 | 66.7 | 61.4 | TweetEval |
FastText | 25.8 | 65.2 | 50.6 | 63.1 | 73.4 | 62.9 | 65.4 | 58.1 | TweetEval |
LSTM | 24.7 | 66.0 | 52.6 | 62.8 | 71.7 | 58.3 | 59.4 | 56.5 | TweetEval |
SVM | 29.3 | 64.7 | 36.7 | 61.7 | 52.3 | 62.9 | 67.3 | 53.5 | TweetEval |
Note*: Check the reference paper for details on the official metrics for each task
If you would like to have your results added to the leaderboard you can either submit a pull request or send an email to any of the paper authors with results and the predictions of your model. Please also submit a reference to a paper describing your approach.
For evaluating your system, you simply need an individual predictions file for each of the tasks. The format of the predictions file should be the same as the output examples in the predictions folder (one output label per line as per the original test file). The predictions included as an example in this repo correspond to the best model evaluated in the paper, i.e., RoBERTa re-trained on Twitter (RoB-Rt in the paper).
python evaluation_script.py
The script takes the TweetEval gold test labels and the predictions from the "predictions" folder by default, but you can set this to suit your needs as optional arguments.
Three optional arguments can be modified:
--tweeteval_path: Path to TweetEval datasets. Default: "./datasets/"
--predictions_path: Path to predictions directory. Default: "./predictions/"
--task: Use this to get single task detailed results (emoji|emotion|hate|irony|offensive|sentiment|stance). Default: ""
Evaluation script sample usage from the terminal with parameters:
python evaluation_script.py --tweeteval_path ./datasets/ --predictions_path ./predictions/ --task emoji
(this script would output the breakdown of the results for the emoji prediction task only)
You can download the best Twitter masked language model (RoBERTa-retrained in the paper) from 🤗HuggingFace here. We also provide task-specific models:
- Twitter-RoBERTa-emoji
- Twitter-RoBERTa-emotion
- Twitter-RoBERTa-hate
- Twitter-RoBERTa-irony
- Twitter-RoBERTa-offensive
- Twitter-RoBERTa-sentiment
- Twitter-RoBERTa-stance - Coming soon
To know how to use the pre-trained models, you can check our Google Colab Notebook, with sample code for masked language modeling, extracting embeddings from tweets and tweet classification.
NEW! A multilingual language model trained on Twitter for 30+ languages (XLM-T) is now available here
If you use TweetEval in your research, please use the following bib
entry to cite the reference paper.
@inproceedings{barbieri2020tweeteval,
title={{TweetEval:Unified Benchmark and Comparative Evaluation for Tweet Classification}},
author={Barbieri, Francesco and Camacho-Collados, Jose and Espinosa-Anke, Luis and Neves, Leonardo},
booktitle={Proceedings of Findings of EMNLP},
year={2020}
}
TweetEval is released without any restrictions but restrictions may apply to individual tasks (which are derived from existing datasets) or Twitter (main data source). We refer users to the original licenses accompanying each dataset and Twitter regulations.
If you use any of the TweetEval datasets, please cite their original publications:
@inproceedings{mohammad2018semeval,
title={Semeval-2018 task 1: Affect in tweets},
author={Mohammad, Saif and Bravo-Marquez, Felipe and Salameh, Mohammad and Kiritchenko, Svetlana},
booktitle={Proceedings of the 12th international workshop on semantic evaluation},
pages={1--17},
year={2018}
}
@inproceedings{barbieri2018semeval,
title={Semeval 2018 task 2: Multilingual emoji prediction},
author={Barbieri, Francesco and Camacho-Collados, Jose and Ronzano, Francesco and Espinosa-Anke, Luis and
Ballesteros, Miguel and Basile, Valerio and Patti, Viviana and Saggion, Horacio},
booktitle={Proceedings of The 12th International Workshop on Semantic Evaluation},
pages={24--33},
year={2018}
}
@inproceedings{van2018semeval,
title={Semeval-2018 task 3: Irony detection in english tweets},
author={Van Hee, Cynthia and Lefever, Els and Hoste, V{\'e}ronique},
booktitle={Proceedings of The 12th International Workshop on Semantic Evaluation},
pages={39--50},
year={2018}
}
@inproceedings{basile-etal-2019-semeval,
title = "{S}em{E}val-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in {T}witter",
author = "Basile, Valerio and Bosco, Cristina and Fersini, Elisabetta and Nozza, Debora and Patti, Viviana and
Rangel Pardo, Francisco Manuel and Rosso, Paolo and Sanguinetti, Manuela",
booktitle = "Proceedings of the 13th International Workshop on Semantic Evaluation",
year = "2019",
address = "Minneapolis, Minnesota, USA",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/S19-2007",
doi = "10.18653/v1/S19-2007",
pages = "54--63"
}
@inproceedings{zampieri2019semeval,
title={SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval)},
author={Zampieri, Marcos and Malmasi, Shervin and Nakov, Preslav and Rosenthal, Sara and Farra, Noura and Kumar, Ritesh},
booktitle={Proceedings of the 13th International Workshop on Semantic Evaluation},
pages={75--86},
year={2019}
}
@inproceedings{rosenthal2017semeval,
title={SemEval-2017 task 4: Sentiment analysis in Twitter},
author={Rosenthal, Sara and Farra, Noura and Nakov, Preslav},
booktitle={Proceedings of the 11th international workshop on semantic evaluation (SemEval-2017)},
pages={502--518},
year={2017}
}
@inproceedings{mohammad2016semeval,
title={Semeval-2016 task 6: Detecting stance in tweets},
author={Mohammad, Saif and Kiritchenko, Svetlana and Sobhani, Parinaz and Zhu, Xiaodan and Cherry, Colin},
booktitle={Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)},
pages={31--41},
year={2016}
}