This dataset contains 4,403 Indonesian tweets, each labeled with one of five emotion classes: love, anger, sadness, joy, and fear.
Each line consists of a tweet and its respective emotion label, separated by a comma (,). The first line is a header. A tweet that contains a comma (,) in its text is enclosed in double quotes (" ") so that it is not split into extra columns.
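As a rough illustration of the layout described above, the following sketch reads such a file with Python's standard csv module. The header names, the column order, and the sample rows are assumptions made up for this example; check the header line of the actual file before relying on them.

```python
import csv
import io

# Hypothetical sample in the described layout. The header names, column
# order, and rows are invented for illustration, not taken from the dataset.
SAMPLE = (
    "label,tweet\n"
    'joy,"Akhirnya libur juga, senang banget"\n'
    "anger,[USERNAME] kenapa paketnya belum sampai\n"
)

def read_rows(handle):
    """Return (header, rows) from a comma-separated file with a header line."""
    # The default csv dialect uses a comma delimiter and double-quote quoting,
    # so a quoted tweet containing commas stays in a single column.
    reader = csv.reader(handle)
    header = next(reader)
    rows = [row for row in reader if row]  # skip blank lines
    return header, rows

header, rows = read_rows(io.StringIO(SAMPLE))
print(header)
print(rows[0])
```

To read a real file instead of the in-memory sample, pass a handle opened with open(path, newline="", encoding="utf-8"), where path is wherever the dataset file was saved.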
The tweets in this dataset have been pre-processed according to the following criteria:
- Username mentions (@) have been replaced with the term [USERNAME]
- URLs/hyperlinks (http://... or https://...) have been replaced with the term [URL]
- Sensitive numbers, such as phone numbers, invoice numbers, and courier tracking numbers, have been replaced with the term [SENSITIVE-NO]
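To apply similar masking to new tweets before using them with this dataset, something like the sketch below may be a reasonable starting point. The regular expressions are assumptions, not the rules actually used to build the dataset; in particular, treating any run of 8 or more digits as a sensitive number is only a guess at what such a rule could look like. The function name mask_tweet and the example text are hypothetical.

```python
import re

# Assumed patterns: the exact rules used for the dataset are not documented here.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
# Assumption: any run of 8 or more digits is treated as a sensitive number.
LONG_NUMBER_RE = re.compile(r"\d{8,}")

def mask_tweet(text):
    """Replace URLs, mentions, and long digit runs with placeholder terms."""
    # URLs are masked first so that digits inside a URL are not masked separately.
    text = URL_RE.sub("[URL]", text)
    text = MENTION_RE.sub("[USERNAME]", text)
    text = LONG_NUMBER_RE.sub("[SENSITIVE-NO]", text)
    return text

example = "@budi cek resi 123456789012 di https://example.com/track"
print(mask_tweet(example))  # [USERNAME] cek resi [SENSITIVE-NO] di [URL]
```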
We have trained Word2Vec and FastText vectors on 1 million Indonesian tweets. These pre-trained word embeddings can be downloaded here.
If you publish a paper that uses this dataset or the pre-trained word embeddings, please cite this publication:
Mei Silviana Saputri, Rahmad Mahendra, and Mirna Adriani, "Emotion Classification on Indonesian Twitter Dataset", in Proceedings of the International Conference on Asian Language Processing 2018. 2018.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.