The Dataset for Hate Speech Detection in Indonesian

(Dataset untuk Deteksi Ujaran Kebencian dalam Bahasa Indonesia)

Dataset
The dataset is a two columns data of: label - tweet, consist of 713 tweets in Indonesian.
The label is Non_HS or HS. Non_HS for "non-hate-speech" tweet and HS for "hate-speech" tweet.

Number of Non_HS tweets: 453
Number of HS tweets: 260
Since this dataset is unbalanced, you might have to do over-sampling/down-sampling in order to create a balanced dataset.

The dataset may be used freely, but if you want to publish paper/publication using the dataset, please cite this publication:

Ika Alfina, Rio Mulia, Mohamad Ivan Fanany, and Yudo Ekanata, "Hate Speech Detection in Indonesian Language: A Dataset and Preliminary Study ", in Proceeding of 9th International Conference on Advanced Computer Science and Information Systems 2017(ICACSIS 2017).

marcianorama/id-hatespeech-detection

The Dataset for Hate Speech Detection in Indonesian