/id-hatespeech-detection

The Dataset for Hate Speech Detection in the Indonesian Language (Bahasa Indonesia)

Dataset
The dataset has two columns, label and tweet, and consists of 713 tweets in Indonesian.
The label is either Non_HS or HS: Non_HS marks a non-hate-speech tweet and HS marks a hate-speech tweet.
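
As a minimal sketch of reading the two-column layout described above, the snippet below parses label/tweet rows with the Python standard library. The tab delimiter, the presence of a header row, and the function name `read_labeled_tweets` are assumptions made for illustration and are not confirmed by this README; adjust them to match the actual file. The sample rows are placeholders, not tweets from the dataset.

```python
import csv
import io
from collections import Counter

VALID_LABELS = {"Non_HS", "HS"}

def read_labeled_tweets(lines, delimiter="\t"):
    """Parse an iterable of text lines into (label, tweet) pairs.

    Assumption: one tweet per line, label first, then the tweet, separated
    by `delimiter`. Rows whose first field is not a known label (for
    example a header row) are skipped.
    """
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    pairs = []
    for row in reader:
        if len(row) < 2 or row[0] not in VALID_LABELS:
            continue
        # Rejoin in case the tweet text itself contains the delimiter.
        pairs.append((row[0], delimiter.join(row[1:])))
    return pairs

# Placeholder rows standing in for the real file contents.
sample = io.StringIO(
    "Label\tTweet\n"
    "Non_HS\tplaceholder tweet one\n"
    "HS\tplaceholder tweet two\n"
    "Non_HS\tplaceholder tweet three\n"
)
pairs = read_labeled_tweets(sample)
label_counts = Counter(label for label, _ in pairs)
print(label_counts)
```

To read from disk instead, pass an open file object (opened with an explicit encoding such as UTF-8) in place of `sample`.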

  • Number of Non_HS tweets: 453
  • Number of HS tweets: 260
    Because the dataset is imbalanced, you may need to oversample the HS class or undersample the Non_HS class to obtain a balanced dataset.
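
The resampling mentioned above can be sketched with plain random oversampling and undersampling. This is one simple approach among several, not the method used by the dataset authors; the function name `rebalance` and its parameters are hypothetical, and the tweets are generated placeholders that only reproduce the 453/260 class counts.

```python
import random
from collections import Counter, defaultdict

def rebalance(pairs, mode="over", seed=0):
    """Return a class-balanced copy of (label, tweet) pairs.

    mode="over":  duplicate random minority rows up to the majority size.
    mode="under": keep a random subset of each class at the minority size.
    """
    if mode not in ("over", "under"):
        raise ValueError("mode must be 'over' or 'under'")
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for label, tweet in pairs:
        by_label[label].append((label, tweet))
    if not by_label:
        return []
    sizes = [len(items) for items in by_label.values()]
    target = max(sizes) if mode == "over" else min(sizes)
    balanced = []
    for items in by_label.values():
        if len(items) >= target:
            # Sample without replacement (keeps every row when sizes match).
            balanced.extend(rng.sample(items, target))
        else:
            # Keep all originals, then draw the extras with replacement.
            balanced.extend(items)
            balanced.extend(rng.choices(items, k=target - len(items)))
    rng.shuffle(balanced)
    return balanced

# Placeholder data with the class counts reported for this dataset.
pairs = [("Non_HS", f"placeholder {i}") for i in range(453)]
pairs += [("HS", f"placeholder {i}") for i in range(260)]

oversampled = rebalance(pairs, mode="over")
undersampled = rebalance(pairs, mode="under")
over_counts = Counter(label for label, _ in oversampled)
under_counts = Counter(label for label, _ in undersampled)
print(over_counts, under_counts)
```

If you evaluate a classifier on a held-out split, it is usually safer to resample only the training portion, so that duplicated tweets do not appear in both training and test data.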

The dataset may be used freely, but if you publish work that uses it, please cite the following paper:

Ika Alfina, Rio Mulia, Mohamad Ivan Fanany, and Yudo Ekanata, "Hate Speech Detection in Indonesian Language: A Dataset and Preliminary Study", in Proceedings of the 9th International Conference on Advanced Computer Science and Information Systems (ICACSIS 2017), 2017.