FTR-dataset

Dataset of tweets in French annotated for racist speech

In total, we collected 2856 tweets: 1929 non-racist tweets (68%) and 927 racist tweets (32%). The average number of words per tweet (before cleaning) is 23.45, and the average number of characters is 125.15.
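As a quick sanity check, the class shares can be recomputed from the counts reported above. This is a minimal sketch that uses only the numbers from this README; the dictionary keys are illustrative names, not identifiers from the dataset files.

```python
# Class counts as reported in this README.
counts = {"non-racist": 1929, "racist": 927}

total = sum(counts.values())

# Percentage share of each class, rounded to the nearest whole percent.
shares = {name: round(100 * n / total) for name, n in counts.items()}

print(total, shares)
```

The rounded shares should match the 68% / 32% split quoted above.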

The data were obtained by archiving a real-time Twitter stream, with the language set to French during streaming. Label 0 denotes a non-racist tweet and label 1 a racist tweet.
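The label convention can be enforced when reading the data. The sketch below assumes a CSV layout with `text` and `label` columns; these column names, the helper functions, and the placeholder rows are hypothetical and are not taken from the dataset files, so adjust them to the actual file format.

```python
import csv
import io

# Placeholder content standing in for the real file; the column names
# "text" and "label" are assumptions about the layout.
SAMPLE = "text,label\nplaceholder tweet one,0\nplaceholder tweet two here,1\n"

# Label convention stated in this README.
LABEL_NAMES = {0: "non-racist", 1: "racist"}


def read_rows(handle):
    """Parse rows into (text, label) pairs, rejecting labels other than 0 or 1."""
    rows = []
    for record in csv.DictReader(handle):
        label = int(record["label"])
        if label not in LABEL_NAMES:
            raise ValueError(f"unexpected label: {label}")
        rows.append((record["text"], label))
    return rows


def average_lengths(rows):
    """Return (mean words per tweet, mean characters per tweet)."""
    n = len(rows)
    words = sum(len(text.split()) for text, _ in rows) / n
    chars = sum(len(text) for text, _ in rows) / n
    return words, chars


rows = read_rows(io.StringIO(SAMPLE))
print(average_lengths(rows))
```

To use it on a real file, replace `io.StringIO(SAMPLE)` with an open file handle, for example `open(path, newline="", encoding="utf-8")`. Splitting on whitespace is only one possible definition of a word, so the averages may differ slightly from the reported 23.45 and 125.15.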

The FTR dataset was annotated by two native French speakers; the kappa agreement coefficient between them was 0.66. In case of disagreement, a third annotator assigned the final label.
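The annotation procedure can be illustrated with a short sketch. This README does not say which kappa variant was used; the code assumes Cohen's kappa, the usual choice for two annotators. The function names and the toy labels are hypothetical and serve only as an illustration.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two equally long label sequences.

    Undefined (division by zero) when expected agreement equals 1.
    """
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement from each annotator's label frequencies.
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[k] * freq_b[k] for k in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)


def final_labels(a, b, tie_breaker):
    """Keep agreed labels; on disagreement, take the third annotator's label."""
    return [x if x == y else z for x, y, z in zip(a, b, tie_breaker)]


# Toy labels for illustration only, not taken from the dataset.
ann_a = [0, 0, 1, 1, 0, 1, 0, 0]
ann_b = [0, 0, 1, 0, 0, 1, 1, 0]
ann_c = [0, 0, 1, 1, 0, 1, 0, 0]

print(cohen_kappa(ann_a, ann_b))
print(final_labels(ann_a, ann_b, ann_c))
```

On the toy labels the two annotators agree on 6 of 8 items, and the third annotator's label is used only for the two items where they differ.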

If you use this dataset, please cite our paper:

Detection of Racist Language in French Tweets. Natalia Vanetik and Elisheva Mimoun. Information Journal, to be published in 2022.
