HatefulUsersTwitter

Code for the paper "Characterizing and Detecting Hateful Users on Twitter"

Primary language: Jupyter Notebook. License: MIT.

Hateful Users on Twitter

This folder contains the data and the analysis done in the paper:

@inproceedings{ribeiro2018characterizing,
title={Characterizing and Detecting Hateful Users on Twitter},
author={Horta Ribeiro, Manoel and Calais, Pedro and 
        Santos, Yuri and Almeida, Virg{\'\i}lio and Meira Jr, Wagner},
booktitle={Proceedings of the International AAAI Conference on Web and Social Media},
year={2018}
}

The experiments with the GraphSage algorithm are in another repository.

The dataset can be downloaded from Kaggle.

Data and Reproducibility

This dataset contains a network of 100k users, of which ~5k were annotated as hateful or not. For each user, several content-related, network-related and activity-related features are provided. Some of the files used are not shared because sharing them would violate Twitter's guidelines.

You can download the following files here:

  • bad_words.txt list of bad words matched in the tweets.
  • lexicon.txt lexicon of words used in the diffusion method.

And the following files on Kaggle:

  • users_(hate|suspended)_(glove|all).content files with the feature vector and class of each user. The files with hate in their name label users as hateful, normal or other, whereas the files with suspended label users as suspended or active. The files with glove contain only the GloVe vectors as features; the files with all also contain attributes related to user activity and network centrality. These files are used only for the GraphSage algorithm.

  • user.edges file with all the (directed) edges in the retweet graph.

  • users_clean.graphml networkx-compatible file with the retweet network. User IDs correspond to those in users_anon_neighborhood.csv.

  • users_anon_neighborhood.csv file with several features for each user, as well as the average of some features over their 1-neighborhood (the people they retweeted). Attributes prefixed with c_ are calculated for the 1-neighborhood of a user in the retweet network (averaged out).
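
To make the c_ attributes more concrete, here is a minimal sketch of how a neighborhood average could be computed from an edge list. It assumes that user.edges holds one whitespace-separated "source target" pair per line and that the 1-neighborhood of a user is the set of its out-neighbors; both are assumptions made for illustration. The function names read_edges and neighborhood_average are hypothetical and not part of this repository.

```python
from collections import defaultdict
from statistics import mean


def read_edges(lines):
    """Parse 'source target' pairs (assumed format) into an out-neighbor map."""
    neighbors = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        src, dst = parts
        neighbors[src].add(dst)
    return neighbors


def neighborhood_average(feature, neighbors):
    """Average a per-user feature over each user's 1-neighborhood.

    Users whose neighbors have no value for the feature are omitted.
    """
    averages = {}
    for user, neigh in neighbors.items():
        values = [feature[n] for n in neigh if n in feature]
        if values:
            averages[user] = mean(values)
    return averages


# Toy data standing in for user.edges and a statuses_count column.
toy_edges = ["1 2", "1 3", "2 3"]
statuses_count = {"1": 10, "2": 20, "3": 40}

neighbors = read_edges(toy_edges)
c_statuses_count = neighborhood_average(statuses_count, neighbors)
print(c_statuses_count)
```

With the real files, the same idea would apply after replacing the toy list with the lines of user.edges and the toy dictionary with a column read from users_anon_neighborhood.csv. The precomputed c_ columns in the CSV remain the authoritative values.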

Attributes description

  hate :("hateful"|"normal"|"other")
  whether the user was annotated as hateful, annotated as normal, or not annotated.
  
  (is_50|is_50_2) :bool
  whether user was deleted up to 12/12/17 or 14/01/18. 
  
  (is_63|is_63_2) :bool
  whether user was suspended up to 12/12/17 or 14/01/18. 
        
  (hate|normal)_neigh :bool
  whether the user is in the neighborhood of a (hateful|normal) user.
  
  [c_] (statuses|follower|followees|favorites)_count :int
  number of (tweets|followers|followees|favorites) a user has.
  
  [c_] listed_count:int
  number of lists a user is in.
    
  [c_] (betweenness|eigenvector|in_degree|outdegree) :float
  centrality measurements for each user in the retweet graph.
  
  [c_] *_empath :float
  occurrences of empath categories in the user's latest 200 tweets.
  
  [c_] *_glove :float          
  GloVe vector calculated for the user's latest 200 tweets.
  
  [c_] (sentiment|subjectivity) :float
  average sentiment and subjectivity of the user's tweets.
  
  [c_] (time_diff|time_diff_median) :float
  average and median time difference between tweets.
  
  [c_] (tweet|retweet|quote) number :float
  percentage of direct tweets, retweets and quotes of a user.
  
  [c_] (number urls|number hashtags|baddies|mentions) :float
  average number of (urls|hashtags|bad words|mentions) per tweet.
  
  [c_] status length :float
  average status length.
  
  hashtags :string
  all hashtags employed by the user separated by spaces.
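
The naming conventions above can be used to select groups of columns programmatically. The following is a minimal sketch using only the standard library; the toy CSV and its column names (other than hate) are hypothetical instances of the patterns listed above, and the helper names group_columns and annotated_rows are illustrative, not part of this repository.

```python
import csv
import io


def group_columns(fieldnames):
    """Group column names by the naming conventions described above."""
    groups = {"neighborhood": [], "glove": [], "empath": [], "other": []}
    for name in fieldnames:
        if name.startswith("c_"):
            groups["neighborhood"].append(name)  # 1-neighborhood averages
        elif name.endswith("_glove"):
            groups["glove"].append(name)
        elif name.endswith("_empath"):
            groups["empath"].append(name)
        else:
            groups["other"].append(name)
    return groups


def annotated_rows(rows):
    """Keep only users annotated as hateful or normal (drop 'other')."""
    return [row for row in rows if row["hate"] in ("hateful", "normal")]


# Toy CSV standing in for users_anon_neighborhood.csv (hypothetical columns).
toy_csv = io.StringIO(
    "user_id,hate,statuses_count,c_statuses_count,0_glove,c_0_glove,anger_empath\n"
    "1,hateful,10,30.0,0.1,0.2,0.0\n"
    "2,other,20,40.0,0.3,0.4,0.5\n"
    "3,normal,40,0.0,0.5,0.6,0.1\n"
)
reader = csv.DictReader(toy_csv)
groups = group_columns(reader.fieldnames)
labeled = annotated_rows(list(reader))
print(groups["neighborhood"], len(labeled))
```

Filtering on the hate column first is a natural starting point for the classification experiments, since only the annotated users carry a hateful or normal label.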

Folder Structure

These are the main folders, reproducible with the dataset downloaded from Kaggle:

  • ./analysis/ contains the script exploring the dataset collected.

  • ./classification/ contains scripts with the boosting classifier.

These folders are not reproducible, but they are included for completeness:

  • ./crawler/ contains the code used to extract the dataset. You need to set up neo4j to run it.

  • ./prepreprocessing/ contains scripts to select the users to be annotated, and extract their tweets.

  • ./features/ contains scripts to get the features to be analyzed and that will be fed into the classifier.

Auxiliary folders:

  • ./data/ data generated by data wrangling.

  • ./secrets/ for the API/DB authentication credentials.

  • ./tmp/ auxiliary scripts.

  • ./img/ images generated by analyses.