


Reddit Clusters

Understanding hateful subreddits through text clustering.

Directory Structure

├── clustering
│   ├── results
│   │   ├── CringeAnarchy.txt
│   │   ├── The_Donald.txt
│   │   └── TheRedPill.txt
│   ├── run_nmfs.sh
│   └── tfidf_nmf.py
├── data
│   ├── bigquery
│   │   └── 2017
│   │       └── 11-12
│   │           └── download_data.sh
│   └── stoplist.txt
├── README.md
└── wordclouds
    ├── blackdisc.jpeg
    ├── images
    │   ├── CringeAnarchy
    │   │   ├── 0.06%.png
    │   │   ├── 0.12%.png
    │   │   ├── ...
│   │   └── ...
    │   ├── The_Donald
    │   │   ├── 0.36%.png
    │   │   ├── 0.51%.png
    │   │   ├── ...
│   │   └── ...
    │   └── TheRedPill
    │       ├── 0.56%.png
    │       ├── 0.91%.png
    │       ├── ...
    │       └── ...
    ├── make_wordclouds.ipynb
    └── OCR-A-Std-Regular.ttf

The clustering directory contains all the code used to vectorize and cluster the subreddits. tfidf_nmf.py is the main program, and run_nmfs.sh is simply a driver script for it. The results subdirectory contains the log files produced by running tfidf_nmf.py on /r/TheRedPill, /r/The_Donald, and /r/CringeAnarchy.

The data directory contains download_data.sh, which downloads the Reddit data from my Google Cloud Storage, and stoplist.txt, a stoplist that includes Reddit-specific words such as "moderator" and "karma".
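As a rough illustration of how a stoplist like this can be applied, the sketch below filters stop words out of a comment. It assumes one word per line, which may not match the actual layout of stoplist.txt, and the `load_stoplist` and `remove_stopwords` helpers are hypothetical names rather than functions from this repository.

```python
import io


def load_stoplist(handle):
    """Read one stop word per line (assumed format), skipping blank lines."""
    return {line.strip().lower() for line in handle if line.strip()}


def remove_stopwords(text, stoplist):
    """Lowercase, split on whitespace, and drop any token in the stoplist."""
    return [tok for tok in text.lower().split() if tok not in stoplist]


# In-memory stand-in for data/stoplist.txt so the example runs anywhere.
sample_file = io.StringIO("moderator\nkarma\n")
stoplist = load_stoplist(sample_file)

tokens = remove_stopwords("The moderator removed my karma post", stoplist)
print(tokens)  # ['the', 'removed', 'my', 'post']
```

To use the real file, pass `open("data/stoplist.txt")` instead of the `io.StringIO` object, adjusting the parser if the file is not one word per line.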

The wordclouds directory contains make_wordclouds.ipynb, which generates the wordclouds, and images, which holds the resulting PNG files. Note that 1) blackdisc.jpeg and OCR-A-Std-Regular.ttf are just helper files used to create the wordclouds, and 2) each PNG file is named after its cluster's importance (e.g. 0.50%.png is the wordcloud for a cluster with an importance of 0.50%).
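Because the importance is encoded in the file name, it can be recovered without opening the image. The sketch below parses names of the form shown in the directory tree and sorts them by importance. The `importance_from_filename` helper is a hypothetical name and not part of make_wordclouds.ipynb.

```python
from pathlib import Path

SUFFIX = "%.png"


def importance_from_filename(path):
    """Return the cluster importance (in percent) encoded in a wordcloud file name."""
    name = Path(path).name
    if not name.endswith(SUFFIX):
        raise ValueError(f"unexpected wordcloud file name: {name!r}")
    return float(name[: -len(SUFFIX)])


# File names taken from the directory tree above.
files = [
    "wordclouds/images/TheRedPill/0.56%.png",
    "wordclouds/images/TheRedPill/0.91%.png",
]

# Most important cluster first.
ranked = sorted(files, key=importance_from_filename, reverse=True)
for f in ranked:
    print(f"{importance_from_filename(f):.2f}%  {f}")
```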