Supplementary data for the paper the Dank or Not? -- Analyzing and Predicting Popularity of Memes on Reddit - Barnes et al. (2020)
@article{barnes2021dank,
title={Dank or not? Analyzing and predicting the popularity of memes on Reddit},
author={Barnes, Kate and Riesenmy, Tiernon and Trinh, Minh Duc and Lleshi, Eli and Balogh, N{\'o}ra and Molontay, Roland},
journal={Applied Network Science},
volume={6},
number={1},
pages={1--24},
year={2021},
publisher={SpringerOpen}
}
The data is scraped using a Pushshift API, and it consists of all posts starting from March 17th, 2020 to March 23rd, 2020 across these Reddit subreddits:
- MemeEconomy,
- dankmeme,
- memes,
- me_irl,
- dank_meme (this subreddit has gone private)
After cleaning, we end up with 80,362 posts that are used as training/testing data for our machine learning models. This spreadsheet contains the posts with their metadata as well as the generated attributes we used in the paper.
Feature | Type | Description |
---|---|---|
created_utc | UTC timestamp | time of post submission |
ups | integer | number of upvotes received |
is_nsfw | boolean | indicates if only suitable for 18+ |
subreddit | string | subreddit of the submission |
subscribers | integer | number of subscribers to the subreddit |
thumbnail.height | floating point value | height of the thumbnail |
thumbnail.thumbnail | string | thumbnail media |
thumbnail.width | floating point value | width of thumbnail |
title | string | title of the submission |
media | string | link to associated meme media |
ups_normed | floating point value | ups normalized with subscribers |
dank_level | integer | label ups_normed for binary classification |
processed_words | list of strings | filtered and stemmed words from title and image |
word_count | integer | number of words in title and image |
TextLength | integer | number of characters in title |
Sentiment | floating point value | text valence score |
avg_hue | floating point value | average HSV hue value of meme |
avg_saturation | floating point value | average HSV saturation value of meme |
avg_value | floating point value | average HSV value value of meme |
30 colors | floating point value | normalized pixels of color in image |
VGG_features | list of strings | VGG-16's first three guesses about image content |
VGG_probs | list of floating point values | the probabilities of the VGG-16's first three guesses |
For the Convoluted Neural Network section of the paper, we download and sample from the 76,000+ downloadable images provided in the media field of the posts. The raw image files can be found in the data folder.
dank | not_dank | Total | |
---|---|---|---|
Training set | 1,856 | 1,856 | 3,712 |
Validation set | 928 | 928 | 1,856 |
Test set | 929 | 929 | 1,858 |
Total | 3,713 | 3,713 | 7,426 |
The notebook for the Random Forest model along with some supplementary analysis for the model can be found in the notebook folder.
A more detailed description of the dataset and how we generate our attributes can be found in Dank or Not? -- Analyzing and Predicting Popularity of Memes on Reddit
The notebooks for the image analysis and features extraction can be found in the notebook folder.
Final plots and a more detailed description can be found in the Image Analysis section of Dank or Not? -- Analyzing and Predicting Popularity of Memes on Reddit