/reddit-analyzed

Finding similar subreddits using MinHash and SimRank algorithms

Primary LanguagePython

Reddit Analyzed

Find similar subreddits using MinHash and SimRank algorithms.

Done as term project for CS425 Web-Scale Data course.

Results

I implemented a graph visualization tool for results, using Cytoscape.js library.

You can check it out here.


Some examples:

Similar subreddits to Python Similar subs to r/Python


Similar subreddits to DotA2 Similar subs to r/DotA2


Similar subreddits to canada Similar subs to r/canada


Similar subreddits to soccer Similar subs to r/soccer


Algorithms Used

MinHash

I used MinHash algorithm to find number of common users of each subreddit.

Jaccard Similarity can be calculated using MinHash and it gives accurate and fast results. Number of common users can be calculated using Jaccard Similarity and number of users of each subreddit.

This was needed to preprocess user information which was the bottleneck of SimRank algorithm (~3M users) and ran SimRank on subreddits only.

SimRank

SimRank algorithm is an extension of PageRank algorithm to find similarity in a graph.

Random walk with restarts approach gives pretty accurate similarity results.

After processing the dataset using MinHash, I ran SimRank on a graph where nodes represent subreddits and weighted edges represent number of common users if exists, like below.

Sample graph

Collecting Data

Ran crawler.py on Heroku for 3 weeks and written the data in an Amazon RDS database.

Dataset contains,

  • 8300 different subreddits
  • 3 million unique active users
  • 10 million comments

Used PRAW library, a Python Reddit API wrapper.