I use the Reddit comments dataset to compute related subreddits.
The basic principle of this recommender is "redditors who posted to this subreddit also post to ...". In math terms, I'm just computing the Jaccard index of the sets of users who commented in each pair of subreddits.
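To make the idea concrete, here is a minimal sketch of that computation, not the actual code of this project. For two subreddits with commenter sets A and B, the Jaccard index is |A ∩ B| / |A ∪ B|. The function names, the `(author, subreddit)` input shape, and the `top_n` parameter are assumptions made for illustration only.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard index of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def related_subreddits(comments, target, top_n=10):
    """Rank subreddits by how much their commenters overlap with `target`.

    `comments` is assumed to be an iterable of (author, subreddit) pairs;
    the real dataset has many more fields.
    """
    # Collect the set of distinct commenters for every subreddit.
    authors_by_sub = defaultdict(set)
    for author, subreddit in comments:
        authors_by_sub[subreddit].add(author)

    target_authors = authors_by_sub.get(target, set())

    # Score every other subreddit and drop the ones with no shared commenters.
    scores = []
    for subreddit, authors in authors_by_sub.items():
        if subreddit == target:
            continue
        score = jaccard(target_authors, authors)
        if score > 0:
            scores.append((subreddit, score))

    # Highest similarity first; ties are broken by name for stable output.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:top_n]
```

Calling `related_subreddits(comments, "vim")` on such pairs would return a list of `(subreddit, similarity)` tuples, which is the shape of the lists shown in this document.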
The code is not supposed to be looked at, but check out these early results:
related to /r/programming
- /r/ProgrammerHumor - similarity 0.0583
- /r/linux - similarity 0.0560
- /r/learnprogramming - similarity 0.0406
- /r/webdev - similarity 0.0405
- /r/technology - similarity 0.0345
- /r/Python - similarity 0.0341
- /r/cscareerquestions - similarity 0.0322
- /r/javascript - similarity 0.0297
- /r/gamedev - similarity 0.0284
- /r/compsci - similarity 0.0261
related to /r/gamedev
- /r/Unity3D - similarity 0.0734
- /r/IndieGaming - similarity 0.0445
- /r/Unity2D - similarity 0.0348
- /r/programming - similarity 0.0334
- /r/gamedesign - similarity 0.0305
- /r/gameDevClassifieds - similarity 0.0247
- /r/learnprogramming - similarity 0.0232
- /r/unrealengine - similarity 0.0209
- /r/ProgrammerHumor - similarity 0.0203
related to /r/vim
- /r/neovim - similarity 0.0369
- /r/commandline - similarity 0.0348
- /r/archlinux - similarity 0.0304
- /r/unixporn - similarity 0.0252
- /r/i3wm - similarity 0.0181
- /r/linux - similarity 0.0181
- /r/haskell - similarity 0.0176
- /r/emacs - similarity 0.0172
- /r/Python - similarity 0.0171
related to /r/visualization
- /r/tableau - similarity 0.0210
- /r/IPython - similarity 0.0177
- /r/datasets - similarity 0.0141
- /r/rstats - similarity 0.0131
- /r/AjaxAmsterdam - similarity 0.0128
- /r/dataisugly - similarity 0.0127
- /r/Mousesports - similarity 0.0125
- /r/punkcirclejerk - similarity 0.0125
- /r/le_bald_shitter - similarity 0.0125
related to /r/Seattle
- /r/Seahawks - similarity 0.0458
- /r/Mariners - similarity 0.0257
- /r/Washington - similarity 0.0180
- /r/SoundersFC - similarity 0.0167
- /r/MLS - similarity 0.0135
- /r/udub - similarity 0.0118
- /r/Tacoma - similarity 0.0117
- /r/politics - similarity 0.0105
- /r/bicycling - similarity 0.0102
related to /r/nyc
- /r/AskNYC - similarity 0.0776
- /r/Brooklyn - similarity 0.0545
- /r/NYCbike - similarity 0.0250
- /r/newjersey - similarity 0.0185
- /r/rangers - similarity 0.0182
- /r/NewYorkMets - similarity 0.0166
- /r/TrueReddit - similarity 0.0166
- /r/longisland - similarity 0.0157
- /r/astoria - similarity 0.0140
While the results look very promising for subreddits with fewer than one million subscribers, more popular subreddits unfortunately get their results saturated with other popular subreddits:
related to /r/books
- /r/television - similarity 0.0578
- /r/explainlikeimfive - similarity 0.0564
- /r/movies - similarity 0.0535
- /r/news - similarity 0.0525
- /r/nottheonion - similarity 0.0499
- /r/todayilearned - similarity 0.0487
- /r/worldnews - similarity 0.0475
- /r/Showerthoughts - similarity 0.0452
- /r/gifs - similarity 0.0434
If you have an idea how to fix this, please let me know :)
MIT