reddata

I use the Reddit comments dataset to compute related subreddits.

The basic principle of this recommender is "redditors who posted to this subreddit also post to ...". In math terms, I'm just computing the Jaccard index.
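For illustration, here is a minimal Python sketch of the idea. It is not the code from this repository. It assumes the comments are available as `(author, subreddit)` pairs and that the sets being compared are the sets of authors seen in each subreddit; the names `jaccard` and `related` are hypothetical helpers made up for this example.

```python
from collections import defaultdict


def jaccard(a: set, b: set) -> float:
    """Jaccard index |A & B| / |A | B|; returns 0.0 when both sets are empty."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def related(comments, target, top=10):
    """Rank subreddits by the Jaccard index of their author sets against `target`.

    `comments` is an iterable of (author, subreddit) pairs. This input shape is
    an assumption for the sketch, not the format of the real dataset.
    """
    authors = defaultdict(set)
    for author, subreddit in comments:
        authors[subreddit].add(author)

    base = authors.get(target, set())
    scores = [
        (name, jaccard(base, users))
        for name, users in authors.items()
        if name != target
    ]
    # Highest similarity first.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top]
```

Calling `related(comments, "programming")` would return up to ten `(subreddit, similarity)` pairs, most similar first. Comparing every pair of subreddits this way is quadratic, so a real run over the full dataset would likely need something smarter than this sketch.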

The code is not meant to be looked at, but check out these early results:

related to /r/programming

/r/ProgrammerHumor - similarity 0.0583
/r/linux - similarity 0.0560
/r/learnprogramming - similarity 0.0406
/r/webdev - similarity 0.0405
/r/technology - similarity 0.0345
/r/Python - similarity 0.0341
/r/cscareerquestions - similarity 0.0322
/r/javascript - similarity 0.0297
/r/gamedev - similarity 0.0284
/r/compsci - similarity 0.0261

related to /r/gamedev

/r/Unity3D - similarity 0.0734
/r/IndieGaming - similarity 0.0445
/r/Unity2D - similarity 0.0348
/r/programming - similarity 0.0334
/r/gamedesign - similarity 0.0305
/r/gameDevClassifieds - similarity 0.0247
/r/learnprogramming - similarity 0.0232
/r/unrealengine - similarity 0.0209
/r/ProgrammerHumor - similarity 0.0203

related to /r/vim

/r/neovim - similarity 0.0369
/r/commandline - similarity 0.0348
/r/archlinux - similarity 0.0304
/r/unixporn - similarity 0.0252
/r/i3wm - similarity 0.0181
/r/linux - similarity 0.0181
/r/haskell - similarity 0.0176
/r/emacs - similarity 0.0172
/r/Python - similarity 0.0171

related to /r/visualization

/r/tableau - similarity 0.0210
/r/IPython - similarity 0.0177
/r/datasets - similarity 0.0141
/r/rstats - similarity 0.0131
/r/AjaxAmsterdam - similarity 0.0128
/r/dataisugly - similarity 0.0127
/r/Mousesports - similarity 0.0125
/r/punkcirclejerk - similarity 0.0125
/r/le_bald_shitter - similarity 0.0125

related to /r/Seattle

/r/Seahawks - similarity 0.0458
/r/Mariners - similarity 0.0257
/r/Washington - similarity 0.0180
/r/SoundersFC - similarity 0.0167
/r/MLS - similarity 0.0135
/r/udub - similarity 0.0118
/r/Tacoma - similarity 0.0117
/r/politics - similarity 0.0105
/r/bicycling - similarity 0.0102

related to /r/nyc

/r/AskNYC - similarity 0.0776
/r/Brooklyn - similarity 0.0545
/r/NYCbike - similarity 0.0250
/r/newjersey - similarity 0.0185
/r/rangers - similarity 0.0182
/r/NewYorkMets - similarity 0.0166
/r/TrueReddit - similarity 0.0166
/r/longisland - similarity 0.0157
/r/astoria - similarity 0.0140

While the results look very promising for subreddits with fewer than one million subscribers, more popular subreddits unfortunately get their results saturated with other popular subreddits:

related to /r/books

/r/television - similarity 0.0578
/r/explainlikeimfive - similarity 0.0564
/r/movies - similarity 0.0535
/r/news - similarity 0.0525
/r/nottheonion - similarity 0.0499
/r/todayilearned - similarity 0.0487
/r/worldnews - similarity 0.0475
/r/Showerthoughts - similarity 0.0452
/r/gifs - similarity 0.0434

If you have an idea how to fix this, please let me know :)

license

MIT

anvaka/reddata