reddit-stance-classifier

A Flask webapp and Python scripts for predicting Reddit users' political leaning from their comment history.

usage

View the live webapp

Alternatively, run one of the classifiers from the command line: start an interactive Python shell in the repository directory, import pred_lean from the prediction module, and call it with a Reddit username.

\reddit-stance-classifier>python
>>> from prediction import pred_lean
>>> pred_lean('userMcUserFace01010101')
('L', 'L', 0.821243598285102, 0.893544755401233)

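The returned tuple has the form (h_stance, v_stance, h_confidence, v_confidence): the predicted leaning on the horizontal and vertical axes of the political compass, followed by the confidence of each prediction.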
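A minimal sketch of how the returned tuple could be given named fields so that calling code does not depend on positional indexes. The LeanPrediction class is a hypothetical helper and is not part of the repository; the values are copied from the example output above rather than produced by a live call to pred_lean.

```python
from typing import NamedTuple


class LeanPrediction(NamedTuple):
    """Hypothetical wrapper; field order mirrors the tuple described above."""
    h_stance: str
    v_stance: str
    h_confidence: float
    v_confidence: float


# Values copied from the example session, not from a live prediction.
raw = ('L', 'L', 0.821243598285102, 0.893544755401233)
pred = LeanPrediction(*raw)

print(f"h axis: {pred.h_stance} (confidence {pred.h_confidence:.2f})")
print(f"v axis: {pred.v_stance} (confidence {pred.v_confidence:.2f})")
```

Because a NamedTuple is still a tuple, code that unpacks the result positionally keeps working.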

example instance of data

"userMcUserFace01010101": {
  "stance": "libleft",
  "subs": {
    "rollercoasters": 1037,
    "CasualUK": 101,
    "PewdiepieSubmissions": 90,
    "polandball": 68,
    "unpopularopinion": 65,
    "todayilearned": 64,
    "LosAngeles": 62,
    "ShitAmericansSay": 53,
    "im14andthisisdeep": 53,
    "london": 32,
    "reclassified": 26,
    "TheRightCantMeme": 25,
    "CringeAnarchy": 22,
    "HongKong": 22,
    "MovieDetails": 22,
    "GenZ": 21
  }
}

In this example, the per-subreddit comment counts under "subs" would be encoded as a sparse array and passed to the model as features. The model, having been trained to predict "stance", would then make a prediction for this new instance.
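The encoding step can be sketched as follows, using only the standard library. This is an illustration under assumptions, not the repository's implementation: the function names build_vocabulary and encode_sparse are hypothetical, the second user is invented for the example, and the sparse form used here (a sorted list of (column index, count) pairs) is just one possible representation. A library such as scikit-learn offers DictVectorizer for the same purpose.

```python
def build_vocabulary(all_subs):
    """Assign a fixed column index to every subreddit seen in the data."""
    names = sorted({sub for subs in all_subs for sub in subs})
    return {name: index for index, name in enumerate(names)}


def encode_sparse(subs, vocab):
    """Return (column index, count) pairs for the subreddits in the vocabulary.

    Subreddits that were not seen when the vocabulary was built are dropped.
    """
    return sorted((vocab[sub], count) for sub, count in subs.items() if sub in vocab)


# A shortened copy of the example user's "subs", plus one invented user.
example_subs = {"rollercoasters": 1037, "CasualUK": 101, "london": 32}
hypothetical_subs = {"london": 5, "todayilearned": 12}

vocab = build_vocabulary([example_subs, hypothetical_subs])
example_row = encode_sparse(example_subs, vocab)
hypothetical_row = encode_sparse(hypothetical_subs, vocab)

print(vocab)
print(example_row)
print(hypothetical_row)
```

Only non-zero counts are stored, which matters because any single user comments in a tiny fraction of all subreddits.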

conclusion

At the time of writing, a precision and recall of ~0.8 can be achieved on the unseen test set. It is important to note, however, that there may be significant selection bias, as all instances of data come from users of r/politicalcompassmemes. It therefore remains to be seen whether this approach to identifying political positions will generalise to the Reddit population as a whole and make sensible predictions.

Due to the significant class imbalance in the training data (far more users lean 'lib' than 'auth' on the v axis), it may be useful to consider alternative metrics such as balanced accuracy (bACC) or predicted positive condition rate (PPCR).
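To illustrate why these metrics help under class imbalance, here is a small sketch with toy labels (not results from this project). Balanced accuracy is the mean of the per-class recalls, and PPCR is the fraction of all instances predicted as the positive class. The function names are chosen for this example only.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so every class counts equally."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[label] / totals[label] for label in totals) / len(totals)


def ppcr(y_pred, positive):
    """Predicted positive condition rate: share of predictions equal to `positive`."""
    return sum(p == positive for p in y_pred) / len(y_pred)


# Toy data: 8 'lib' users, 2 'auth' users, and a model that always predicts 'lib'.
y_true = ['lib'] * 8 + ['auth'] * 2
y_pred = ['lib'] * 10

acc = accuracy(y_true, y_pred)
bacc = balanced_accuracy(y_true, y_pred)
auth_ppcr = ppcr(y_pred, 'auth')

print(acc, bacc, auth_ppcr)
```

In this toy case the always-'lib' model looks reasonable on plain accuracy, while balanced accuracy and a PPCR of zero for 'auth' show that it never identifies the minority class.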