I wrote this project during my machine learning course (CS 6350 at the University of Utah) in the fall of 2017.
The original repository containing this code is private, as it also contains solutions to other assignments from the class. I wanted to make this public, so I had to create this new repository for it.
The data comes from the 2011 paper "Seven Months with the Devils: A Long-Term Study of Content Polluters on Twitter" by Lee, Eoff, and Caverlee. It was pre-processed for us by the course staff, but the data is too large to upload (and I'm not sure if there are specific license issues with rehosting their data anyway).
My project performed okay, though not fantastically. There were many techniques which I could have tried to improve performance, and it's quite possible (and perhaps even likely) that plenty of what I did implement was done poorly or incorrectly. I think I spent too much time worrying about making a well-engineered solution because it was kind of fun. Oh well!
This was my first foray into numpy/scipy/scikit-learn, so I probably did not use those libraries to their fullest potential. If I were to do it all again, I would spend more time getting comfortable with them and using them more effectively.
To run the code, simply do

    python3 project.py -h

and you'll get all the relevant information on how to run the thing. Hopefully that's enough!