Project Fletcher

Project Summary

Intro

Much has already been made of Russian influence on the 2016 presidential election. One tactic employed was using bots on Twitter to spread divisive rhetoric in an attempt to create conflict in American politics. So for this project, I set out to fight proverbial fire with proverbial fire by developing a way to generate and promote reasonable, effective discussion in American political discourse.

Data

Luckily, FiveThirtyEight recently released a massive corpus of tweets harvested from known Russian troll accounts between February 2012 and May 2018. I used a small fraction of the nearly 3 million available tweets to examine the tactics these troll bots used. For a contrasting data set, I looked to Reddit's /r/NeutralPolitics community -- a forum I have personally frequented and found to be well moderated toward curating civil discussion around politics. Using the Reddit API, I gathered nearly 40,000 comments on the community's top posts of all time. However, due to computational limitations, I ended up using only approximately 15,000 documents (tweets for the trolls, comments for Reddit) in my analysis.
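
A minimal sketch of how that gathering step could look with PRAW, the Python Reddit API wrapper (the credentials and limits shown are placeholders, not the exact values used):

```python
# Hypothetical sketch using PRAW; credentials and limits are placeholders.
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="fletcher-scraper",
)

reddit_comments = []
for submission in reddit.subreddit("NeutralPolitics").top(time_filter="all", limit=100):
    submission.comments.replace_more(limit=0)  # flatten "load more comments" stubs
    for comment in submission.comments.list():
        reddit_comments.append(comment.body)
```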

Analysis + Results

I took a few approaches to examining the collected data. One was to use Latent Dirichlet Allocation (LDA) and look at the resulting topics to get a sense of the structural makeup of these communities. The most interesting thing to note was that the topics for all the data combined were heavily dominated by the troll tweets. Looking at the topics that came out of the Reddit data alone, there were several that would not come up at all in the combined data, including one apparently about healthcare, typified by words like 'insurance' and 'preexisting', and one around the recent Republican tax bill, typified by words like 'debt' and 'taxcuts'. The two groups shared topics about Trump and the Russia investigation (using surprisingly similar language: 'FBI', 'Russia', 'dossier', etc.), so unsurprisingly similar topics appeared in the combined data.

However, most of the other combined topics clearly came from the troll tweets, such as a few typified by words with Cyrillic characters like 'что' ('what'), 'сша' ('USA'), and 'не' ('not'). Some categories were clearly tied to the nature of Twitter itself, typified by tokens like 'RT', 'realdonaldtrump', and 'blacklivesmatter'. The first topic seemingly rooted in the NeutralPolitics data did not show up until the analysis was extended to 15 topics; it was clearly about net neutrality, typified by words like 'internet', 'neutrality', and 'isps'.
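
As a reference for the topic-modeling step, here is a minimal sketch with scikit-learn (vectorizer settings and topic counts are illustrative, not the exact values used):

```python
# Minimal LDA sketch; assumes troll_tweets and reddit_comments are
# lists of raw document strings.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = troll_tweets + reddit_comments

vectorizer = CountVectorizer(stop_words="english", max_features=5000)
doc_term = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=15, random_state=42)
lda.fit(doc_term)

# Show the ten most heavily weighted words in each topic
terms = vectorizer.get_feature_names_out()
for i, weights in enumerate(lda.components_):
    top = [terms[j] for j in weights.argsort()[-10:][::-1]]
    print(f"Topic {i}: {', '.join(top)}")
```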

Additionally, I ran Latent Semantic Analysis (LSA) on the combined data set and, clustering the result with DBSCAN, came up with some pretty clearly defined categories, as shown in the t-SNE visualization below:

[t-SNE visualization of the DBSCAN clusters found in the LSA-reduced data]
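
A minimal sketch of that pipeline with scikit-learn (component counts and DBSCAN parameters are illustrative):

```python
# LSA -> DBSCAN -> t-SNE sketch; reuses `docs` from the LDA sketch above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import DBSCAN
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

X = TfidfVectorizer(stop_words="english", max_features=5000).fit_transform(docs)

# LSA is truncated SVD applied to the tf-idf document-term matrix
X_lsa = TruncatedSVD(n_components=100, random_state=42).fit_transform(X)

# Density-based clustering in the reduced space
labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(X_lsa)

# Project to 2D purely for visualization
coords = TSNE(n_components=2, random_state=42).fit_transform(X_lsa)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=5, cmap="tab20")
plt.show()
```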

However, initial exploration into what exactly was being captured by this clustering didn't reveal any obvious commonalities within clusters, so that is certainly something I would like to investigate further.

Finally, in an attempt to generate some rational discussion, the way the troll bots generated the divisive kind, I put together a simple Bayesian text generator based on the NeutralPolitics data. Although not exactly coherent, I found that it matched the tone of the community fairly well. Here's an example of the kind of output it generated:

“Case doesn’t improve no matter the target. The plea agreement, now please stand quietly for a big payout and I still take issue with it. Before we started privatizing and cutting out the fact Veselnitskaya was a natural monopoly, or information at the same time people if you own the name. Neutrality. Don’t imagine that the two who didn’t do that either. It feels good but this explanation is misleading to correlate the gun by bragging about how there could be a great place, so exactly the sort of speaking straight out of pocket, in Canada.”
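
One simple way to build such a generator is a Markov chain over word bigrams, sampling each next word conditioned on the previous one; the sketch below is one plausible reading of the approach, not necessarily the exact implementation:

```python
# Markov-chain-style text generator: sample each next word from the
# empirical distribution of words that followed the current word.
import random
from collections import defaultdict

def build_chain(documents):
    """Map each word to the list of words observed to follow it."""
    chain = defaultdict(list)
    for doc in documents:
        words = doc.split()
        for prev, nxt in zip(words, words[1:]):
            chain[prev].append(nxt)
    return chain

def generate(chain, seed, length=50):
    """Walk the chain from a seed word, sampling followers at random."""
    word, output = seed, [seed]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:
            break
        word = random.choice(followers)
        output.append(word)
    return " ".join(output)

chain = build_chain(reddit_comments)  # the NeutralPolitics corpus
print(generate(chain, "The"))
```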

Conclusions + Next Steps

All in all, the project was a fascinating dive into NLP techniques and the world of online political discourse. Going forward, I would like to improve the data, the modeling, and the text generation. As mentioned earlier, I downsampled heavily in order to get some of the computationally expensive analysis to run on my 2010 laptop, and there is a whole lot more data from both sources to be taken into account. It seems to me that with NLP, the more data you can throw at a problem, the more likely you are to gain some insight, so this would be a promising avenue. Additionally, I would like to find corresponding communities across platforms -- perhaps I would gain more insight if I could find a reliable source of rational discussion on Twitter and a source of divisive dialogue on Reddit. These certainly exist and would hopefully make the results more robust.

Also, as mentioned above, I would like to look more into the LSA clustering, which seemed to separate the data well but showed no obvious commonalities between groups to my eye. Ideally, I could also route this data into a supervised learning model that could differentiate between divisive and rational discourse and flag each accordingly. For the text generation, I would love to investigate neural networks as an alternate approach that would hopefully produce more coherent text.