A contribution to NaNoGenMo 2018

Troll Talk

In the summer of 2018, FiveThirtyEight shared a corpus of nearly three million tweets from accounts linked to Russia's Internet Research Agency. The evidence suggests these tweets were part of a campaign to influence the 2016 US election. What was communicated, and how do we make sense of it?

One possibility is to simulate a conversation among the trolls using a word embedding model and tf-idf transforms.

  • Build a word embedding model of the Russian troll tweets corpus with Gensim's Word2Vec module. The trained model's most_similar function can then generate an analogy for each word in a given text, relative to a pair of pre-selected words (such as liberal and conservative).
  • Transform the corpus of tweets into a tf-idf matrix.
  • Run the following loop until 50,000 words have been printed, beginning with a randomly selected tweet.
    • Print the tweet.
    • Remove the tf-idf vector for the tweet from the matrix (this avoids repetition).
    • Replace each word in the tweet with its analogy, computed from the word pair and the embedding model.
    • Print the modified tweet.
    • Transform the modified tweet into a tf-idf vector using the same vocabulary and weights as the matrix.
    • Select the tweet for which the vector in the matrix is most similar to the vector of the modified tweet (using cosine similarity).
    • Repeat with the selected tweet.
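
The steps above can be sketched in a few dozen lines of standard-library Python. This is a minimal illustration under stated assumptions, not the project's actual code: `ToyVectors`, `TOY_VECTORS`, `TOY_TWEETS`, and every function name here are hypothetical, the embedding is a hand-made stand-in rather than a trained Word2Vec model, and the tf-idf weighting is a simple smoothed variant. `ToyVectors.most_similar` only mimics the `positive`/`negative`/`topn` call shape of Gensim's `most_similar`, so trained vectors (for example `Word2Vec(...).wv`) could plausibly be passed in its place. Output is collected in a list rather than printed so it can be inspected.

```python
import math
import random
from collections import Counter


def tokenize(text):
    return text.lower().split()


def cosine(a, b):
    """Cosine similarity of two equal-length dense vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyVectors:
    """Hypothetical stand-in for trained word vectors, for illustration only."""

    def __init__(self, vectors):
        self.vectors = vectors  # word -> list of floats

    def __contains__(self, word):
        return word in self.vectors

    def most_similar(self, positive, negative, topn=1):
        # Target = sum(positive) - sum(negative); rank all other words by cosine.
        signed = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
        dim = len(next(iter(self.vectors.values())))
        target = [sum(s * self.vectors[w][i] for w, s in signed) for i in range(dim)]
        used = {w for w, _ in signed}
        scored = [(w, cosine(target, v)) for w, v in self.vectors.items() if w not in used]
        return sorted(scored, key=lambda item: item[1], reverse=True)[:topn]


def fit_tfidf(docs):
    """Return the vocabulary and a smoothed idf weight for each vocabulary word."""
    vocab = sorted({w for d in docs for w in tokenize(d)})
    df = Counter(w for d in docs for w in set(tokenize(d)))
    idf = [math.log((1 + len(docs)) / (1 + df[w])) + 1 for w in vocab]
    return vocab, idf


def transform(text, vocab, idf):
    """Map text to a dense tf-idf vector; words outside the vocabulary are ignored."""
    counts = Counter(tokenize(text))
    return [counts[w] * weight for w, weight in zip(vocab, idf)]


def replace_by_analogy(text, wv, pair):
    """Replace each known word w with the answer to 'a is to b as w is to ?'."""
    a, b = pair
    out = []
    for w in tokenize(text):
        if w in wv and w not in pair:  # assumption: unknown and pair words stay as-is
            out.append(wv.most_similar(positive=[b, w], negative=[a], topn=1)[0][0])
        else:
            out.append(w)
    return " ".join(out)


def troll_talk(tweets, wv, pair, max_words, seed=0):
    rng = random.Random(seed)
    vocab, idf = fit_tfidf(tweets)
    remaining = {i: transform(t, vocab, idf) for i, t in enumerate(tweets)}
    current = rng.choice(sorted(remaining))  # start from a random tweet
    lines, words = [], 0
    while words < max_words:
        original = tweets[current]
        del remaining[current]  # drop the used row to avoid repetition
        modified = replace_by_analogy(original, wv, pair)
        lines += [original, modified]
        words += len(original.split()) + len(modified.split())
        if not remaining:  # toy corpora can run out before the word budget
            break
        query = transform(modified, vocab, idf)
        current = max(remaining, key=lambda i: cosine(remaining[i], query))
    return lines


TOY_VECTORS = {
    "liberal": [1.0, 0.0, 0.2], "conservative": [-1.0, 0.0, 0.2],
    "left": [0.9, 0.1, 0.0], "right": [-0.9, 0.1, 0.0],
    "blue": [0.8, 0.0, -0.3], "red": [-0.8, 0.0, -0.3],
}
TOY_TWEETS = ["vote left today", "the blue wave is here",
              "vote right today", "the red wave is here"]

if __name__ == "__main__":
    wv = ToyVectors(TOY_VECTORS)
    for line in troll_talk(TOY_TWEETS, wv, ("liberal", "conservative"), max_words=20):
        print(line)
```

On the full corpus, a sparse tf-idf matrix from a library such as scikit-learn would likely be needed in place of the dense lists used here, since nearly three million tweets would not fit comfortably as dense rows.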