
How to Spot a (Russian) Troll - Classifying Troll Tweets vs Authentic Tweets


How to Spot a (Russian) Troll

  • Student name: James M. Irving, Ph.D.
  • Student pace: Full Time
  • Instructor name: Brandon Lewis, Jeff Herman

Goal:

To create a deep-learning Natural Language Processing neural network that can analyze the language content of a Tweet to determine if the tweet was written by an authentic user or a Russian Troll.

Methods Overview:

Analysis Framework

  • We will be using the OSEMN framework (outlined in brief below)
    • Obtain
      • Extract new Twitter control tweets.
      • Merge with Kaggle/FiveThirtyEight dataset
    • Scrub
      • Remove hashtags, mentions, and misleading common terms (discovered via feature importance in the Decision Tree Classifier).
    • Explore
      • WordClouds for tokenized, stop-word removed text.
        • Also for hashtags, mentions
        • Bigrams & Pointwise Mutual Information Score
        • Sentiment Analysis
    • Model
      • Logistic Regression
      • Decision Tree Classifier
      • Random Forests Classifier
      • Artificial Neural Networks (3 vers)
    • Interpret
      • Summary
      • Future Directions

Data Sources

Russian Troll Tweets

We started with a dataset of 3 million tweets sent from 2,752 Twitter accounts connected to the "Internet Research Agency," a Russian troll farm that was part of the Mueller investigation's February 2018 indictments. The tweets range from February 2012 to May 2018, with the majority published from 2015 to 2017.

  • The tweets were published on Kaggle by FiveThirtyEight and were originally collected by Clemson University researchers Dr. Darren Linvill and Dr. Patrick Warren, using custom searches in Social-Studio software licensed by Clemson University.
  • Their analysis of the various tactics used and the various social groups targeted by the trolls is detailed in their manuscript "Troll Factories: The Internet Research Agency and State-Sponsored Agenda Building," published in July 2018.

Appropriate Control Tweets

However, since the goal is to produce a machine learning model that can accurately classify if a tweet came from an authentic user or a Russian troll, we needed to acquire a set of control tweets from non-trolls.

  • We used Tweepy to extract a set of control tweets from the Twitter API.
    • Our goal was to extract tweets from the same time period as the troll tweet dataset, matching the top 20 hashtags (#) and mentions (@).
    • However, due to limitations of the Twitter API (including the inability to search for specific date ranges), we extracted present-day tweets directed at the top 40 most frequent mentions (@s) from the troll tweets.
    • The top 20 hashtags contained many generic topics (#news, #sports) that would not be proper controls for the language content of the troll tweets.

  • Our newly extracted control dataset is comprised of:
    • 39,086 tweets ranging from 05/24/19 to 06/03/19.

  • We do not have equal numbers of troll tweets and new control tweets.
  • We will resample from the troll tweets to match the number of new controls.

Original Troll Dataset Features & Observations

For our analyses, we will focus strictly on the language of the tweets, and on none of the other characteristics in the dataset.

Observations on the Troll Tweet Dataset

  • Dataset is comprised of 2,973,371 tweets.
  • The target tweet text to analyze is in the content column.
  • Thoughts on specific features:
    • language
      • There are 56 unique languages.
      • 2.1 million are English (71.6%), 670 K are in Russian, etc.
      • Drop all non-English tweets.
    • retweet
      • 1.3 million entries are retweets (44.1%).
      • Since this analysis will use language content to predict the author, retweets are not helpful.
      • Retweets were not written by the account's author and should not be considered.
      • Drop all retweets
  • Final Troll Tweet Summary:
    • After dropping non-English tweets and retweets, there are 1,272,848 Russian-Troll tweets.
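
The two filtering rules above (English only, no retweets) can be sketched in plain Python. This is a minimal illustration rather than the notebook's actual pandas code: the toy records and the helper name `keep_original_english` are hypothetical, and only the `language` and `retweet` column names come from the dataset.

```python
# Toy records using column names from the troll dataset; the values are made up.
tweets = [
    {"content": "Breaking news tonight", "language": "English", "retweet": 0},
    {"content": "Somebody else's words", "language": "English", "retweet": 1},
    {"content": "Novosti dnya", "language": "Russian", "retweet": 0},
]

def keep_original_english(records):
    """Keep only English tweets that are not retweets (hypothetical helper)."""
    return [r for r in records if r["language"] == "English" and r["retweet"] == 0]

kept = keep_original_english(tweets)
print(len(kept), "of", len(tweets), "tweets kept")
```

In pandas the same idea would typically be expressed as a boolean mask over the two columns.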

OBTAIN:

Control Tweet Extract with TwitterAPI

  • See Notebook "student_JMI_twitter_extraction.ipynb" for the extraction of control tweets to match the Troll tweets.

Combining & Sampling Original Troll Tweets with Newly-Harvested Control Tweets

external_author_id author content region language publish_date following followers updates post_type account_type retweet account_category troll_tweet
date_published
2017-10-01 19:58:00 9.060000e+17 10_GOP "We have a sitting Democrat US Senator on tria... Unknown English 10/1/2017 19:58 1052 9636 253 NaN Right 0 RightTroll 1
2017-10-01 22:43:00 9.060000e+17 10_GOP Marshawn Lynch arrives to game in anti-Trump s... Unknown English 10/1/2017 22:43 1054 9637 254 NaN Right 0 RightTroll 1
2017-10-01 23:52:00 9.060000e+17 10_GOP JUST IN: President Trump dedicates Presidents ... Unknown English 10/1/2017 23:52 1062 9642 256 NaN Right 0 RightTroll 1
There were 832208 unique hashtags and 673442 unique @'s
tags % Total
#news 118624 14.254129
#sports 45544 5.472670
#politics 37452 4.500317
#world 27077 3.253634
#local 23130 2.779353
#TopNews 14621 1.756893
#health 10328 1.241036
#business 9558 1.148511
#BlackLivesMatter 8252 0.991579
#tech 7836 0.941592
ats % Total
@midnight 6691 0.993553
@realDonaldTrump 3532 0.524470
@WarfareWW 1529 0.227043
@CNN 1471 0.218430
@HillaryClinton 1424 0.211451
@POTUS 1035 0.153688
@CNNPolitics 948 0.140769
@FoxNews 930 0.138097
@mashable 740 0.109883
@YouTube 680 0.100974

SCRUB:

Text Processing: tokenizing, removing stopwords, urls, hashtags

Using RegularExpressions to extract and replace Urls, Hashtags, and Mentions

  • URLs, hashtags, and mentions were already removed.
    • hashtags and mentions are in content_hashtags,content_mentions
  • Cleaned data columns are:
    • content_min_clean: only RT headers and urls have been removed.
    • content: RT headers, urls, hashtags, mentions were all removed.

ADDRESSING THE "RT" ISSUE

  • Summary: many messages in both the troll tweets and new control tweets are prepended with "RT".
    • Some are RT @handle: others are just RT handle:, but SOME are using RT internally as an abbreviation for the idea of a retweet, not as an indicator that the current message is a retweet.

  • Reference Article from 2014 re: the usage of RT and more modern methods of retweeting.

  • According to this article, using "RT @Username:" is an anachronism that has been replaced by an automated retweet function, but many die-hard old-school Twitter users still prefer to do it the manual way.

  • Therefore, filtering out tweets marked as officially retweeted by the Twitter API, while keeping tweets that were manually retweeted using "RT @handle:", can produce a sort of sampling error.

    • That being said, if someone is taking the effort to manually type the origin and quote, that level of effort, in my opinion, still is a reasonable metric to use to separate out these tweets from the auto-retweeted tweets.
  • HOWEVER: there is still the issue of what effect this has on the dataset.

  • My proposed solution:

    1. take current df['content'] column and change it to df['content_raw'] (so that it is no longer used by the following code)
    2. Create a new df['content_min_clean'] column that uses regexp to remove all RT @handle: and RT handle: headers and all URL links from the df['content_raw'] column, but keeps mentions and hashtags.
      • Use for some vectorized analyses
    3. Create new hashtags and mentions columns that find and save any handles and hashtags from anywhere in the NEW RT @mention:-removed content.
      • This means that any handles that identify the original source and are prepended to the message will NOT be included, but any other mentions WILL be included.
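
The proposed steps can be sketched with the standard `re` module. The regular expressions and the `clean_tweet` helper below are assumptions made for illustration (the notebook's actual expressions are not shown here); only the output column names follow the ones listed above.

```python
import re

# Assumed patterns, written for illustration only.
RT_HEADER = re.compile(r"^RT\s+@?\w+:\s*")   # "RT @handle: " or "RT handle: " at the start
URL = re.compile(r"https?://\S+")
HASHTAG = re.compile(r"#\w+")
MENTION = re.compile(r"@\w+")

def clean_tweet(raw):
    """Return minimally cleaned text, fully cleaned text, hashtags, and mentions."""
    # Step 2: strip the RT header and URLs, keep hashtags and mentions.
    min_clean = URL.sub("", RT_HEADER.sub("", raw)).strip()
    # Step 3: collect hashtags and mentions from the header-free text.
    hashtags = HASHTAG.findall(min_clean)
    mentions = MENTION.findall(min_clean)
    # Fully cleaned text: hashtags and mentions removed, whitespace collapsed.
    full_clean = MENTION.sub("", HASHTAG.sub("", min_clean))
    full_clean = re.sub(r"\s+", " ", full_clean).strip()
    return {
        "content_min_clean": min_clean,
        "content": full_clean,
        "content_hashtags": hashtags,
        "content_mentions": mentions,
    }

example = clean_tweet("RT @CNN: Storm hits #Texas says @POTUS https://t.co/abc123")
print(example)
```

Because the header is removed before mentions are collected, the source handle (`@CNN` in this made-up tweet) is not saved, while `@POTUS` is. An "RT" used inside a message is left untouched because the header pattern is anchored to the start of the text.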

Resampling Troll Tweets to Match Number of Control Tweets

  • Since there are many fewer new tweets, we will sample the same # from the larger Troll tweet collection.
  • An issue to be reconsidered in future analyses is how to resample in a way that ensures that chosen troll tweets will be as close to the control tweets as the dataset allows.
    • In other words, making sure that if a term appears in the new control tweets, that we purposefully include matching tweets in our resampled troll tweets.
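
A minimal sketch of the simple random undersampling described above, using only the standard library. The `undersample` helper, the seed, and the stand-in ID lists are hypothetical; the notebook itself presumably samples rows from a dataframe.

```python
import random

def undersample(majority, minority, seed=42):
    """Randomly draw as many majority items as there are minority items."""
    rng = random.Random(seed)          # fixed seed so the draw is repeatable
    return rng.sample(majority, k=len(minority))

troll_ids = list(range(1000))    # stand-in for the large troll tweet set
control_ids = list(range(30))    # stand-in for the smaller control set
sampled = undersample(troll_ids, control_ids)
print(len(sampled), "troll tweets sampled to match", len(control_ids), "controls")
```

This draws uniformly at random, so it does nothing to match vocabulary between the two groups; the term-matched resampling mentioned above would need an additional selection step.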

Revisiting Data to be Included/Excluded from the Analysis.

  • This may or may not be necessary, so saving it as markdown for now until revisiting the word frequency results.
## TO CHECK FOR STRINGS IN TWO DATAFRAMES:
def check_dfs_for_exp_list(df_controls, df_trolls, list_of_exp_to_check):
    import bs_ds as bs
    list_of_results=[['Term','Control Tweets','Troll Tweets']]
    for exp in list_of_exp_to_check:
        num_control = len(df_controls.loc[df_controls['content_min_clean'].str.contains(exp)])
        num_troll = len(df_trolls.loc[df_trolls['content_min_clean'].str.contains(exp)])
        list_of_results.append([exp,num_control,num_troll])
    df_results = bs.list2df(list_of_results, index_col='Term')
    return df_results

## TO CHECK FOR STRINGS IN TWO GROUPS FROM ONE DATAFRAME 
def check_df_groups_for_exp(df_full, list_of_exp_to_check, check_col='content_min_clean', groupby_col='troll_tweet', group_dict={0:'Control',1:'Troll'}):      
    """Checks `check_col` column of input dataframe for expressions in list_of_exp_to_check and 
    counts the # present for each group, defined by the groupby_col and groupdict. 
    Returns a dataframe of counts."""
    import bs_ds as bs
    list_of_results = []      

    # Header row: 'Term' followed by one column per group label
    header_list = ['Term'] + list(group_dict.values())
    list_of_results.append(header_list)
    
    for exp in list_of_exp_to_check:
        curr_exp_list = [exp]
        
        for k,v in group_dict.items():
            df_group = df_full.groupby(groupby_col).get_group(k)
            curr_group_count = len(df_group.loc[df_group[check_col].str.contains(exp)])
            curr_exp_list.append(curr_group_count)
        
        list_of_results.append(curr_exp_list)
        
    df_results = bs.list2df(list_of_results, index_col='Term')
    return df_results
## CHECKING WORD OCCURRENCES
# Important Features from Decision Tree Classification: verify if they are present in Troll and Control Tweets
list_of_exp_to_check = ['[Pp]eggy','[Nn]oonan','[Mm]exico','nasty','impeachment','[mM]ueller']
df_compare = check_df_groups_for_exp(df_full, list_of_exp_to_check)
The troll_tweet classes are imbalanced.
There are 1272847 troll tweets and 39086 control tweets
## REMOVE MISLEADING FREQUENT TERMS
import numpy as np
import bs_ds as bs
# Removing Peggy Noonan since she was one of the most important words and there's a recent news event about her
list_to_remove = ['[Pp]eggy','[Nn]oonan']

for exp in list_to_remove:
    # Use a single .loc call to avoid chained assignment, which may silently fail to modify df_full
    mask = df_full['content_min_clean'].str.contains(exp)
    df_full.loc[mask, 'content'] = np.nan

# Drop every row whose content was blanked out above
df_full.dropna(subset=['content'], inplace=True)

print("New Number of Control Tweets=",len(df_full.loc[df_full['troll_tweet']==0]))
print("New Number of Troll Tweets=",len(df_full.loc[df_full['troll_tweet']==1]))
# Re-check for list of expressions
df_compare = check_df_groups_for_exp(df_full, list_of_exp_to_check)
df_compare.style.set_caption('Full Dataset Expressions')
New Number of Control Tweets= 38094
New Number of Troll Tweets= 1272760

EXPLORE

Generating Frequency Distributions

import nltk
import string
from nltk import word_tokenize
def get_group_texts_tokens(df_small, groupby_col='troll_tweet', group_dict={0:'controls',1:'trolls'}, column='content_stopped'):
    from nltk import regexp_tokenize
    pattern = "([a-zA-Z]+(?:'[a-z]+)?)"
    text_dict = {}
    for k,v in group_dict.items():
        group_text_temp = df_small.groupby(groupby_col).get_group(k)[column]
        group_text_temp = ' '.join(group_text_temp)
        group_tokens = regexp_tokenize(group_text_temp, pattern)
        text_dict[v] = {}
        text_dict[v]['tokens'] = group_tokens
        text_dict[v]['text'] =  ' '.join(group_tokens)
            
    print(f"{text_dict.keys()}:['tokens']|['text']")
    return text_dict
# Function will return a dictionary of all of the text and tokens split by group
TEXT = get_group_texts_tokens(df_small,groupby_col='troll_tweet', group_dict={0:'controls',1:'trolls'}, column='content_stopped')

# TEXT[Group][Text-or-Tokens]
TEXT['trolls']['tokens'][:10]
dict_keys(['controls', 'trolls']):['tokens']|['text']





['building',
 'collapses',
 'mexico',
 'city',
 'following',
 'magnitude',
 'earthquake',
 'current',
 'scene',
 'mexico']
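
To see what the tokenizing pattern keeps and discards, it can be tried with the standard `re` module. This is only an illustration: it assumes `regexp_tokenize` behaves like `re.findall` for this pattern, and the sample sentence is made up rather than taken from the dataset.

```python
import re

# Same pattern string as in get_group_texts_tokens above.
pattern = "([a-zA-Z]+(?:'[a-z]+)?)"

# Hypothetical sample sentence.
sample = "Mexico City's earthquake: magnitude 7.1!"
tokens = re.findall(pattern, sample)
print(tokens)
```

Runs of letters are kept, an apostrophe suffix such as `'s` stays attached to its word, and digits and punctuation are dropped.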
from nltk import FreqDist

TEXT = get_group_texts_tokens(df_small)

freq_trolls = FreqDist(TEXT['trolls']['tokens'])
freq_controls = FreqDist(TEXT['controls']['tokens'])

df_compare=pd.DataFrame()
df_compare['Troll Words'] = freq_trolls.most_common(25)
df_compare['Control Words'] = freq_controls.most_common(25)
display(df_compare)

# print(freq_controls.most_common(50))
dict_keys(['controls', 'trolls']):['tokens']|['text']
Troll Words Control Words
0 (mueller, 3919) (trump, 4053)
1 (trump, 3893) (want, 2851)
2 (mexico, 3108) (mexico, 2843)
3 (new, 2091) (years, 2828)
4 (police, 1210) (big, 2598)
5 (breaking, 1200) (talk, 2494)
6 (workout, 1127) (border, 2440)
7 (man, 1122) (problem, 2365)
8 (obama, 889) (president, 2360)
9 (says, 751) (people, 2235)
10 (state, 738) (talking, 2219)
11 (president, 718) (sending, 2031)
12 (video, 715) (delegation, 2015)
13 (us, 679) (like, 1943)
14 (people, 669) (news, 1542)
15 (one, 661) (know, 1538)
16 (clinton, 628) (us, 1487)
17 (hillary, 592) (never, 1375)
18 (russia, 567) (one, 1372)
19 (get, 547) (nasty, 1326)
20 (killed, 524) (would, 1198)
21 (robert, 524) (today, 1148)
22 (look, 514) (get, 1033)
23 (congress, 508) (need, 1019)
24 (year, 496) (said, 1004)

png

png


Generating WordClouds

png

Wordclouds for Hashtags

png

Wordclouds for @'s

png

Creating Bigrams and PMI scores

Bigrams

Control Tweet Bigrams
Bigram Frequency
('border', 'problem') 0.00571695
('big', 'delegation') 0.00569424
('talking', 'years') 0.0056914
('sending', 'big') 0.00568572
('talk', 'border') 0.00568572
('years', 'want') 0.00568572
('delegation', 'talk') 0.00568288
('mexico', 'sending') 0.00568288
('problem', 'talking') 0.00568288
('fake', 'news') 0.00184509
('meghan', 'markle') 0.00183658
('markle', 'nasty') 0.00167761
('never', 'called') 0.00156975
('called', 'meghan') 0.00155555
('want', 'mexico') 0.00131143
('news', 'media') 0.00120641
('got', 'caught') 0.00118086
('caught', 'cold') 0.00117518
('made', 'fake') 0.00116383
('president', 'trump') 0.00116099
('media', 'got') 0.00115815
('nasty', 'made') 0.00115815
('eyes', 'ears') 0.00112693
('pres', 'trump') 0.00111841
('evidence', 'eyes') 0.00111557

Troll Tweet Bigrams
Bigram Frequency
('new', 'mexico') 0.00338416
('robert', 'mueller') 0.00168116
('witch', 'hunt') 0.00112805
('crooked', 'mueller') 0.000967941
('donald', 'trump') 0.000949747
('year', 'old') 0.000949747
('president', 'trump') 0.000822386
('ignores', 'mueller') 0.000785998
('grand', 'jury') 0.000775081
('coup', 'using') 0.000753248
('approval', 'tanks') 0.000724137
('tanks', 'gt') 0.000724137
('mueller', 'credibility') 0.000698665
('rosenstein', 'mueller') 0.000698665
('deep', 'state') 0.000651359
('lose', 'weight') 0.000644081
('special', 'counsel') 0.000618609
('credibility', 'think') 0.00058222
('mexico', 'border') 0.000571304
('white', 'house') 0.000567665
('attempts', 'coup') 0.000556748
('gt', 'attempts') 0.000556748
('mueller', 'witch') 0.000505804
('hillary', 'clinton') 0.000491248
('crimes', 'obama') 0.000473054

Pointwise Mutual Information Score

  • Interesting, but heavily influenced by the different time periods.
Control Tweets
Bigrams PMI Score
('disappearance', 'connecticut') 16.1045
('glen', 'tyrone') 16.1045
('uscis', 'ignor') 16.1045
('advisers', 'departing') 15.8414
('babyhands', 'mcgrifter') 15.8414
('bryan', 'stevenson') 15.8414
('computer', 'intrusions') 15.8414
('grounding', 'airline') 15.8414
('haberman', 'sycophancy') 15.8414
('intimidating', 'construc') 15.8414
('rio', 'grande') 15.8414
('capone', 'vault') 15.619
('hs', 'bp') 15.619
('partnership', 'racing') 15.619
('racing', 'airs') 15.619
('riskier', 'bureaucracy') 15.619
('rweet', 'apprec') 15.619
('unprepared', 'temperamentally') 15.619
('vr', 'arcade') 15.619
('bites', 'dust') 15.5784

Troll Tweets
Bigrams PMI Score
('cessation', 'hostilities') 15.7461
('dunkin', 'donuts') 15.7461
('lena', 'dunham') 15.7461
('notre', 'dame') 15.7461
('snoop', 'dogg') 15.7461
('boko', 'haram') 15.4831
('lectric', 'heep') 15.4831
('nagorno', 'karabakh') 15.4831
('kayleigh', 'mcenany') 15.2607
('otto', 'warmbier') 15.2607
('allahu', 'akbar') 15.0681
('elon', 'musk') 15.0681
('ez', 'zor') 15.0681
('peanut', 'butter') 15.0681
('palo', 'alto') 14.8981
('tomi', 'lahren') 14.8981
('trey', 'gowdy') 14.8981
('betsy', 'devos') 14.8457
('caitlyn', 'jenner') 14.7753
('cranky', 'senile') 14.7461
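
For readers unfamiliar with the score, here is a minimal sketch of pointwise mutual information for bigrams, assuming the standard definition PMI(x, y) = log2(P(x, y) / (P(x) P(y))). The `bigram_pmi` helper and the toy token list are hypothetical, and the notebook's own implementation (and any frequency filter it applied) may differ.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Return {bigram: PMI in bits}, estimated from one token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi            # probability of the pair
        p_x = unigrams[w1] / n_uni     # probability of the first word
        p_y = unigrams[w2] / n_uni     # probability of the second word
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores

toy_tokens = "new mexico border new mexico city peanut butter".split()
scores = bigram_pmi(toy_tokens)
for bigram, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(bigram, round(score, 2))
```

In the toy example the pair that occurs once, with words that appear nowhere else, outscores the more frequent pair. That is consistent with the tables above, where pairs whose words rarely appear apart (names and fixed phrases) sit at the top.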

Sentiment Analysis with VADER

png

RESULTS OF SENTIMENT ANALYSIS BINARY CLASSIFICATION:
 ------------------------------------------------------------
	Normalized Troll Classes:
 pos    0.613535
neg    0.386465
Name: sentiment_class, dtype: float64

	Normalized Control Classes:
 pos    0.598126
neg    0.401874
Name: sentiment_class, dtype: float64

png

png

png

MODEL

BUILDING THE INITIAL MODELS

Logistic Regression

Summary Table of Clocked Processes
Lap # Start Time Duration Label
TOTAL 09/11/19 - 12:00:27 PM 0 min, 0.945 sec LogisticRegression complete.
               precision    recall  f1-score   support

           0       0.88      0.88      0.88      7552
           1       0.88      0.88      0.88      7551

   micro avg       0.88      0.88      0.88     15103
   macro avg       0.88      0.88      0.88     15103
weighted avg       0.88      0.88      0.88     15103

png

Train Accuracy:  0.9390843738176314
Test Accuracy:  0.8788320201284513

INITIAL MODEL SUMMARY: Logistic Regression

  • Accuracy:

    • 0.939 for train set
    • 0.879 for test set
  • Recall/Precision/F1-scores all around 0.88

  • Duration:

    • 0.945 sec

DecisionTreeClassifier

Summary Table of Clocked Processes
Lap # Start Time Duration Label
TOTAL 09/11/19 - 12:00:28 PM 0 min, 20.204 sec

png

Train Accuracy:  0.9390843738176314
Test Accuracy:  0.8273852876911871

Plotting Feature Importance

png

Random Forests Classifier

Summary Table of Clocked Processes
Lap # Start Time Duration Label
TOTAL 09/11/19 - 12:00:50 PM 2 min, 36.929 sec Accuracy:0.8273852876911871
              precision    recall  f1-score   support

           0       0.79      0.94      0.86      7552
           1       0.93      0.74      0.83      7551

   micro avg       0.84      0.84      0.84     15103
   macro avg       0.86      0.84      0.84     15103
weighted avg       0.86      0.84      0.84     15103

png

Train Accuracy:  0.9390843738176314
Test Accuracy:  0.8428126862212806
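
The f1-score column in the classification reports is the harmonic mean of precision and recall. As a small worked check, the sketch below recomputes it from the rounded precision and recall shown for class 0 of the Random Forest report; the `f1_score` helper is written here for illustration and is not a library call.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rounded values for class 0 (control tweets) from the Random Forest report.
control_f1 = f1_score(0.79, 0.94)
print(round(control_f1, 2))  # prints 0.86
```

Recomputing from rounded inputs can differ from the report in the last digit, since the report rounds only after computing from the raw counts.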

Keras Model 1: creating a Text Classification Neural Network in Keras

  • Using CountVectorized data generated above
from keras import models, layers, optimizers
input_dim = X_train.shape[1]
# input_dim = sequences_train.shape[1]
print(input_dim)

model1 = models.Sequential()
# model.add(layers.Embedding)

model1.add(layers.Dense(10, input_dim= input_dim, activation='relu'))
model1.add(layers.Dense(1, activation='sigmoid'))

model1.compile(loss='binary_crossentropy',optimizer="adam",metrics=['accuracy'])
model1.summary()
30191
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 10)                301920    
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 11        
=================================================================
Total params: 301,931
Trainable params: 301,931
Non-trainable params: 0
_________________________________________________________________



--- CLOCK STARTED @:    09/11/19 - 12:03:27 PM           Label: starting keras .fit --- 
Train on 52860 samples, validate on 7552 samples
Epoch 1/10
52860/52860 [==============================] - 22s 407us/step - loss: 0.5001 - acc: 0.8236 - val_loss: 0.3608 - val_acc: 0.8567
Epoch 2/10
52860/52860 [==============================] - 20s 384us/step - loss: 0.2977 - acc: 0.8812 - val_loss: 0.2973 - val_acc: 0.8710
Epoch 3/10
52860/52860 [==============================] - 20s 387us/step - loss: 0.2340 - acc: 0.9076 - val_loss: 0.2808 - val_acc: 0.8775
Epoch 4/10
52860/52860 [==============================] - 22s 414us/step - loss: 0.1975 - acc: 0.9242 - val_loss: 0.2770 - val_acc: 0.8799
Epoch 5/10
52860/52860 [==============================] - 23s 431us/step - loss: 0.1719 - acc: 0.9350 - val_loss: 0.2805 - val_acc: 0.8792
Epoch 6/10
52860/52860 [==============================] - 22s 417us/step - loss: 0.1526 - acc: 0.9427 - val_loss: 0.2868 - val_acc: 0.8774
Epoch 7/10
52860/52860 [==============================] - 25s 470us/step - loss: 0.1375 - acc: 0.9483 - val_loss: 0.2953 - val_acc: 0.8771
Epoch 8/10
52860/52860 [==============================] - 23s 439us/step - loss: 0.1251 - acc: 0.9538 - val_loss: 0.3057 - val_acc: 0.8770
Epoch 9/10
52860/52860 [==============================] - 22s 425us/step - loss: 0.1147 - acc: 0.9573 - val_loss: 0.3180 - val_acc: 0.8766
Epoch 10/10
52860/52860 [==============================] - 24s 453us/step - loss: 0.1062 - acc: 0.9615 - val_loss: 0.3311 - val_acc: 0.8746
--- TOTAL DURATION   =  3 min, 43.946 sec --- 
Summary Table of Clocked Processes
Lap # Start Time Duration Label
TOTAL 09/11/19 - 12:03:27 PM 3 min, 43.946 sec completed 10 epochs
52860/52860 [==============================] - 19s 366us/step
Training Accuracy:0.9676125614696319
15103/15103 [==============================] - 5s 355us/step
Testing Accuracy:0.8750579354542498

png

Summary - Neural Network Analysis on CountVectorized Tweets

  • Accuracy:
    • Training: 0.968
    • Testing: 0.875
  • Run time:
    • 3:44 min

Keras Model 2 Adding pretrained Word2Vec embeddings

  • Adding Word2Vec vectorization into an embedding layer

How to get a word2vector model's vectors into a Keras Embedding Layer

text_data = df_tokenize['content']
from gensim.models import Word2Vec
vector_size = 300

wv_keras = Word2Vec(text_data, size=vector_size, window=10, min_count=1, workers=4)
wv_keras.train(text_data,total_examples=wv_keras.corpus_count, epochs=10)

wv = wv_keras.wv
vocab_size = len(wv_keras.wv.vocab)
print(f'There are {vocab_size} words in the word2vec vocabulary, with a vector size {vector_size}.')
There are 801 words in the word2vec vocabulary, with a vector size 300.

https://adventuresinmachinelearning.com/word2vec-keras-tutorial/ https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/

# save the vectors in a new matrix
word_model = wv_keras
vector_size = word_model.wv.vectors[1].shape[0]

embedding_matrix = np.zeros((len(word_model.wv.vocab) + 1, vector_size))
for i, vec in enumerate(word_model.wv.vectors):
  embedding_matrix[i] = vec
# Get list of texts to be converted to sequences
# sentences_train =text_data # df_tokenize['tokens'].values
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=len(wv.vocab))
tokenizer.fit_on_texts(list(text_data)) #tokenizer.fit_on_texts(text_data)

word_index = tokenizer.index_word
reverse_index = {v:k for k,v in word_index.items()}
# return integer-encoded sentences
from keras.preprocessing import text, sequence
X = tokenizer.texts_to_sequences(text_data)
X = sequence.pad_sequences(X)

y = df_tokenize['troll_tweet'].values
# reverse_index
X_train, X_test, X_val, y_train, y_test, y_val = train_test_val_split(X, y)#, test_size=0.1, shuffle=False)
model2 = models.Sequential()

model2.add(layers.Embedding(len(wv_keras.wv.vocab)+1,
                             vector_size,input_length=X_train.shape[1],
                             weights=[embedding_matrix],trainable=False)) 
          
model2.add(layers.LSTM(300, return_sequences=False))#, kernel_regularizer=regularizers.l2(.01)))
# model1B.add(layers.GlobalMaxPooling1D())
model2.add(layers.Dense(10, activation='relu'))
model2.add(layers.Dense(1, activation='sigmoid'))

model2.compile(loss='binary_crossentropy',optimizer="adam",metrics=['accuracy'])
model2.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 51, 300)           240600    
_________________________________________________________________
lstm_1 (LSTM)                (None, 300)               721200    
_________________________________________________________________
dense_3 (Dense)              (None, 10)                3010      
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 11        
=================================================================
Total params: 964,821
Trainable params: 724,221
Non-trainable params: 240,600
_________________________________________________________________
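
The parameter counts in the model summaries can be checked by hand. The sketch below uses the usual formulas for Dense, LSTM, and Embedding layers; the helper names are hypothetical, and the layer sizes are read from the summaries and code above.

```python
def dense_params(n_inputs, n_units):
    """One weight per input plus one bias, for each unit."""
    return (n_inputs + 1) * n_units

def lstm_params(n_inputs, n_units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (n_inputs + n_units + 1) * n_units

def embedding_params(n_rows, vector_size):
    """One vector per row of the embedding matrix."""
    return n_rows * vector_size

# Model 1: 30191 count-vectorized features -> Dense(10) -> Dense(1)
print(dense_params(30191, 10), dense_params(10, 1))

# Model 2: Embedding(801 + 1 rows, 300) -> LSTM(300) -> Dense(10) -> Dense(1)
print(embedding_params(801 + 1, 300), lstm_params(300, 300), dense_params(300, 10))
```

The embedding count is listed as non-trainable in the summary because the layer was created with `trainable=False`.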
num_epochs = 5
history = model2.fit(X_train, y_train, epochs=num_epochs, verbose=True, validation_data=(X_val, y_val), batch_size=500)
--- CLOCK STARTED @:    09/11/19 - 12:08:20 PM           Label: starting keras .fit --- 
Train on 52860 samples, validate on 7552 samples
Epoch 1/5
52860/52860 [==============================] - 131s 2ms/step - loss: 0.4562 - acc: 0.7795 - val_loss: 0.3929 - val_acc: 0.8124
Epoch 2/5
52860/52860 [==============================] - 127s 2ms/step - loss: 0.3790 - acc: 0.8203 - val_loss: 0.3764 - val_acc: 0.8173
Epoch 3/5
52860/52860 [==============================] - 127s 2ms/step - loss: 0.3503 - acc: 0.8348 - val_loss: 0.3578 - val_acc: 0.8264
Epoch 4/5
52860/52860 [==============================] - 120s 2ms/step - loss: 0.3292 - acc: 0.8464 - val_loss: 0.3528 - val_acc: 0.8289
Epoch 5/5
52860/52860 [==============================] - 122s 2ms/step - loss: 0.3068 - acc: 0.8568 - val_loss: 0.3420 - val_acc: 0.8371
--- TOTAL DURATION   =  10 min, 28.585 sec --- 
Summary Table of Clocked Processes
Lap # Start Time Duration Label
TOTAL 09/11/19 - 12:08:20 PM 10 min, 28.585 sec completed 5 epochs
loss, accuracy = model2.evaluate(X_train, y_train, verbose=True)
print(f'Training Accuracy:{accuracy}')

loss, accuracy = model2.evaluate(X_test, y_test, verbose=True)
print(f'Testing Accuracy:{accuracy}')
52860/52860 [==============================] - 70s 1ms/step
Training Accuracy:0.8670450245977757
15103/15103 [==============================] - 20s 1ms/step
Testing Accuracy:0.8459908626911964
jmi.plot_keras_history(history)

png

model2.save('model2_emb_lstm_dense_dense.hd5',include_optimizer=True, overwrite=True)
model2.save_weights('model2_emb_lstm_dense_dense_WEIGHTS.hdf')

Keras model 3 -using keras' tokenizer to fit_on_texts+one_hot

Train, test, val split

# from sklearn.feature_extraction.text import CountVectorizer
from keras.preprocessing.text import Tokenizer, one_hot
from keras.utils.np_utils import to_categorical

from sklearn import preprocessing
from sklearn.model_selection import train_test_split

from keras import models, layers, optimizers

# df_tokenize.head()
# Define tweets to be analyzed, fit tokenizer,generate sequences
tweets = df_tokenize['content']
# num_words=len(set(tweets))
tokenizer = Tokenizer(num_words=3000)

tokenizer.fit_on_texts(tweets)
sequences = tokenizer.texts_to_sequences(tweets)
one_hot_results = tokenizer.texts_to_matrix(tweets, mode='binary')

word_index = tokenizer.word_index
reverse_index = {v:k for k,v in word_index.items()}
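
As a rough picture of what a binary word matrix contains, here is a simplified pure-Python stand-in. It is an assumption-laden sketch, not Keras' implementation: it splits on whitespace only, the `binary_matrix` helper is hypothetical, and it mimics the convention that column 0 is unused and only the most frequent `num_words - 1` words get a column.

```python
from collections import Counter

def binary_matrix(texts, num_words):
    """Return (vocab, rows): rows[i][j] is 1 if word j occurs in text i."""
    counts = Counter(w for t in texts for w in t.lower().split())
    # Most frequent word gets index 1; index 0 is left unused.
    top_words = counts.most_common(num_words - 1)
    vocab = {w: i for i, (w, _) in enumerate(top_words, start=1)}
    rows = []
    for t in texts:
        row = [0] * num_words
        for w in t.lower().split():
            idx = vocab.get(w)
            if idx is not None:        # words outside the vocabulary are ignored
                row[idx] = 1
        rows.append(row)
    return vocab, rows

vocab, rows = binary_matrix(["fake news media", "fake news"], num_words=3)
print(vocab)
print(rows)
```

Each tweet becomes a fixed-length row of zeros and ones, which is why the matrix above has one column per retained word regardless of tweet length.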

Test Train Split

print(one_hot_results.shape, y.shape)
(75515, 3000) (75515,)
import random, math
random.seed(42)
test_size = math.floor(one_hot_results.shape[0]*0.3)
test_index = random.sample(range(one_hot_results.shape[0]), test_size)  # sample from every row, including index 0

test = one_hot_results[test_index]
train = np.delete(one_hot_results, test_index, 0)
label_test = y[test_index]
label_train = np.delete(y, test_index, 0)
train.shape
train.shape[1]
3000
model3 = models.Sequential()
model3.add(layers.Dense(50, activation='relu', input_shape=(3000,)))
model3.add(layers.Dense(25, activation='relu'))
model3.add(layers.Dense(1,activation='sigmoid'))

model3.compile(optimizer='adam', loss='binary_crossentropy',metrics=['accuracy'])
model3.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_5 (Dense)              (None, 50)                150050    
_________________________________________________________________
dense_6 (Dense)              (None, 25)                1275      
_________________________________________________________________
dense_7 (Dense)              (None, 1)                 26        
=================================================================
Total params: 151,351
Trainable params: 151,351
Non-trainable params: 0
_________________________________________________________________
clock_1hot = bs.Clock()
clock_1hot.tic()

history = model3.fit(train, label_train, epochs=10, batch_size=256, validation_data=(test, label_test))
clock_1hot.toc('')
--- CLOCK STARTED @:    09/11/19 - 12:20:30 PM --- 
Train on 52861 samples, validate on 22654 samples
Epoch 1/10
52861/52861 [==============================] - 4s 84us/step - loss: 0.3724 - acc: 0.8285 - val_loss: 0.2785 - val_acc: 0.8753
Epoch 2/10
52861/52861 [==============================] - 3s 61us/step - loss: 0.2456 - acc: 0.8929 - val_loss: 0.2620 - val_acc: 0.8865
Epoch 3/10
52861/52861 [==============================] - 3s 64us/step - loss: 0.2059 - acc: 0.9115 - val_loss: 0.2484 - val_acc: 0.8947
Epoch 4/10
52861/52861 [==============================] - 3s 66us/step - loss: 0.1648 - acc: 0.9309 - val_loss: 0.2504 - val_acc: 0.8976
Epoch 5/10
52861/52861 [==============================] - 3s 66us/step - loss: 0.1296 - acc: 0.9476 - val_loss: 0.2656 - val_acc: 0.8973
Epoch 6/10
52861/52861 [==============================] - 3s 61us/step - loss: 0.1009 - acc: 0.9606 - val_loss: 0.2867 - val_acc: 0.8964
Epoch 7/10
52861/52861 [==============================] - 3s 60us/step - loss: 0.0773 - acc: 0.9711 - val_loss: 0.3151 - val_acc: 0.8963
Epoch 8/10
52861/52861 [==============================] - 3s 60us/step - loss: 0.0594 - acc: 0.9782 - val_loss: 0.3539 - val_acc: 0.8965
Epoch 9/10
52861/52861 [==============================] - 3s 60us/step - loss: 0.0465 - acc: 0.9836 - val_loss: 0.3810 - val_acc: 0.8964
Epoch 10/10
52861/52861 [==============================] - 3s 61us/step - loss: 0.0367 - acc: 0.9873 - val_loss: 0.4169 - val_acc: 0.8959
--- TOTAL DURATION   =  0 min, 34.478 sec --- 
Summary Table of Clocked Processes
Lap # Start Time Duration Label
TOTAL 09/11/19 - 12:20:30 PM 0 min, 34.478 sec
loss, accuracy = model3.evaluate(train, label_train, verbose=True)
print(f'Training Accuracy:{accuracy}')

loss, accuracy = model3.evaluate(test, label_test, verbose=True)
print(f'Testing Accuracy:{accuracy}')
52861/52861 [==============================] - 2s 35us/step
Training Accuracy:0.990976334149251
22654/22654 [==============================] - 1s 37us/step
Testing Accuracy:0.8959124216579155

png

INTERPRET

  • Summary:
    • In terms of efficiency, no model can beat a simple Logistic Regression.
    • Decision Trees and Random Forests did not improve performance and took significantly longer.
    • In terms of accuracy, a neural network with 3 layers of neurons, trained on the binary word matrix from Keras' tokenizer, outperformed all other models with roughly 90% accuracy on the testing data and a run time of about 34 seconds.

  • Caveats:
    • Perfect control tweets were not available due to the limitations of the Twitter API. If we had access to batch historical tweets, we might be able to better classify troll tweets, as we would be able to leave the hashtags and mentions in the body of the tweet for vectorization.
    • There is the possibility that the accuracy tested as-is would decrease, due to the elimination of any contemporaneous events that influence tweet contents.

FUTURE DIRECTIONS

  • With additional time, we would have explored additional Neural Network configurations using bi-directional layers and additional Dense layers for classification.

  • Additional methods of words/sentence vectorization

  • Analysis using Named Entity Recognition with spaCy

  • Additional Visualization

  • Using the outputs of the logistic regression or neural networks with model stacking

    • Adding in the other non-language characteristics of the tweets to further improve accuracy.