Introduction project for Deep Learning Analytics Mandy Sack 2019
The goal of this project is to use a Twitter dataset to determine whether an account is a bot, using TensorFlow's DNNClassifier. Running TensorFlow on a GPU should improve training performance.
The data is already separated into bots, referred to as "Content Polluters", and non-bots, referred to as "Legitimate Users".
A column labeled "Bad User" was added to each dataset to record this attribute before the datasets were merged.
Bad User: 1 - Content Polluter 0 - Legitimate User
NumberOfFollowings
NumberOfFollowers
NumberOfTweets
LengthOfScreenName
LengthOfDescriptionInUserProfile
Three non-numeric features were removed during data cleansing because they did not provide information that would help with classification:
UserID
CreatedAt
CollectedAt
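The preparation described above (label each dataset, merge, drop the three non-numeric columns, shuffle) can be sketched as follows. This is only an illustration using the standard library and made-up sample rows. The notebooks may do this differently (for example with pandas), and the function names label_rows and build_training_rows are hypothetical.

```python
import random

# Columns removed during cleansing, as listed above.
DROPPED = ("UserID", "CreatedAt", "CollectedAt")


def label_rows(rows, bad_user):
    """Return copies of the rows with a "Bad User" column set to bad_user."""
    return [{**row, "Bad User": bad_user} for row in rows]


def build_training_rows(polluters, legitimate, seed=0):
    """Label both datasets, merge them, drop non-numeric columns, shuffle."""
    merged = label_rows(polluters, 1) + label_rows(legitimate, 0)
    cleaned = [
        {key: value for key, value in row.items() if key not in DROPPED}
        for row in merged
    ]
    random.Random(seed).shuffle(cleaned)  # fixed seed for a repeatable order
    return cleaned


# Made-up rows for illustration only; these are not values from the dataset.
sample_polluters = [
    {"UserID": "1", "CreatedAt": "2008-01-01", "CollectedAt": "2009-12-01",
     "NumberOfFollowings": 3000, "NumberOfFollowers": 150,
     "NumberOfTweets": 40, "LengthOfScreenName": 12,
     "LengthOfDescriptionInUserProfile": 0},
]
sample_legitimate = [
    {"UserID": "2", "CreatedAt": "2007-06-15", "CollectedAt": "2010-01-10",
     "NumberOfFollowings": 120, "NumberOfFollowers": 95,
     "NumberOfTweets": 800, "LengthOfScreenName": 8,
     "LengthOfDescriptionInUserProfile": 64},
]

training_rows = build_training_rows(sample_polluters, sample_legitimate)
print(len(training_rows), sorted(training_rows[0]))
```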
README.md - This file
Deep Learning Project.ipynb - Notebook that will run the DLA project
DataExploration.ipynb - Notebook that shows some of the commands used for exploring the dataset
data/content_polluters.csv - Twitter account information for content polluters, classified as Bad User
data/legitimate_users.csv - Twitter account information for legitimate users, classified as Non-Bad User
data/mergedData_classified.csv - Twitter account information for both content polluters and legitimate users, sorted by the CollectedAt column
data/trainingData_classified.csv - The merged data with the 3 non-numeric columns removed and the rows randomized for training purposes
docs/social_honeypot_icwsm_2011.pdf - Documentation regarding the data
docs/SevenMonthswiththeDevilsStudy.pdf - White paper that used the Twitter data
You will need Python 3 and Anaconda to run this experiment. If you are going to run it without a GPU, you will want to modify the notebooks to import only tensorflow instead of tensorflow-gpu.
Create an anaconda environment using the following commands:
$ conda create -n dla python=3.6 tensorflow-gpu numpy matplotlib pandas scipy
$ conda activate dla
If you are using an NVIDIA Jetson TX2, you will not be able to use conda. You will need to install all of the Python modules separately: tensorflow-gpu numpy matplotlib pandas scipy
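Whichever way the modules are installed, a quick check that they can be found may save a failed notebook run. The following is a minimal sketch, not part of the project. It relies on the tensorflow-gpu package being imported under the name tensorflow, and the helper name missing_modules is hypothetical.

```python
import importlib.util

# Import names of the modules listed above; tensorflow-gpu imports as "tensorflow".
REQUIRED = ["tensorflow", "numpy", "matplotlib", "pandas", "scipy"]


def missing_modules(names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```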
There are two ways to run this experiment.
If you are going to use Jupyter Notebook, execute all cells in Deep Learning Project.ipynb
From the command line you can simply run:
$ ./demo
The last cell of the Jupyter Notebook and the last output of the demo script both show the result: [b'1']
This means that a "bot" from our content_polluters.csv file was correctly classified as a bot. You can take any row from either the content_polluters.csv file or the legitimate_users.csv file to check whether the classifier is working.
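The result is a byte string rather than a number, so it can help to translate it back into the "Bad User" labels. This is a small sketch, assuming the output is a one-element sequence of byte-string class labels such as [b'1']. The helper name describe_prediction is hypothetical and not part of the notebook.

```python
# Mapping taken from the "Bad User" column definition in this README.
LABELS = {0: "Legitimate User", 1: "Content Polluter"}


def describe_prediction(classes):
    """Map a one-element prediction such as [b'1'] to its label name."""
    class_id = int(classes[0].decode("utf-8"))
    return LABELS[class_id]


print(describe_prediction([b"1"]))  # prints "Content Polluter"
```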
The accuracy is currently 89%, which is not great; it would need to improve substantially, to above 95%.
Initially, the accuracy was 86%, and some improvement was reached by modifying variables.
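For reference, accuracy here is taken to mean the fraction of accounts whose predicted label matches the actual "Bad User" label. The sketch below uses made-up labels, not project results, and the function name accuracy is hypothetical. The notebook may compute this figure differently.

```python
def accuracy(predicted, actual):
    """Return the fraction of predictions that match the actual labels."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Made-up labels for illustration: 8 of the 10 predictions match.
predicted = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
actual = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]
print(f"{accuracy(predicted, actual):.0%}")  # prints "80%"
```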
The dataset used is caverlee-2011 from the website https://botometer.iuni.iu.edu/bot-repository/datasets.html
Reference: Lee, Kyumin, Brian David Eoff, and James Caverlee. "Seven Months with the Devils: A Long-Term Study of Content Polluters on Twitter." ICWSM. 2011.