This repository contains the code required for training the CNN models used in Twitter Election Classification.
Our code has a number of Python dependencies with particular versions. We recommend the Anaconda distribution of Python, using the provided `environment.yml` file to initialise the dependencies.
Use the terminal or an Anaconda Prompt for the following steps:
- Create the environment from the `environment.yml` file:

  ```
  conda env create -f environment.yml
  ```
- Verify that the new environment was installed correctly:

  ```
  conda env list
  ```
- Activate the new environment (replace `myenv` with the environment name defined in `environment.yml`):

  ```
  conda activate myenv
  ```
You can also check the conda documentation for creating an environment from a YAML file.
NOTE: Before running any command in these instructions, please make sure the environment has been activated and you are in the `Twitter-Election-Classification` folder.
This section shows how to replicate the CNN and SVM results.
Tweets can be downloaded through the Twitter API using the provided script `data_replicate.py`. Before running the script, you need to configure the `consumer_token`, `consumer_secret`, `access_token` and `access_secret` variables in `lib/twitterAPI.json` for accessing the Twitter API.
For more information about obtaining Twitter API access, see https://developer.twitter.com/en/apply-for-access.html
Open `lib/twitterAPI.json` and paste your `consumer_token`, `consumer_secret`, `access_token` and `access_secret` into:

```json
{"consumer_token": "YourConsumerToken", "consumer_secret": "YourConsumerSecret", "access_token": "YourAccessToken", "access_secret": "YourAccessSecret"}
```
In addition to Twitter API access, you need the tweet IDs of the EV dataset, which can be accessed separately.
The EV dataset is in CSV format, as in the example below:
```
tid,uid,handler,election,violence,date
796303280896,1048772612,CDEEEE,yes,violence,2016-11-01
798360141124,2283746327,ABCCCC,yes,no,unknown
796269422126,1427364627,CBAAAA,no,no,unknown
813424462817,1293837483,EDCCCC,no,no,unknown
...
```
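Before running the script, you may want to sanity-check the CSV. A minimal sketch, assuming pandas is available in the environment (the file path is a placeholder):

```python
import pandas as pd

# Read tweet and user IDs as strings so large IDs keep their exact digits.
ev = pd.read_csv("/path/to/ev/dataset.csv", dtype={"tid": str, "uid": str})

print(ev.head())                       # first rows, matching the format above
print(ev["election"].value_counts())  # distribution of the election label
```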
Once the EV dataset is downloaded, run the script for each dataset.

To download and pre-process tweets of the Ghana dataset:

```
python data_replicate.py --data_path /path/to/ghana/dataset/csv/file --name gh
```

To download and pre-process tweets of the Philippines dataset:

```
python data_replicate.py --data_path /path/to/philippines/dataset/csv/file --name ph
```

To download and pre-process tweets of the Venezuela dataset:

```
python data_replicate.py --data_path /path/to/venezuela/dataset/csv/file --name vz
```
Tweets will be automatically downloaded from Twitter and processed. Raw tweets (before pre-processing) are saved in the `downloads/raw` folder; pre-processed tweets are saved in the `downloads/processed` folder.
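For illustration only, a rough sketch of how a single tweet could be fetched by its `tid` with Tweepy (again an assumption; the real download logic lives in `data_replicate.py` and may differ):

```python
import tweepy

def fetch_tweet_text(api: tweepy.API, tid: str):
    """Return the full text of a tweet, or None if it is no longer available."""
    try:
        status = api.get_status(tid, tweet_mode="extended")
        return status.full_text
    except tweepy.TweepyException:  # Tweepy >= 4; older versions raise TweepError
        # Deleted, protected, or suspended tweets cannot be retrieved.
        return None
```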
To obtain the results of the CNN models, run:

```
python cnn_replicate.py
```

To obtain the results of the SVM models, run:

```
python svm_replicate.py
```

Results will be printed on the screen.
Due to the randomness of weight initialization and data availability (e.g. tweets deleted by Twitter users result in less data being downloaded), the results may vary slightly.
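If you want to reduce the run-to-run variation caused by weight initialization, one option is to fix the random seeds before training. A sketch, assuming NumPy and TensorFlow are the backends in use; the replication scripts may not expose a seed option:

```python
import random

import numpy as np
import tensorflow as tf

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)  # TensorFlow 2.x; on 1.x use tf.set_random_seed(SEED)
```

Note that variation caused by deleted tweets cannot be controlled this way.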
Print the basic statistics of the tweet dataset:

```
python tweet_stats_replicate.py
```
Plot the word2vec embeddings in 2D, illustrating that similar words lie close to each other in the embedding space:

```
python 2d_embed_replicate.py
```
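As a rough illustration of the technique behind this plot, high-dimensional word vectors can be projected to 2D with t-SNE and annotated. The words and vectors below are stand-ins, and whether `2d_embed_replicate.py` uses t-SNE specifically is an assumption:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Stand-in data: a few words and random vectors in place of real word2vec output.
words = ["election", "vote", "ballot", "violence", "protest"]
vectors = np.random.rand(len(words), 100)

# Project the vectors down to two dimensions.
coords = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(vectors)

plt.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y))
plt.show()
```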
Plot the clustering analysis by varying the K parameter:

```
python clustering_k_replicate.py
```
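A common way to choose K is the elbow method: fit K-means for a range of K values and plot the inertia. A minimal sketch with stand-in features (whether `clustering_k_replicate.py` uses exactly this criterion is an assumption):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

features = np.random.rand(200, 50)  # stand-in for real tweet feature vectors

# Fit K-means for each K and record the within-cluster sum of squares.
ks = list(range(2, 11))
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(features).inertia_
            for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("inertia")
plt.show()
```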
A new Twitter dataset should have the same CSV format as the processed tweets described in `downloads/README.md`. Training new models is straightforward: just run `cnn_train.py` with the necessary parameters. To check all the available parameters, run `python cnn_train.py -h`.
If the parameters related to pre-trained word embeddings are not provided, i.e.

- `--vocab_path`: file path to the vocabulary file that lists all the words in the pre-trained embedding, one word per line
- `--vector_path`: file path to the embedding vectors in .txt format, one vector per line

the CNN model will not use pre-trained embeddings:

```
python cnn_train.py --dataset_path "file/path/to/processed/twitter/dataset/csv/file" --lang "lang"
```

To use pre-trained embeddings, run:

```
python cnn_train.py --dataset_path "file/path/to/processed/twitter/dataset/csv/file" --lang "lang" --vocab_path "path/to/vocab/file" --vector_path "path/to/embedding/vector/file/in/text/format"
```
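Given those file formats (one word per line, one vector per line), a pre-trained embedding can be loaded into a word-to-vector mapping roughly as follows; the paths are placeholders, and this is a sketch rather than the loader `cnn_train.py` actually uses:

```python
import numpy as np

# Placeholder paths matching the --vocab_path / --vector_path formats above.
with open("path/to/vocab/file") as f:
    vocab = [line.strip() for line in f]     # one word per line

vectors = np.loadtxt("path/to/vectors.txt")  # one vector per line

assert len(vocab) == vectors.shape[0], "vocab and vectors must align line by line"
word2vec = dict(zip(vocab, vectors))         # word -> embedding vector
```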
For example, to train a model from the processed Venezuela election dataset:

```
python cnn_train.py --dataset_path downloads/processed/vz-tweets.csv --lang "es"
```