You only need to do this if you need to access the Twitter API.

- Create a file `scripts/CREDS.py`.
- Copy the contents of `scripts/CREDS_example.py` into `scripts/CREDS.py`.
- Fill out `scripts/CREDS.py` with your Twitter API credentials.
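After copying, the file might look something like the sketch below. The variable names here are illustrative placeholders only; the real field names are whatever `scripts/CREDS_example.py` defines.

```python
# scripts/CREDS.py -- illustrative placeholders only; use the field names
# from scripts/CREDS_example.py and paste in your own Twitter API values.
API_KEY = "your-api-key"
API_SECRET = "your-api-secret"
BEARER_TOKEN = "your-bearer-token"
```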
You only need to complete the following steps if you are making changes to the preprocessing script. Otherwise, skip to the Preprocessed Data section.

- (Only Isabel can do this step because only she has access to the grid. She uploaded the output of this script to the Drive.) On the CLSP grid, run:

  ```
  python parse_clsp_data.py -d /path/to/mark/data/ -o /path/to/mydir/ --num_cores N
  ```

- Download the raw data from the Google Drive. It's called `raw_twitter_data.tar.gz`. This is a compressed version of the data from the CLSP grid.
- Unzip it by running `tar -xf raw_twitter_data.tar.gz`. To learn more about `.tar.gz` files, see here.
- To run the preprocessing script, run:

  ```
  $ python preprocess_tweets.py --datadir /path/to/raw/data/ --outdir /path/to/data/
  ```

- Upload the updated `all_data_preprocess.tsv` file to the Drive.
The files `data/{train,dev,test}-ids.txt` contain the tweet ids for each split of the dataset. MAKE SURE TO USE THESE SPLITS WHEN TRAINING/TUNING/TESTING.
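A minimal sketch of applying the splits, assuming each ids file holds one tweet id per line and each data row carries an `id` field (file and column names as in this repo; the helper functions themselves are hypothetical):

```python
import os
import tempfile

def load_split_ids(path):
    """Read one tweet id per line from a split file such as data/train-ids.txt."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def filter_split(rows, split_ids):
    """Keep only the rows whose 'id' belongs to the given split."""
    return [row for row in rows if str(row["id"]) in split_ids]

# Tiny demo: a temporary ids file stands in for data/train-ids.txt.
rows = [{"id": "1", "text": "a"}, {"id": "2", "text": "b"}, {"id": "3", "text": "c"}]
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("1\n3\n")
    ids_path = f.name
train_rows = filter_split(rows, load_split_ids(ids_path))
os.unlink(ids_path)
```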
Download the `all_data_preprocess.tsv` file from the Drive. It's a tab-separated file. Save it to `data/`. You still have to featurize this data, depending on your model choice.

It contains the following columns:
- `id`: tweet ID
- `text`: tweet text
- `processed_text`: tweet text, cleaned
- `author_id`: tweet author id
- `retweet_count`: int
- `reply_count`: int
- `like_count`: int
- `quote_count`: int
- `author_followers`: int
- `mentions`: list of mentions
- `mentions_count`: the number of mentions
- `hashtags`: string of hashtags
- `label`: binary label to predict
- `hashtags_tfidf`: precomputed hashtag TF-IDF scores
- `sentiment_score_pos`: precomputed positive sentiment score
- `sentiment_score_neu`: precomputed neutral sentiment score
- `sentiment_score_neg`: precomputed negative sentiment score
- `sentiment_score_comp`: precomputed compound sentiment score
- `text_tfid_sum`: the sum of the TF-IDF scores of the words in the tweet text
- `text_tfid_max`: the max TF-IDF score of the words in the tweet text
- `text_tfid_min`: the min TF-IDF score of the words in the tweet text
- `text_tfid_avg`: the average TF-IDF score of the words in the tweet text
- `text_tfid_std`: the standard deviation of the TF-IDF scores of the words in the tweet text
- `hashtag_tfid_sum`: the sum of the TF-IDF scores of the hashtags of the tweet
- `hashtag_tfid_max`: the max TF-IDF score of the hashtags of the tweet
- `hashtag_tfid_min`: the min TF-IDF score of the hashtags of the tweet
- `hashtag_tfid_avg`: the average TF-IDF score of the hashtags of the tweet
- `hashtag_tfid_std`: the standard deviation of the TF-IDF scores of the hashtags of the tweet
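The TSV can be read with the standard `csv` module. The two rows below are invented stand-ins for the real file, showing only a few of the columns above; note that `csv` yields strings, so numeric feature columns still need casting:

```python
import csv
import io

# Invented two-row stand-in for data/all_data_preprocess.tsv.
sample = (
    "id\tprocessed_text\tlike_count\tlabel\n"
    "1\thello world\t5\t0\n"
    "2\tgood morning\t9\t1\n"
)
rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))

# Cast numeric columns from str to int before featurizing.
likes = [int(r["like_count"]) for r in rows]
labels = [int(r["label"]) for r in rows]
```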
Download the file `data_upsampled_ncf.zip` from the Drive and unzip it in `data/balanced/`. To train or predict on balanced data, use the flag `--balanced`. You can only use this data with non-sequential models.
To train a model, run the following command:

```
$ python main.py train {logreg|bi-lstm|simple-ff|svm|bert} {optional parameters}
```

The script caches the featurized data in `data/`. If you are making changes to the featurization, use the flag `--override-cache`.

The script saves the model weights to `models/`.

If you want to use up-sampling to balance the data (in case your data is highly unbalanced), use the flag `--balance`.
To test a model, run the following command:

```
$ python main.py predict {logreg|bi-lstm|simple-ff|svm|bert} {optional parameters}
```

The script saves predictions to `preds/` and testing metrics to `scores/`.
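As an illustration of the kind of testing metric written to `scores/`, accuracy on the binary label can be computed as below. The gold and predicted label lists are made up, and the repo's actual metric code may differ:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold binary labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

gold = [0, 1, 1, 0]   # made-up gold labels
preds = [0, 1, 0, 0]  # made-up model predictions
acc = accuracy(gold, preds)  # 3 of 4 correct
```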