ML-Fall2020-Final-Project

Data

Creating your credentials file

You only need to do this step if you plan to access the Twitter API.

  1. Create a file scripts/CREDS.py.
  2. Copy the contents of scripts/CREDS_example.py into scripts/CREDS.py.
  3. Fill out scripts/CREDS.py with your Twitter API credentials.
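For reference, a credentials file of this kind usually just defines the keys as module-level constants. The variable names below are hypothetical placeholders; use the exact names given in scripts/CREDS_example.py.

    # scripts/CREDS.py -- illustrative sketch only; copy the real variable
    # names from scripts/CREDS_example.py. All values here are placeholders.
    API_KEY = "your-api-key"
    API_KEY_SECRET = "your-api-key-secret"
    BEARER_TOKEN = "your-bearer-token"
    ACCESS_TOKEN = "your-access-token"
    ACCESS_TOKEN_SECRET = "your-access-token-secret"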

Raw Data

You only need to complete the following steps if you are making changes to the preprocessing script. Otherwise, skip ahead to Preprocessed Data.

  1. (Only Isabel can do this step because only she has access to the grid; she has uploaded the output of this script to the Drive.) On the CLSP grid, run python parse_clsp_data.py -d /path/to/mark/data/ -o /path/to/mydir/ --num_cores N.
  2. Download the raw data from the Google Drive. It's called raw_twitter_data.tar.gz. This is a compressed version of the data from the CLSP grid.
  3. Unzip it by running tar -xf raw_twitter_data.tar.gz. To learn more about .tar.gz files, see here.
  4. Run the preprocessing script (a rough sketch of the kind of tweet cleaning it performs appears after this list): $ python preprocess_tweets.py --datadir /path/to/raw/data/ --outdir /path/to/data/
  5. Upload the updated all_data_preprocess.tsv file to the drive.
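For a rough idea of what this kind of preprocessing involves, a minimal cleaning step might look like the sketch below. This is illustrative only; the real logic lives in preprocess_tweets.py and may differ.

    # Illustrative sketch of typical tweet cleaning; not the actual contents
    # of preprocess_tweets.py.
    import re

    def clean_tweet(text: str) -> str:
        text = re.sub(r"http\S+", "", text)       # drop URLs
        text = re.sub(r"@\w+", "", text)          # drop @-mentions
        text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
        return text.lower()

    print(clean_tweet("Check this out @someone https://t.co/xyz  #ml"))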

The files data/{train,dev,test}-ids.txt contain the tweet ids for each split of the dataset. MAKE SURE TO USE THESE SPLITS WHEN TRAINING/TUNING/TESTING.
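A minimal sketch of honoring these splits, assuming the id files contain one tweet id per line and that you have downloaded the preprocessed TSV described below:

    # Sketch: filter the preprocessed data down to the official splits.
    # Assumes one tweet id per line in data/{train,dev,test}-ids.txt.
    import pandas as pd

    df = pd.read_csv("data/all_data_preprocess.tsv", sep="\t")

    def load_split(name):
        with open(f"data/{name}-ids.txt") as f:
            ids = {line.strip() for line in f}
        return df[df["id"].astype(str).isin(ids)]

    train_df, dev_df, test_df = (load_split(s) for s in ("train", "dev", "test"))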

Preprocessed Data

Download the all_data_preprocess.tsv file from the drive. It's a tab-separated file. Save it to data/.

You still have to featurize this data, depending on your model choice.

It contains the following columns:

id: Tweet ID
text: Tweet text
processed_text: Tweet text, cleaned
author_id: Tweet author id
retweet_count: int
reply_count: int
like_count: int
quote_count: int
author_followers: int
mentions: list of mentions
mentions_count: the number of mentions
hashtags: string of hashtags
label: Binary label to predict
hashtags_tfidf: Precomputed hashtag tfidf scores
sentiment_score_pos: Precomputed positive sentiment score
sentiment_score_neu: Precomputed neutral sentiment score
sentiment_score_neg: Precomputed negative sentiment score
sentiment_score_comp: Precomputed compound sentiment score
text_tfid_sum: the sum of the tf-idf scores of the words in the tweet text
text_tfid_max: the maximum tf-idf score of the words in the tweet text
text_tfid_min: the minimum tf-idf score of the words in the tweet text
text_tfid_avg: the average tf-idf score of the words in the tweet text
text_tfid_std: the standard deviation of the tf-idf scores of the words in the tweet text
hashtag_tfid_sum: the sum of the tf-idf scores of the tweet's hashtags
hashtag_tfid_max: the maximum tf-idf score of the tweet's hashtags
hashtag_tfid_min: the minimum tf-idf score of the tweet's hashtags
hashtag_tfid_avg: the average tf-idf score of the tweet's hashtags
hashtag_tfid_std: the standard deviation of the tf-idf scores of the tweet's hashtags
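As a hedged example of featurizing this file for a non-sequential model, you might pull the precomputed numeric columns into a feature matrix. The column subset below is illustrative, not the project's official feature set.

    # Sketch: build a simple numeric feature matrix from the precomputed columns.
    import pandas as pd

    df = pd.read_csv("data/all_data_preprocess.tsv", sep="\t")
    numeric_cols = [
        "retweet_count", "reply_count", "like_count", "quote_count",
        "author_followers", "mentions_count",
        "sentiment_score_pos", "sentiment_score_neu",
        "sentiment_score_neg", "sentiment_score_comp",
        "text_tfid_sum", "text_tfid_max", "text_tfid_min",
        "text_tfid_avg", "text_tfid_std",
    ]
    X = df[numeric_cols].fillna(0).to_numpy()
    y = df["label"].to_numpy()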

Balanced Data

Download the file data_upsampled_ncf.zip from the drive. Unzip it in data/balanced/. To train/predict on balanced data, use the flag --balanced. You can only use this data on non-sequential models.
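For context, class-balanced data of this kind is typically built by upsampling the minority class. The sketch below uses sklearn.utils.resample and assumes the binary label takes values 0/1; it is not the script that produced data_upsampled_ncf.zip.

    # Sketch: upsample the minority class so both labels are equally represented.
    # Illustrative only; the provided data_upsampled_ncf.zip was built separately.
    import pandas as pd
    from sklearn.utils import resample

    df = pd.read_csv("data/all_data_preprocess.tsv", sep="\t")
    majority = df[df["label"] == 0]   # assumes a 0/1 binary label
    minority = df[df["label"] == 1]
    if len(minority) > len(majority):
        majority, minority = minority, majority
    upsampled = resample(minority, replace=True, n_samples=len(majority),
                         random_state=42)
    balanced = pd.concat([majority, upsampled]).sample(frac=1, random_state=42)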

Training

To train a model, run the following command:

$ python main.py train {logreg|bi-lstm|simple-ff|svm|bert} {optional parameters}

The script will cache the featurized data in data/. If you are making changes to the featurization, use the flag --override-cache.
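The caching itself is internal to main.py; conceptually it amounts to something like the following sketch. The cache file name and keys here are hypothetical.

    # Sketch of featurized-data caching; the real cache format in main.py may differ.
    import os
    import numpy as np

    def load_or_featurize(model_name, featurize_fn, override_cache=False):
        cache_path = f"data/features_{model_name}.npz"  # hypothetical cache file
        if os.path.exists(cache_path) and not override_cache:
            cached = np.load(cache_path)
            return cached["X"], cached["y"]
        X, y = featurize_fn()           # recompute features from the data
        np.savez(cache_path, X=X, y=y)  # refresh the cache
        return X, y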

The script saves the model weights to models/.

If you want to use up-sampling to balance the data (in case your data is highly imbalanced), use the flag --balance.

Testing

To test a model, run the following command:

$ python main.py predict {logreg|bi-lstm|simple-ff|svm|bert} {optional parameters}

The script saves predictions to preds/ and testing metrics to scores/.
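The metrics written to scores/ are determined by main.py; a hedged sketch of the kind of scoring involved, assuming arrays of binary labels and predictions, looks like this:

    # Sketch: compute basic test metrics from predictions; the actual metrics
    # written to scores/ are decided by main.py.
    from sklearn.metrics import accuracy_score, f1_score

    def score(y_true, y_pred):
        return {"accuracy": accuracy_score(y_true, y_pred),
                "f1": f1_score(y_true, y_pred)}

    print(score([0, 1, 1, 0], [0, 1, 0, 0]))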