To scrape Dutch tweets and automatically translate them, run `pip install -r requirements.txt` to install dependencies, then `./scrape-and-translate.sh` to scrape and immediately translate the Dutch tweets to English.
Because we wanted to manually drop nonsensical tweets between these steps, we used a different workflow to create our data. This also let us subdivide the manual work among all five authors.
Our workflow:

- `pip install -r requirements.txt` to install dependencies,
- `python parse-twitter-data.py` to scrape tweets,
- split up the result (`tweets_dutch.txt`) into 5 subsets, one for each author,
- all authors dropped their nonsensical tweets manually,
- all authors labelled their tweets,
- all authors translated their tweets using `python translate.py` (edited to include paths to personal Dutch and English files),
- recombine the authors' Dutch, English and label files into:
  - `/data processing/combined/tweets_dutch.txt`
  - `/data processing/combined/tweets_english.txt`
  - `/data processing/combined/labels.txt`
- run `python "/data processing/combined/txts_to_csvs.py"`.
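The per-author split in the workflow above was a manual step; a minimal sketch of how it could be done (file names and the round-robin assignment are our assumptions, not a script from the repo) is:

```python
def split_for_authors(path, n_authors=5, prefix="tweets_dutch_author"):
    """Split a one-tweet-per-line file into n_authors roughly equal subsets.

    Hypothetical helper: file names and splitting scheme are assumptions.
    """
    with open(path, encoding="utf-8") as f:
        tweets = f.read().splitlines()
    for i in range(n_authors):
        # author i takes every n_authors-th tweet, starting at offset i
        subset = tweets[i::n_authors]
        with open(f"{prefix}{i + 1}.txt", "w", encoding="utf-8") as out:
            out.write("\n".join(subset) + "\n")
```

Any split that preserves line order within each subset would do; the round-robin scheme simply keeps the subset sizes balanced.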
The resulting files (`dutch.csv` and `english.csv`) are loaded by our Colab notebook.
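We do not reproduce `txts_to_csvs.py` here; a hedged sketch of what such a txt-to-CSV conversion might look like (column names and file layout are assumptions) is:

```python
import csv

def txt_to_csv(tweets_path, labels_path, csv_path):
    """Pair each tweet line with its label line and write a two-column CSV.

    Hypothetical sketch: column names ("text", "label") and the
    one-line-per-tweet layout are assumptions, not the repo's actual script.
    """
    with open(tweets_path, encoding="utf-8") as tf, \
         open(labels_path, encoding="utf-8") as lf:
        tweets = tf.read().splitlines()
        labels = lf.read().splitlines()
    # both files must stay line-aligned for the pairing to be valid
    assert len(tweets) == len(labels), "tweet/label count mismatch"
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["text", "label"])
        writer.writerows(zip(tweets, labels))
```

Using the `csv` module rather than naive string joins keeps tweets containing commas or quotes intact in the output.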
For the steps we took to train our model and analyse its performance, please have a look at our Colab notebook.