This repository contains the Python implementation for training a model that tracks tweets relevant to internal displacement monitoring and extracts their key information. The project is a collaboration with the NGO IDMC and was carried out as part of the data science hackathon "Hack4Good", organised by the Analytics Club at ETH Zurich. General information on the project can be found in the general report (Report_General.pdf), and technical details are explained in the technical report (Report_Technical.pdf). A blog post on this project can be found on IDMC's Expert Opinion website.
- Set up the environment by installing all requirements:
pip install -r requirements.txt
- Download the pretrained Word2Vec model (https://drive.google.com/file/d/1lw5Hr6Xw0G0bMT1ZllrtMqEgCTrM7dzc/view) and store it in
data/utils_data
- Set up a Twitter Developer Account and save the credentials in
config/data_acquisition/secrets.json
- Apply for a Twitter Developer Account: https://developer.twitter.com/en/apply-for-access
- Follow the application process
- After a successful application, visit the Developer Dashboard: https://developer.twitter.com/en/dashboard
- Go to the App Dashboard and create an app
- After successful app creation, save the API consumer and access keys and tokens to
config/data_acquisition/secrets.json
- Set up the internal displacement keywords (from the CSV files in 'data/utils_data'):
python src/setup.py
- Train the classifier on the included labelled English tweets:
python src/run_trainDefaultClassifier.py
- Train the custom Named Entity Recognition model on the included training data:
python src/run_trainNER.py
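The credentials file from the Twitter setup step can be checked before running any extraction. The following is a minimal sketch, not the repository's own loader: the key names in `REQUIRED_KEYS` are hypothetical, and the names actually expected in 'config/data_acquisition/secrets.json' should be taken from the repository code.

```python
import json
from pathlib import Path

# Hypothetical key names (an assumption); adapt them to what the repository expects.
REQUIRED_KEYS = (
    "consumer_key",
    "consumer_secret",
    "access_token",
    "access_token_secret",
)


def load_credentials(path="config/data_acquisition/secrets.json"):
    """Read the credentials file and fail early if a key is missing or empty."""
    with Path(path).open(encoding="utf-8") as handle:
        secrets = json.load(handle)
    missing = [key for key in REQUIRED_KEYS if not secrets.get(key)]
    if missing:
        raise KeyError(f"Missing credentials in {path}: {', '.join(missing)}")
    return {key: secrets[key] for key in REQUIRED_KEYS}
```

Failing early on a missing key avoids starting a long extraction run that would only be rejected by the Twitter API later.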
Default parameters:
python src/run_fullClassificationPipeline.py
Configure parameters:
python src/run_fullClassificationPipeline.py
-l en | es | fr
-c svm | randomforest | linear | bayes
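To run the full pipeline for several languages in a row, the command above can be wrapped in a small driver script. This is a sketch under the assumption that it is started from the repository root; the helper names `build_pipeline_command` and `run_pipeline` are illustrative and not part of the repository.

```python
import subprocess
import sys

# Values documented for the -l and -c flags.
LANGUAGES = ("en", "es", "fr")
CLASSIFIERS = ("svm", "randomforest", "linear", "bayes")


def build_pipeline_command(language, classifier):
    """Build the argument list for one run of the full pipeline."""
    if language not in LANGUAGES:
        raise ValueError(f"Unsupported language: {language}")
    if classifier not in CLASSIFIERS:
        raise ValueError(f"Unsupported classifier: {classifier}")
    return [
        sys.executable,
        "src/run_fullClassificationPipeline.py",
        "-l", language,
        "-c", classifier,
    ]


def run_pipeline(languages, classifier):
    """Run the pipeline once per language and stop at the first failure."""
    for language in languages:
        subprocess.run(build_pipeline_command(language, classifier), check=True)
```

For example, `run_pipeline(["en", "fr"], "svm")` would run the English pipeline first and the French one only if the first run exits successfully.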
The steps of the pipeline can also be executed individually.
This script extracts the tweets containing a combination of keywords and saves them in 'data/raw_data/raw_tweets'.
python src/data_acquisition/TweetExtractor.py
--language en | es | fr
--verbose
This script preprocesses the tweets saved in 'data/raw_data/raw_tweets' or 'data/raw_data/labelled_tweets' and saves the preprocessed tweets in 'data/preprocessed_data/predict_tweets' or 'data/preprocessed_data/labelled_tweets' respectively.
python src/data_preprocessing/TweetPreprocessor.py
--language en | es | fr
--labelled
--verbose
--date YYYY-mm-dd (optional, default is newest extraction date)
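The `--date` default ("newest extraction date") can be pictured as picking the latest date found among the saved files. The sketch below illustrates that idea only; it assumes that file names contain a `YYYY-mm-dd` date, which may not match the naming scheme the repository actually uses, and `newest_date` is a hypothetical helper.

```python
import re
from datetime import datetime
from pathlib import Path

DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def newest_date(directory):
    """Return the most recent date found in the file names, or None if there is none."""
    dates = []
    for path in Path(directory).iterdir():
        match = DATE_PATTERN.search(path.name)
        if match is None:
            continue
        try:
            dates.append(datetime.strptime(match.group(), "%Y-%m-%d").date())
        except ValueError:
            # The digits matched the pattern but are not a valid calendar date.
            continue
    return max(dates, default=None)
```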
These scripts train, test and use a binary classifier on the data in 'data/preprocessed_data' and save the classifier to, or load it from, 'data/models'.
Classify preprocessed unlabelled tweets. The predictions are saved in 'results/predictions'; the 'summary' file contains the extracted relevant information.
python src/data_classification/TweetClassifier.py
--mode predict
--language en | es | fr
--classifier svm | randomforest | linear | bayes
--date YYYY-mm-dd (optional, default is newest preprocessed date)
--classifierLanguage en
--verbose
Train new classifier on preprocessed labelled tweets:
python src/data_classification/TweetClassifier.py
--mode train
--classifierLanguage en
--classifier svm | randomforest | linear | bayes
--verbose
--cv
Test the classifier on preprocessed labelled tweets. The predictions are saved with a test-prefix in 'results/predictions'; the 'summary' file contains the extracted relevant information.
python src/data_classification/TweetClassifier.py
--mode test
--language en | es | fr
--classifier svm | randomforest | linear | bayes
--classifierLanguage en
--verbose
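The three modes above share most of their flags. As an illustration only, a command-line interface with these flags could be declared with `argparse` as follows. Which flags are required and what their defaults are is an assumption made for this sketch, not taken from 'src/data_classification/TweetClassifier.py'.

```python
import argparse


def build_parser():
    """Declare the documented flags; required/default settings are assumptions."""
    parser = argparse.ArgumentParser(
        description="Train, test or apply the tweet classifier (illustrative sketch)."
    )
    parser.add_argument("--mode", choices=["train", "test", "predict"], required=True)
    parser.add_argument("--language", choices=["en", "es", "fr"],
                        help="language of the tweets to classify")
    parser.add_argument("--classifier", required=True,
                        choices=["svm", "randomforest", "linear", "bayes"])
    parser.add_argument("--classifierLanguage", default="en",
                        help="language the classifier was trained on")
    parser.add_argument("--date", help="YYYY-mm-dd; newest preprocessed date if omitted")
    parser.add_argument("--verbose", action="store_true")
    parser.add_argument("--cv", action="store_true",
                        help="use cross-validation when training")
    return parser
```

With this sketch, `build_parser().parse_args(["--mode", "train", "--classifier", "svm", "--cv"])` yields a namespace whose `mode` is `"train"` and whose `cv` is `True`.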
Extract information from already predicted tweets. Note that this step is done automatically when predicting.
python src/data_classification/InformationExtractor.py
--language en | es | fr
--classifier svm | randomforest | linear | bayes
--classifierLanguage en
--date YYYY-mm-dd (necessary)
--labelled (if prediction of test tweets)
- More languages can be supported by adding their keywords to the keyword CSV files in 'data/utils_data'.
- Model parameters can be configured in the model config: 'config/data_classification/model_config.json'.
- The translator implementation in 'src/data_preprocessing/Translator.py' can and should be modified, e.g. to use Google Translate with a valid account.
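One way to make the translator easy to swap is to pass the translation service in as a function. The class below is a hypothetical sketch of that design and does not reproduce the interface of 'src/data_preprocessing/Translator.py'; the backend signature `(text, source_language, target_language)` is an assumption.

```python
from typing import Callable

# A backend takes (text, source_language, target_language) and returns the translation.
Backend = Callable[[str, str, str], str]


class Translator:
    """Hypothetical wrapper that delegates translation to an injected backend."""

    def __init__(self, backend: Backend, target_language: str = "en"):
        self._backend = backend
        self.target_language = target_language

    def translate(self, text: str, source_language: str) -> str:
        # Skip the backend call when there is nothing to translate.
        if source_language == self.target_language or not text.strip():
            return text
        return self._backend(text, source_language, self.target_language)
```

A client for a paid translation service can then be wrapped in a function with this signature and passed to `Translator` without touching the rest of the preprocessing code.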