
Geolocation Predictor

This is the code for predicting the geolocation of tweets by training on token frequencies with Decision Tree and Naïve Bayes classifiers.

Implementation

Feature selection

In util/preprocessing/merge.py,

  • feature_filter drops single-character features such as [a, b, ..., n]
  • merge heuristically merges similar features such as [aha, ahah, ..., ahahahaha] and [taco, tacos]; a rough sketch of both steps follows this list
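
The real implementation lives in util/preprocessing/merge.py; the sketch below is only an illustration of the two ideas, and the function bodies and the 0.8 similarity threshold are assumptions, not the repo's exact code:

from difflib import SequenceMatcher

def feature_filter(features):
    # Drop single-character features such as 'a', 'b', ..., 'n'.
    return [f for f in features if len(f) > 1]

def merge(features, threshold=0.8):
    # Collapse near-duplicate features (e.g. 'taco'/'tacos') onto one representative token.
    merged = {}
    for feature in sorted(features, key=len):
        for representative in merged:
            if SequenceMatcher(None, feature, representative).ratio() >= threshold:
                merged[representative].append(feature)
                break
        else:
            merged[feature] = [feature]
    return merged

tokens = ["a", "n", "aha", "ahah", "taco", "tacos"]
print(merge(feature_filter(tokens)))
# {'aha': ['aha', 'ahah'], 'taco': ['taco', 'tacos']}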

Classifier Combination

In preprocess/merge.py,

Instance manipulation

In util/train.py,

  • complement_nb uses bagging to generate multiple training datasets.
  • complement_nb also uses 42-fold cross-validation to generate multiple training splits; a simplified sketch follows this list.
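
The actual routine is complement_nb in util/train.py; the snippet below is a simplified stand-in using scikit-learn, with toy random data in place of the token-frequency features, to illustrate the two instance-manipulation ideas:

import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import ComplementNB

# Toy stand-in for the token-frequency matrix: rows = users, columns = token counts.
rng = np.random.default_rng(42)
X = rng.integers(0, 5, size=(300, 50))
y = rng.choice(["California", "NewYork", "Georgia"], size=300)

# Bagging: every base ComplementNB is fit on a bootstrap resample of the training data.
bagging = BaggingClassifier(ComplementNB(), n_estimators=10, random_state=42)
bagging.fit(X, y)

# k-fold cross-validation (the repo uses 42 folds); each fold gives a different train/validation split.
cv_scores = cross_val_score(ComplementNB(), X, y, cv=5)
print(bagging.score(X, y), cv_scores.mean())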

Algorithm manipulation

In util/train.py,

  • complement_nb also uses GridSearchCV to train multiple classifiers and select the best one based on accuracy, as sketched below.
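
Again only as an illustrative sketch (the parameter grid here is hypothetical, not the one in util/train.py), GridSearchCV fits one ComplementNB per parameter combination and keeps the one with the best cross-validated accuracy:

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import ComplementNB

# Toy data as in the bagging sketch above.
rng = np.random.default_rng(42)
X = rng.integers(0, 5, size=(300, 50))
y = rng.choice(["California", "NewYork", "Georgia"], size=300)

# Hypothetical grid over ComplementNB's smoothing and normalisation options.
param_grid = {"alpha": [0.1, 0.5, 1.0, 2.0], "norm": [False, True]}
search = GridSearchCV(ComplementNB(), param_grid, scoring="accuracy", cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)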

Dataset

Requirements

  • Python 3+
pip install -r requirements.txt  

Usage

Note: The code removes the old models and results on every run. MAKE SURE you have saved any models you want to keep.

Train

python run.py -t datasets/train-best200.csv datasets/dev-best200.csv  

The output will look like this:

INFO:root:[*] Merging datasets/train-best200.csv   
 42%|████████         | 1006/2396 [00:05<00:20, 92.03 users/s]  
...  
...  
[*] Saved models/0.8126_2019-10-02_20:02  
[*] Accuracy: 0.8125955095803455  
              precision    recall   f_score
California     0.618944  0.835128  0.710966
NewYork        0.899371  0.854647  0.876439
Georgia        0.788070  0.622080  0.695305
weighted       0.827448  0.812596  0.814974

Predict

python run.py -p models/ datasets/dev-best200.csv   

The output will look like this:

...  
INFO:root:[*] Saved results/final_results.csv  
INFO:root:[*] Time costs in seconds:  
           Predict
Time_cost   11.98s

Score

python run.py -s results/final_results.csv  datasets/dev-best200.csv  

The output will look like this:

[*] Accuracy: 0.8224697308099213  
              precision    recall   f_score
California     0.653035  0.852199  0.739441
NewYork        0.747993  0.647940  0.694381
Georgia        0.909456  0.858296  0.883136
weighted       0.833854  0.822470  0.824577
INFO:root:[*] Time costs in seconds:  
           Score
Time_cost  1.48s
  

Train&Predict&Score

python run.py \
 -t datasets/train-best200.csv datasets/dev-best200.csv \
 -p models/ datasets/dev-best200.csv \
 -s results/final_results.csv datasets/dev-best200.csv

Help

python run.py -h  

Used libraries

  • sklearn for Complement Naive Bayes, feature selectors, and other learning tools.
  • pandas and numpy for handling data.
  • tqdm for showing loop progress.
  • joblib for dumping/loading models to/from disk.
  • nltk for identifying word types for feature filtering.

License

See LICENSE file.