/twitter-deepgeo

Tensorflow code to train and predict geolocation of tweets; network also produces compact representation of tweets for hashing purposes

Primary LanguagePythonApache License 2.0Apache-2.0

Requirements

  • python2.7
  • tensorflow 0.8-0.12
  • ujson
  • numpy
  • scipy (for retrieval_map.py, geo_test.py)
  • matplotlib (for geo_test.py)

Data

  • Validation (development) and test tweet data are provided (under data/)
  • For training tweets, use the training downloader script from WNUT Workshop 2016. The script will fetch the training tweet metadata from the official API

Pre-trained Models

  • Pre-trained models can be downloaded here (1.6GB)
  • There are a total of 12 models:
    • deepgeo_RXXX: R=XXX, sigma=0.0, alpha=0.0
    • deepgeo_RXXX_noise: R=XXX, sigma=0.1, alpha=0.0
    • deepgeo_RXXX_loss: R=XXX, sigma=0.0, alpha=0.1
  • These are the models presented in Table 6 in the paper. R is the dimension of the representation, sigma is the Gaussian noise standard deviation, and alpha is the scaling factor the additional loss term l

Train

  • python geo_train.py
  • Configurations are all defined in config.py
  • The default values are the optimal hyper-parameter settings used in the paper
  • Note that the first epoch can take a long time to finish (potentially 6+ hours), but subsequent epochs should run fairly quickly. The slow start is due to network initialisation.
  • On a single K80 GPU, it takes around 25-30 hours to train 10 epochs on the full training data.

Test

usage: geo_test.py [-h] -m MODEL_DIR [-d INPUT_DOC] [-l INPUT_LABEL]
                   [--predict] [--save_rep SAVE_REP] [--save_label SAVE_LABEL]
                   [--save_mat SAVE_MAT] [--print_attn] [--print_time]

Given trained model, perform various test inferences

optional arguments:
  -h, --help            show this help message and exit
  -m MODEL_DIR, --model_dir MODEL_DIR
                        directory of the saved model
  -d INPUT_DOC, --input_doc INPUT_DOC
                        input file containing the test documents
  -l INPUT_LABEL, --input_label INPUT_LABEL
                        input file containing the test labels
  --predict             classify test instances and compute accuracy
  --save_rep SAVE_REP   save representation (thresholded and converted to
                        binary) of test instances
  --save_label SAVE_LABEL
                        save label of test instances
  --save_mat SAVE_MAT   save representation (floats) and label of test
                        instances in MAT format
  --print_attn          print attention on text span
  --print_time          print time, offset and usertime distribution for
                        popular locations

Example: Compute test data accuracy

python geo_test.py -m MODEL_DIR -d data/test/data.tweet.json -l data/test/label.tweet.json --predict

Example: Print character attention in test data

python geo_test.py -m MODEL_DIR -d data/test/data.tweet.json -l data/test/label.tweet.json --print_attn

Compute MAP Performance

  • Example script is given in: compute_map.sh
  • The idea is to first generate binary code representation for train and test data, and then use retrieval_map to compute hamming distance and MAP
  • In the example script, we use the validation data as the train data, as it is much smaller

Publication

Lau, Jey Han, Lianhua Chi, Khoi-Nguyen Tran and Trevor Cohn (to appear). End-to-end Network for Twitter Geolocation Prediction and Hashing. In Proceedings of the 8th International Joint Conference on Natural Language Processing (IJCNLP 2017), Taipei, Taiwan.