/neural-tweet-search

Multi-Perspective Relevance Matching with Hierarchical ConvNets for Social Media Search (Rao et al. AAAI'19)

Primary LanguagePythonApache License 2.0Apache-2.0

Multi-Perspective Relevance Matching with Hierarchical ConvNets for Social Media Search

This repo contains code and data for our neural tweet search paper published in AAAI'19.

Given a query, we aim to return the most relevant documents(tweets) by ranking their relevency. In social media search, the scenario is different as standard ad-hoc retrieval: shorter document length, less formal languages and multiple relevance source signals (e.g., URL, hashtag). We propose a hierarchical convolutional model to approach the hetergeneous relevance signals (tweet, URL, hashtag) at multiple perspectives, including character-, word-, phrase- and sentence-level modeling. Our model demonstrated significant gains on multiple twitter datasets against state-of-the-art neural ranking models. More details can be found in our paper.

Requirements

  • Python 2.7
  • Tensorflow or Theano (tested on TF 1.4.1)
  • Keras (tested on 2.0.5)

Install

  • Download our repo:
git clone https://github.com/Jeffyrao/neural-tweet-search.git
cd neural-tweet-search
  • Install gdrive
  • Download required data and word2vec:
$ chmod +x download.sh; ./download.sh
  • Install Tensorflow and Keras dependency:
$ pip install -r requirements.txt

Run

  • Train and test on GPU:
CUDA_VISIBLE_DEVICES=0 python -u train.py -t trec-2013

The path of best model and output predictions will be shown in the log. Default parameters should work reasonably well.

  • Note: you might need around ~40GB memory to create the dataset (because of the large size of IDF weights). Please file a issue if you have any problem in creating the dataset.

  • Parameter sweep to find the best parameter set:

chmod +x param_sweep.sh; ./param_sweep.sh trec-2013 &

This command will save all the outputs under tune-logs folder.

Evaluate with trec_eval

$ ./trec_eval.8.1/trec_eval data/twitter-v0/qrels.microblog2011-2014.txt \
                            best_run/mphcnn_trec_2013_pred.txt

This should return the exact MPHCNN score on TREC 2013 dataset (MAP: 0.2818, P30: 0.5222) we reported in our paper.

Command line parameters

option input format default description
-t [trec-2011, trec-2012, trec-2013, trec-2014] trec-2011 test set
-l [true, false] false whether to load pre-created dataset (set to true when data is ready)
--load_model [true, false] false whether to load pre-trained model
-b [1, n) 64 batch size
-n [1, n) 256 number of convolutional filters
-d [0, 1] 0.1 dropout rate
-o [sgd, adam, rmsprop] sgd optimization method
--lr [0, 1] 0.05 learning rate
--epochs [1, n) 15 number of training epochs
--trainable [true, false] true whether to train word embeddings
--val_split (0, 1) 0.15 percentage of validation set sampled from training set
-v [0, 1, 2] 1 verbose (for logging), 0 for silent, 1 for interactive, 2 for per-epoch logging
--conv_option [normal, ResNet] normal convolutional model, normal or ResNet
--model_option [complete, word-url] complete what input sources to use, complete for MP-HCNN, word-url for only modeling query-tweet (word) and query-url (char)

Reference

If you are using this code or dataset, please kindly cite the paper below:

@article{rao2019multi,
  title={Multi-Perspective Relevance Matching with Hierarchical ConvNets for Social Media Search},
  author={Rao, Jinfeng and Yang, Wei and Zhang, Yuhao and Ture, Ferhan and Lin, Jimmy},
  journal={Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI)},
  year={2019}
}