
The Tensorflow implementation of accepted ACL 2018 paper "A deep relevance model for zero-shot document filtering", Chenliang Li, Wei Zhou, Feng Ji, Yu Duan, Haiqing Chen, http://aclweb.org/anthology/P18-1214


DAZER

The TensorFlow implementation of our ACL 2018 paper:
A Deep Relevance Model for Zero-Shot Document Filtering, Chenliang Li, Wei Zhou, Feng Ji, Yu Duan, Haiqing Chen
Paper URL: http://aclweb.org/anthology/P18-1214

Requirements

  • Python 3.5
  • TensorFlow 1.2
  • NumPy
  • Traitlets

Guide To Use

Prepare your dataset: first, prepare your own data in the formats described under Data Preparation.

Configure: then, configure the model through the config file. Configurable parameters are listed under Configurations.

See the example: sample.config

In addition, you need to change the zero-shot label settings in get_label.py.

(Make sure that get_label.py and model.py are in the same directory.)

Training: pass the config file, training data, and validation data as follows:

python model.py config-file\
    --train \
    --train_file: path to training data\
    --validation_file: path to validation data\
    --checkpoint_dir: directory to store/load model checkpoints\
    --load_model: True or False. True continues training from an existing checkpoint; False starts with a new model

See example: sample-train.sh

Testing: pass the config file and testing data as follows:

python model.py config-file\
    --test \
    --test_file: path to testing data\
    --test_size: size of testing data (number of testing samples)\
    --checkpoint_dir: directory to load trained model\
    --output_score_file: file to write the document scores to

Relevance scores will be output to output_score_file, one score per line, in the same order as test_file.
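
Because the scores follow the order of test_file, they can be joined back to the test samples line by line. The following is a minimal sketch of that join, not part of the repository: the function name `rank_documents` and the in-memory sample lines are hypothetical, and it assumes that a higher score means a more relevant document.

```python
def rank_documents(test_lines, score_lines):
    """Pair each test sample with its score and sort by score, highest first.

    Assumption: line i of the score file belongs to line i of the test file,
    and a higher score means a more relevant document.
    """
    tests = [ln.rstrip("\n") for ln in test_lines if ln.strip()]
    scores = [float(ln) for ln in score_lines if ln.strip()]
    if len(tests) != len(scores):
        raise ValueError(
            "expected one score per test sample, got %d scores for %d samples"
            % (len(scores), len(tests))
        )
    return sorted(zip(scores, tests), key=lambda pair: pair[0], reverse=True)


# Hypothetical contents of test_file and output_score_file.
test_lines = ["334,453,768\t123,435,657\n", "334,453,768\t443,554\n"]
score_lines = ["0.12\n", "0.87\n"]

ranked = rank_documents(test_lines, score_lines)
for score, sample in ranked:
    print(score, sample)
```

In practice the two lists would come from `open(test_file)` and `open(output_score_file)`; the length check catches a mismatch between the two files early.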

Data Preparation

All seed words and documents must be mapped into sequences of integer term ids. Term ids start at 1.

Training Data Format

Each training sample is a tuple of (seed words, positive document, negative document)

seed_words \t positive_document \t negative_document

Example: 334,453,768 \t 123,435,657,878,6,556 \t 443,554,534,3,67,8,12,2,7,9
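
To make the layout concrete, here is a minimal parsing sketch. It is not taken from the repository: the helper names `parse_id_list` and `parse_training_line` are hypothetical, and the sketch assumes the three fields are separated by a tab character and the term ids within a field by commas.

```python
def parse_id_list(field):
    """Turn a comma-separated field such as '334,453,768' into a list of ints."""
    return [int(tok) for tok in field.strip().split(",") if tok.strip()]


def parse_training_line(line):
    """Split one training line into (seed ids, positive doc ids, negative doc ids)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        raise ValueError("expected 3 tab-separated fields, got %d" % len(fields))
    seed_ids, pos_ids, neg_ids = (parse_id_list(f) for f in fields)
    return seed_ids, pos_ids, neg_ids


sample = "334,453,768\t123,435,657,878,6,556\t443,554,534,3,67,8,12,2,7,9"
seed_ids, pos_ids, neg_ids = parse_training_line(sample)
print(seed_ids)  # the seed word ids
print(pos_ids)   # the positive document
print(neg_ids)   # the negative document
```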

Testing Data Format

Each testing sample is a tuple of (seed words, document)

seed_words \t document

Example: 334,453,768 \t 123,435,657,878,6,556
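
A test file in this format can be produced with a few lines of Python. The sketch below is illustrative only: `format_id_list` and `format_test_line` are hypothetical helpers, the second sample is made up, and a tab separator between the two fields is assumed. The number of lines written is the value to pass as --test_size.

```python
def format_id_list(ids):
    """Join term ids with commas, rejecting ids below 1 (term ids start at 1)."""
    ids = list(ids)
    if any(i < 1 for i in ids):
        raise ValueError("term ids must start at 1")
    return ",".join(str(i) for i in ids)


def format_test_line(seed_ids, doc_ids):
    """Build one 'seed_words <tab> document' line."""
    return format_id_list(seed_ids) + "\t" + format_id_list(doc_ids)


samples = [
    ([334, 453, 768], [123, 435, 657, 878, 6, 556]),
    ([334, 453, 768], [443, 554, 534, 3, 67]),  # hypothetical second document
]
lines = [format_test_line(seed, doc) for seed, doc in samples]
test_size = len(lines)  # number of testing samples

print("\n".join(lines))
print("test_size =", test_size)
```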

Validation Data Format

The format is the same as the training data format.

Label Dict File Format

Each line is a tuple of (label_name, seed_words)

label_name/seed_words

Example: alt.atheism/atheist christian atheism god islamic
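
A line in this format can be split as sketched below. This is an assumption-laden illustration rather than the repository's loader: `parse_label_dict_line` is a hypothetical name, the label is taken to be everything before the first "/", and the seed words are assumed to be separated by whitespace.

```python
def parse_label_dict_line(line):
    """Split 'label_name/seed_words' into (label_name, [seed words])."""
    label, sep, seeds = line.strip().partition("/")
    if not sep:
        raise ValueError("missing '/' between label name and seed words")
    return label, seeds.split()


label, seed_words = parse_label_dict_line(
    "alt.atheism/atheist christian atheism god islamic"
)
print(label)
print(seed_words)
```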

Word2id File Format

Each line is a tuple of (word, id)

word id

Example: world 123
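
The word2id file is what turns seed words and document tokens into the integer term ids required by the data files. A minimal sketch of that mapping follows, under stated assumptions: `load_word2id` and `encode` are hypothetical helpers, the word and id are assumed to be separated by whitespace, words missing from the vocabulary are simply skipped, and every id other than `world 123` is made up for illustration.

```python
def load_word2id(lines):
    """Read 'word id' lines into a dict mapping word -> integer id."""
    word2id = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        word, idx = line.rsplit(None, 1)  # split on the last run of whitespace
        word2id[word] = int(idx)
    return word2id


def encode(words, word2id):
    """Map words to term ids, skipping out-of-vocabulary words (an assumption)."""
    return [word2id[w] for w in words if w in word2id]


# 'god 7' and 'atheism 42' are hypothetical entries.
word2id = load_word2id(["world 123", "god 7", "atheism 42"])
ids = encode(["atheism", "god", "unknownword"], word2id)
print(",".join(str(i) for i in ids))
```

The comma-joined output is the form expected in the seed_words and document fields of the training and testing files.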

Embedding File Format

Each line is a tuple of (id, embedding)

id embedding

Example: 1 0.3 0.4 0.5 0.6 -0.4 -0.2
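
For illustration, an embedding file in this format could be read as follows. The sketch uses only the standard library and is not the repository's loader: `load_embeddings` is a hypothetical name, whitespace separation is assumed, and the dimension check assumes each vector has as many values as the configured embedding size.

```python
def load_embeddings(lines, embedding_size):
    """Read 'id v1 v2 ...' lines into a dict mapping term id -> list of floats."""
    table = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        term_id = int(parts[0])
        vector = [float(x) for x in parts[1:]]
        if len(vector) != embedding_size:
            raise ValueError(
                "term %d has %d values, expected %d"
                % (term_id, len(vector), embedding_size)
            )
        table[term_id] = vector
    return table


embeddings = load_embeddings(["1 0.3 0.4 0.5 0.6 -0.4 -0.2"], embedding_size=6)
print(embeddings[1])
```

The resulting dict can be converted to a NumPy array if a dense matrix indexed by term id is needed.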

Configurations

Model Configurations

  • BaseNN.embedding_size: embedding dimension of word
  • BaseNN.max_q_len: max query length
  • BaseNN.max_d_len: max document length
  • DataGenerator.max_q_len: max query length. Should be the same as BaseNN.max_q_len
  • DataGenerator.max_d_len: max document length. Should be the same as BaseNN.max_d_len
  • BaseNN.vocabulary_size: vocabulary size
  • DataGenerator.vocabulary_size: vocabulary size
  • BaseNN.batch_size: batch size
  • BaseNN.max_epochs: max number of epochs to train
  • BaseNN.eval_frequency: evaluate the model on the validation set every this many epochs
  • BaseNN.checkpoint_steps: save the model every this many epochs
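
Since the length settings must agree between BaseNN and DataGenerator, a quick check before training can catch a mismatch. The sketch below is only an illustration: it works on a plain dict rather than the actual config file, `find_mismatches` is a hypothetical helper, and all values are placeholders.

```python
# Pairs of settings that the list above says should be the same.
PAIRED_KEYS = [
    ("BaseNN.max_q_len", "DataGenerator.max_q_len"),
    ("BaseNN.max_d_len", "DataGenerator.max_d_len"),
]


def find_mismatches(settings):
    """Return the (key_a, key_b) pairs whose values differ."""
    return [(a, b) for a, b in PAIRED_KEYS if settings.get(a) != settings.get(b)]


# Placeholder values, not recommended settings.
settings = {
    "BaseNN.max_q_len": 5,
    "DataGenerator.max_q_len": 5,
    "BaseNN.max_d_len": 100,
    "DataGenerator.max_d_len": 200,  # deliberately inconsistent
}
mismatches = find_mismatches(settings)
for a, b in mismatches:
    print("mismatch: %s=%r vs %s=%r" % (a, settings[a], b, settings[b]))
```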

Data

  • DAZER.emb_in: path of initial embeddings file
  • DAZER.label_dict_path: path of label dict file
  • DAZER.word2id_path: path of word2id file

Training Parameters

  • DAZER.epsilon: epsilon for Adam Optimizer
  • DAZER.embedding_size: embedding dimension of word
  • DAZER.vocabulary_size: vocabulary size of the dataset
  • DAZER.kernal_width: width of the kernel
  • DAZER.kernal_num: num of kernel
  • DAZER.regular_term: weight of L2 loss
  • DAZER.maxpooling_num: num of K-max pooling
  • DAZER.decoder_mlp1_num: num of hidden units of first mlp in relevance aggregation part
  • DAZER.decoder_mlp2_num: num of hidden units of second mlp in relevance aggregation part
  • DAZER.model_learning_rate: learning rate for the model (as opposed to the adversarial classifier)
  • DAZER.adv_learning_rate: learning rate for the adversarial classifier
  • DAZER.train_class_num: number of classes at training time
  • DAZER.adv_term: weight of adversarial loss when updating model's parameters
  • DAZER.zsl_num: num of zero-shot labels
  • DAZER.zsl_type: type of zero-shot label setting (there may be multiple zero-shot settings for the same number of zero-shot labels; this indicates which setting you pick for the experiment, see get_label.py for more details)