The TensorFlow implementation of our ACL 2018 paper:
A Deep Relevance Model for Zero-Shot Document Filtering, Chenliang Li, Wei Zhou, Feng Ji, Yu Duan, Haiqing Chen
Paper url: http://aclweb.org/anthology/P18-1214
- Python 3.5
- TensorFlow 1.2
- Numpy
- Traitlets
Prepare your dataset: first, prepare your own data. See the Data Preparation section below.
Configure: then, configure the model through the config file. Configurable parameters are listed under Model Configurations below.
See the example: sample.config
In addition, you need to change the zero-shot label settings in get_label.py.
(Make sure that get_label.py and model.py are in the same directory.)
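The exact splits live in get_label.py and depend on your dataset; as a purely hypothetical sketch of the kind of setting it controls (the names below are illustrative, not the repository's actual API):

```python
# Hypothetical sketch of a zero-shot label setting; the real get_label.py
# defines its own splits -- adapt names and labels to your own data.

# Each entry maps a (zsl_num, zsl_type) pair to the labels held out at
# training time and used only for zero-shot testing.
ZERO_SHOT_SPLITS = {
    (1, 1): ['alt.atheism'],
    (1, 2): ['sci.space'],
    (2, 1): ['alt.atheism', 'sci.space'],
}

def get_zero_shot_labels(zsl_num, zsl_type):
    """Return the held-out label names for one experimental setting."""
    return ZERO_SHOT_SPLITS[(zsl_num, zsl_type)]
```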
Training: pass the config file, training data, and validation data as follows:
python model.py config-file\
--train \
--train_file: path to training data\
--validation_file: path to validation data\
--checkpoint_dir: directory to store/load model checkpoints\
--load_model: True or False. Set True to continue training from an existing checkpoint, or False to start from a new model
See example: sample-train.sh
Testing: pass the config file and testing data as follows:
python model.py config-file\
--test \
--test_file: path to testing data\
--test_size: size of testing data (number of testing samples)\
--checkpoint_dir: directory to load trained model\
--output_score_file: file to output document scores
Relevance scores will be written to output_score_file, one score per line, in the same order as the samples in test_file.
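Because the scores come back in input order, you can join them with the test file and rank documents directly. A minimal sketch, with placeholder file names:

```python
# Pair each test line with its relevance score and rank documents;
# 'test.txt' and 'scores.txt' are placeholders for your own paths.
with open('test.txt') as f_test, open('scores.txt') as f_score:
    pairs = [(float(s), line.rstrip('\n'))
             for s, line in zip(f_score, f_test)]

# Highest-scoring (most relevant) documents first.
for score, line in sorted(pairs, key=lambda p: p[0], reverse=True)[:10]:
    print(score, line)
```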
Data Preparation
All seed words and documents must be mapped into sequences of integer term ids. Term ids start at 1.
Training Data Format
Each training sample is a tuple of (seed words, positive document, negative document)
seed_words \t positive_document \t negative_document
Example: 334,453,768 \t 123,435,657,878,6,556 \t 443,554,534,3,67,8,12,2,7,9
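As a sketch of producing such a line from tokenized text, assuming a word2id dict as described under Word2id File Format below (the helper names are illustrative):

```python
def encode(tokens, word2id):
    """Map tokens to comma-separated term ids, skipping unknown words."""
    return ','.join(str(word2id[t]) for t in tokens if t in word2id)

def training_line(seed_words, pos_doc, neg_doc, word2id):
    """Build one tab-separated training sample from three token lists."""
    return '\t'.join(encode(t, word2id)
                     for t in (seed_words, pos_doc, neg_doc))

# e.g. training_line(['god', 'atheist'], pos_tokens, neg_tokens, word2id)
```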
Testing Data Format
Each testing sample is a tuple of (seed words, document)
seed_words \t document
Example: 334,453,768 \t 123,435,657,878,6,556
Validation Data Format
The format is the same as the training data format.
Label Dict File Format
Each line is a tuple of (label_name, seed_words)
label_name/seed_words
Example: alt.atheism/atheist christian atheism god islamic
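A minimal parser sketch for this format (the first '/' separates the label from its space-separated seed words):

```python
def load_label_dict(path):
    """Parse 'label_name/seed words ...' lines into {label: [seed words]}."""
    label_dict = {}
    with open(path) as f:
        for line in f:
            label, seeds = line.rstrip('\n').split('/', 1)
            label_dict[label] = seeds.split()
    return label_dict
```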
Word2id File Format
Each line is a tuple of (word, id)
word id
Example: world 123
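A matching loader sketch:

```python
def load_word2id(path):
    """Parse 'word id' lines into a {word: int id} dict (ids start at 1)."""
    word2id = {}
    with open(path) as f:
        for line in f:
            word, idx = line.split()
            word2id[word] = int(idx)
    return word2id
```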
Embedding File Format
Each line is a tuple of (id, embedding)
id embedding
Example: 1 0.3 0.4 0.5 0.6 -0.4 -0.2
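A loader sketch that packs these lines into a dense matrix; reserving row 0 for padding is an assumption here (term ids start at 1), not documented repository behavior:

```python
import numpy as np

def load_embeddings(path, vocabulary_size, embedding_size):
    """Parse 'id v1 v2 ...' lines into a (vocabulary_size + 1, embedding_size)
    matrix; row 0 is left as zeros for padding, since term ids start at 1."""
    emb = np.zeros((vocabulary_size + 1, embedding_size), dtype=np.float32)
    with open(path) as f:
        for line in f:
            parts = line.split()
            emb[int(parts[0])] = [float(v) for v in parts[1:]]
    return emb
```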
Model Configurations
- BaseNN.embedding_size: embedding dimension of words
- BaseNN.max_q_len: max query length
- BaseNN.max_d_len: max document length
- DataGenerator.max_q_len: max query length; should be the same as BaseNN.max_q_len
- DataGenerator.max_d_len: max document length; should be the same as BaseNN.max_d_len
- BaseNN.vocabulary_size: vocabulary size
- DataGenerator.vocabulary_size: vocabulary size
- BaseNN.batch_size: batch size
- BaseNN.max_epochs: max number of epochs to train
- BaseNN.eval_frequency: evaluate the model on the validation set every this many epochs
- BaseNN.checkpoint_steps: save a model checkpoint every this many epochs
Data
- DAZER.emb_in: path of the initial embeddings file
- DAZER.label_dict_path: path of the label dict file
- DAZER.word2id_path: path of the word2id file
Training Parameters
- DAZER.epsilon: epsilon for the Adam optimizer
- DAZER.embedding_size: embedding dimension of words
- DAZER.vocabulary_size: vocabulary size of the dataset
- DAZER.kernal_width: width of each convolution kernel
- DAZER.kernal_num: number of convolution kernels
- DAZER.regular_term: weight of the L2 loss
- DAZER.maxpooling_num: k value for the k-max pooling
- DAZER.decoder_mlp1_num: number of hidden units of the first MLP in the relevance aggregation part
- DAZER.decoder_mlp2_num: number of hidden units of the second MLP in the relevance aggregation part
- DAZER.model_learning_rate: learning rate for the model, excluding the adversarial classifier
- DAZER.adv_learning_rate: learning rate for the adversarial classifier
- DAZER.train_class_num: number of classes seen at training time
- DAZER.adv_term: weight of the adversarial loss when updating the model's parameters
- DAZER.zsl_num: number of zero-shot labels
- DAZER.zsl_type: which zero-shot label setting to use (you may have multiple zero-shot settings with the same number of zero-shot labels; this picks the one used in the experiment; see get_label.py for details)
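For orientation, a hypothetical config sketch in the Python-file style accepted by traitlets' config loader (traitlets is a listed dependency; the `c` object is provided by the loader itself). All values are placeholders; see sample.config for the authoritative format:

```python
# Hypothetical config sketch; values are placeholders, not recommendations.
c.BaseNN.embedding_size = 300
c.BaseNN.max_q_len = 10
c.BaseNN.max_d_len = 500
c.DataGenerator.max_q_len = 10    # must match BaseNN.max_q_len
c.DataGenerator.max_d_len = 500   # must match BaseNN.max_d_len
c.BaseNN.vocabulary_size = 100000
c.DataGenerator.vocabulary_size = 100000
c.DAZER.emb_in = 'path/to/embedding_file'
c.DAZER.label_dict_path = 'path/to/label_dict_file'
c.DAZER.word2id_path = 'path/to/word2id_file'
c.DAZER.kernal_width = 5
c.DAZER.kernal_num = 50
c.DAZER.zsl_num = 1
c.DAZER.zsl_type = 1
```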