- This repo contains the code for the paper Bi-Directional Block Self-Attention for Fast and Memory-Efficient Sequence Modeling.
- The deep learning framework is TensorFlow 1.3 (compatible with 1.2 and 1.2.1).
- Please contact the author or open an issue for questions and suggestions.
- The code for the Directional Self-Attention Network (DiSAN), the basis of Bi-BloSAN, is available here.
News: A time- and memory-efficient self-attention mechanism named Fast-DiSA has been proposed, which is as fast as multi-head self-attention while using the multi-dim and positional mask techniques. The code is released here.
Cite this paper using BibTeX:
@inproceedings{shen2018biblosan,
Author = {Shen, Tao and Zhou, Tianyi and Long, Guodong and Jiang, Jing and Zhang, Chengqi},
Booktitle = {International Conference on Learning Representations (ICLR)},
Title = {Bi-Directional Block Self-Attention for Fast and Memory-Efficient Sequence Modeling},
Year = {2018}
}
- Python3 (verified on 3.5.2, or Anaconda3 4.2.0)
- tensorflow>=1.2
- NumPy
- We provide a universal interface for sequence encoding layers, which includes the proposed Bi-directional Block Self-Attention Network (Bi-BloSAN), Directional Self-Attention Network (DiSAN), CNN, Bi-LSTM, Bi-GRU, Bi-SRU and multi-head attention, as well as the corresponding context fusion models.
- Code for experiments on various benchmarks, i.e., Stanford Natural Language Inference (SNLI), SICK, simplified SQuAD, TREC question-type classification, fine-grained/binary Stanford Sentiment Treebank, Customer Reviews, MPQA and SUBJ.
The code is stored in the directory context_fusion of this repo; to use it, just import the functions from the package context_fusion:
`from context_fusion.interface import context_fusion_layers, sentence_encoding_models`
These two functions share similar parameter definitions:
- param rep_tensor: [tf Float 3D Tensor] The rank must be 3, i.e., [batch_size, seq_len, feature_dim];
- param rep_mask: [tf Bool 2D Tensor] The mask for input with rank 2: [batch_size, seq_len];
- param method: [str] one of the context fusion / sentence encoding methods listed in the table below;
- param activation_function: [str] in (relu|elu|slu)
- param scope: [str] The name of Tensorflow variable scope
- param wd: [float >= 0] if wd > 0, add the related tensors to the tf collection "weight_decay" for further L2 decay;
- param is_train: [tf Bool Scalar] indicator of training or test mode;
- param keep_prob: [float] dropout keep probability
- param hn: [int/None] number of hidden units; if None, hn is set to the input feature dim in some methods;
- other: if the method is set to block, you can additionally pass `block_len=5` to specify the block length in Bi-BloSA.
Method Str | Explanation | Context Fusion | Sentence Encoding |
---|---|---|---|
cnn_kim | CNN from Yoon Kim (sentence encoding only) | F | T |
no_ct | No context | F | T |
lstm | Bi-LSTM | T | T |
gru | Bi-GRU | T | T |
sru | Bi-SRU (Simple Recurrent Unit) | T | T |
multi_cnn | Multi-window CNN with context info added | T | T |
hrchy_cnn | Multi-layer CNN with resConnect + GLU | T | T |
multi_head | Multi-head attention with attention dropout and positional encoding | T | T |
disa | Directional Self-Attention (DiSA) | T | T |
block | Bi-directional Block Self-Attention (Bi-BloSA) | T | T |
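For example, here is a minimal usage sketch of the sentence encoding interface, assuming the keyword-argument names match the parameter list above (see context_fusion/interface.py for the exact signature and argument order):

```python
import tensorflow as tf
from context_fusion.interface import sentence_encoding_models

# Hypothetical inputs: a batch of token-embedding sequences and their boolean mask.
rep_tensor = tf.placeholder(tf.float32, [None, None, 300])  # [batch_size, seq_len, feature_dim]
rep_mask = tf.placeholder(tf.bool, [None, None])            # [batch_size, seq_len]
is_train = tf.placeholder(tf.bool, [])                      # training/test indicator

# Sentence encoding with Bi-BloSA ("block"); block_len is the extra argument for this method.
sent_rep = sentence_encoding_models(
    rep_tensor, rep_mask, method='block', activation_function='relu',
    scope='bi_blosa_encoding', wd=1e-4, is_train=is_train, keep_prob=0.8,
    hn=300, block_len=5)
```

Replacing `method='block'` with any other method string from the table (and dropping `block_len`) switches the encoder; `context_fusion_layers` is expected to be called with the same arguments but to produce token-level context fusion representations rather than a single sentence vector.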
- Directory exp_SNLI --- python project for Stanford Natural Language Inference dataset
- Directory exp_SICK --- python project for Sentences Involving Compositional Knowledge dataset
- Directory exp_SQuAD_sim --- python project for simplified Stanford Question Answering dataset
- Directory exp_SST --- python project for fine-grained and binary Stanford Sentiment Treebank dataset
- Directory exp_TREC --- python project for TREC question-type classification dataset
- Directory exp_SC --- python project for three sentence classification benchmarks: Customer Reviews, MPQA and SUBJ.
Here, we introduce the shared parameters which appear in all these benchmark projects:
- `--network_type`: [str] use "exp_context_fusion" to reproduce our experiments;
- `--log_period`: [int] step period to save the summary;
- `--eval_period`: [int] step period to evaluate the model on the dev set;
- `--gpu`: [int] GPU index to run the code;
- `--gpu_mem`: [None or float] if None, allow soft memory placement; otherwise, use a fixed proportion of the total GPU memory;
- `--save_model`: [Bool] whether to save the top-3 eval checkpoints;
- `--mode`: [str] all projects have a "train" mode to train the model; only some projects have a "test" mode, please refer to the main script of each project;
- `--load_model`: [Bool] if the project has a "test" mode, set this to True to load a checkpoint for testing by specifying its path in `--load_path`;
- `--model_dir_suffix`: [str] a name for the model dir (the model dir is part of the programming framework introduced later);
- `--num_steps`: [int] number of training steps for mini-batch SGD;
- `--train_batch_size`: [int] training batch size;
- `--test_batch_size`: [int] dev and test batch size. Note that in SQuAD, `--test_batch_size_gain` is used instead: test_batch_size = test_batch_size_gain * train_batch_size;
- `--word_embedding_length`: [int] word embedding length for loading the GloVe pre-trained models, which could be 100|200|300 depending on the data provided;
- `--glove_corpus`: [str] GloVe corpus name, which could be 6B|42B|840B etc.;
- `--dropout`: [float] dropout keep probability;
- `--wd`: [float] L2 regularization decay factor;
- `--hidden_units_num`: [int] number of hidden units. Note that we fix the hidden units number for all context_fusion models on all projects except SQuAD;
- `--optimizer`: [str] mini-batch SGD optimizer, could be adam|adadelta|rmsprop;
- `--learning_rate`: [float] initial learning rate for the optimizer;
- `--context_fusion_method`: [str] context fusion method for the context_fusion `network_type`, could be block|lstm|gru|sru|sru_normal|cnn|cnn_kim|multi_head|multi_head_git|disa|no_ct;
- `--block_len`: [None or int] if `context_fusion_method` is set to block, this int is the block length for Bi-BloSA; if None, the program will automatically calculate a block length according to the dataset characteristics.
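As an illustration, here is a hypothetical training command for the SNLI project; the entry script name follows the `$task$_main.py` pattern described below, and all flag values are placeholders rather than recommended settings:

```
python snli_main.py --network_type exp_context_fusion --context_fusion_method block --block_len 5 \
    --dropout 0.75 --wd 1e-4 --hidden_units_num 300 --optimizer adadelta --learning_rate 0.5 \
    --train_batch_size 64 --gpu 0 --model_dir_suffix block_demo --mode train
```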
We first show the directory tree shared by all these projects:
ROOT
--dataset[d]
----glove[d]
----$task_dataset_name$[d]
--src[d]
----model[d]
------template.py[f]
------context_fusion.py[f]
----nn_utils[d]
----utils[d]
------file.py[f]
------nlp.py[f]
------record_log.py[f]
------time_counter.py[f]
----dataset.py[f]
----evaluator.py[f]
----graph_handler.py[f]
----perform_recorder.py[f]
--result[d]
----processed_data[d]
----model[d]
------$model_specific_dir$[d]
--------ckpt[d]
--------log_files[d]
--------summary[d]
--------answer[d]
--configs.py[f]
--$task$_main.py[f]
--$task$_log_analysis.py[f]
Note: The result dir will appear after the first run.
We elaborate on every file [f] and directory [d] as follows:
- `./configs.py`: performs parameter parsing and the definition/declaration of global variables, e.g., parameter definitions and default values, names (of train/dev/test data, model, processed data, ckpt, etc.), directories (of data, result, `$model_specific_dir$`, etc.) and generation of the corresponding paths.
- `./$task$_main.py`: the main entry Python script to run the project;
- `./$task$_log_analysis.py`: provides a function to analyze the log file of the training process.
- `./dataset/`: the directory containing the datasets for the current project.
- `./dataset/glove/`: contains the pre-trained GloVe files.
- `./dataset/$task_dataset_name$/`: the dataset dir for the current task; this is introduced concretely in each project dir.
- `./src/`: dir containing the Python scripts.
- `./src/dataset.py`: a class to process raw data from the dataset, including tokenization, token dictionary generation, data digitization and neural network data generation. It also provides `generate_batch_sample_iter` for random mini-batch iteration, `get_statistic` for sentence length statistics, and an interface for removing training samples with overly long sentences.
- `./src/evaluator.py`: a class for model evaluation.
- `./src/graph_handler.py`: a class for handling the graph: session initialization, summary saving, model restoring, etc.
- `./src/perform_recoder.py`: a class to save the top-n dev-accuracy model checkpoints for future loading.
- `./src/model/`: the dir containing the TensorFlow model files.
- `./src/model/template.py`: an abstract Python class that includes the network placeholders, global TensorFlow tensor variables, the TF loss and accuracy functions, EMA for learnable variables and summaries, the training operation, feed-dict generation and the training step function.
- `./src/model/context_fusion.py`: the main TF neural network model, implementing the abstract interface `build_network` inherited from `template.py`.
- `./src/nn_utils/`: a package with various TensorFlow layers implemented by the repo author.
- `./src/utils/file.py`: file I/O functions.
- `./src/utils/nlp.py`: natural language processing functions.
- `./src/utils/record_log.py`: a log recorder class, with a corresponding instance used throughout the current project.
- `./src/utils/time_counter.py`: a time counter class to collect the training time; note that this excludes data preparation and counts only the time spent in training steps.
- `./result/`: a dir to place the results.
- `./result/processed_data/`: a dir to place pickled Dataset instances. The file name is generated by `get_params_str` in `./configs.py` according to the related parameters.
- `./result/model/$model_specific_dir$/`: the name of this dir is generated by `get_params_str` in `./configs.py` according to the related parameters, so that each combination of parameters is associated with its own dir for saving results.
- `./result/model/$model_specific_dir$/ckpt/`: a dir to save the top-n model checkpoints.
- `./result/model/$model_specific_dir$/log_files/`: a dir to save log files.
- `./result/model/$model_specific_dir$/summary/`: a dir to save TensorBoard summaries and TensorFlow graph meta files.
- `./result/model/$model_specific_dir$/answer/`: a dir to save extra prediction results for some of the projects.
- tqdm
- nltk (please download Models/punkt)
git clone https://github.com/code4review/BiBloSA
cd $project_dir$
Then, based on the parameter introduction and the programming framework above, please refer to the `README.md` in `$project_dir$` for data preparation, data processing and network training.
Note that since this repo includes many projects, some mistakes may have been introduced while organizing them into this repo. If you encounter bugs or errors when running the code, please feel free to report them by opening an issue. I will reply ASAP.
- clean up and organize all neural net code in the folder context_fusion (not tested so far)
- add the cleaned code to the SNLI project (not tested so far)
- test the code updated above