
Submission for WikiText-103 Language Modeling task in MicroNet Challenge


Jaist team - MicroNet Challenge

Overview

In the MicroNet Challenge, we consider the "WikiText-103 Language Modeling" task. The task asks for a trained model with as few parameters and math operations as possible, under the constraint that its perplexity on the test set stays below 35. With this goal, we propose an approach based on the QRNN (Quasi-Recurrent Neural Network) model. By tuning parameters, we reduce the number of parameters while keeping the perplexity below 35. The tuning focuses on three parameters: sequence length, embedding size, and number of hidden units per layer. After these changes, the number of parameters of our model is reduced by approximately 32% compared to the default model.

Results: 209.811 MBytes of parameter storage, 156.857 MFLOPS

Method

Base Model

Our approach is based on Quasi-Recurrent Neural Networks (QRNN), an approach to neural sequence modeling that alternates convolutional layers, which apply in parallel across timesteps, and a minimalist recurrent pooling function that applies in parallel across channels (see the QRNN paper).
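To make this structure concrete, below is a minimal, self-contained sketch of a single QRNN layer with fo-pooling; the class name and default window size are our own illustration, not code from this repository. The convolution produces the gates for all timesteps at once, and only the cheap element-wise pooling step runs sequentially.

import torch
import torch.nn as nn

class QRNNLayerSketch(nn.Module):
    # Illustrative single QRNN layer with fo-pooling (sketch only).
    def __init__(self, input_size, hidden_size, window=2):
        super().__init__()
        # One 1-D convolution yields candidate (z), forget (f) and output (o)
        # gates for every timestep in parallel.
        self.conv = nn.Conv1d(input_size, 3 * hidden_size, kernel_size=window,
                              padding=window - 1)
        self.hidden_size = hidden_size

    def forward(self, x):  # x: (batch, time, input_size)
        gates = self.conv(x.transpose(1, 2))[:, :, :x.size(1)]  # causal trim
        z, f, o = gates.transpose(1, 2).chunk(3, dim=-1)
        z, f, o = torch.tanh(z), torch.sigmoid(f), torch.sigmoid(o)
        # fo-pooling: the only sequential step, element-wise across channels.
        c = x.new_zeros(x.size(0), self.hidden_size)
        outputs = []
        for t in range(x.size(1)):
            c = f[:, t] * c + (1 - f[:, t]) * z[:, t]
            outputs.append(o[:, t] * c)
        return torch.stack(outputs, dim=1)  # (batch, time, hidden_size)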

In "An Analysis of Neural Language Modeling at Multiple Scales", QRNN was applied to the WikiText-103 dataset with the following parameter settings:

  • Number of epochs (--epochs): 14
  • Number of layers (--nlayers): 4
  • Size of word embeddings (--emsize): 400
  • Number of hidden units per layer (--nhid): 2500
  • Alpha L2 regularization on RNN activation (--alpha): 0
  • Beta slowness regularization applied on RNN activation (--beta): 0
  • Dropout to remove words from embedding layer (--dropoute): 0
  • Dropout for rnn layers (--dropouth): 0.1
  • Dropout for input embedding layers (--dropouti): 0.1
  • Dropout applied to layers (--dropout): 0.1
  • Amount of weight dropout to apply to the RNN hidden to hidden matrix (--wdrop): 0
  • Weight decay applied to all weights (--wdecay): 0
  • Sequence length (--bptt): 140
  • Batch size (--batch_size): 60
  • Optimizer to use (--optimizer): adam
  • Learning rate (--lr): 1e-3

With this setting, QRNN on the WikiText-103 dataset achieves: total parameters: 153,886,638; test perplexity: 32.58.

Parameter Tuning

We focus on reducing the number of parameters of the model in two ways: by changing the sequence length (--bptt), and by changing the embedding size and number of hidden units per layer (--emsize, --nhid).

Sequence length in training

The QRNN model has a sequence-length parameter, "--bptt". During training, all training data is concatenated into one long sequence, which is then split into sub-sequences whose length is the value of "--bptt". We varied this value to analyze its effect on perplexity. The recommended value is 140; the default value is 70. The figure shows how perplexity changes with sequence length.
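To make the role of --bptt concrete, here is a minimal sketch in the style of the usual PyTorch word-level language-model data pipeline; the helper names are illustrative and not necessarily the ones used in this codebase. The corpus is laid out as batch_size parallel token streams, and each training step consumes a --bptt-long slice of every stream.

import torch

def batchify(data, batch_size):
    # Trim the 1-D token stream so it divides evenly, then lay it out as
    # (stream_length, batch_size): each column is one contiguous text stream.
    nbatch = data.size(0) // batch_size
    data = data.narrow(0, 0, nbatch * batch_size)
    return data.view(batch_size, -1).t().contiguous()

def get_batch(source, i, bptt):
    # One step consumes a slice of length --bptt; the targets are the same
    # tokens shifted by one position (next-word prediction).
    seq_len = min(bptt, source.size(0) - 1 - i)
    data = source[i:i + seq_len]
    target = source[i + 1:i + 1 + seq_len].view(-1)
    return data, target

For example, with --bptt 300 and --batch_size 40 (the values used below), each step trains on 40 parallel slices of 300 tokens.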

Embedding size and number of hidden units per layer

With these two parameters, we aim to reduce the number of parameters of the model while keeping test perplexity below 35. The default model uses an embedding size of 400 and 2500 hidden units per layer; with this setting it has about 153M parameters and reaches a perplexity of 32.58. Reducing the number of parameters increases perplexity, so we balance the two to keep perplexity under 35. To reduce parameters, the embedding size was decreased to 300, a popular choice for embedding sizes in deep learning models, and the number of hidden units per layer was adjusted to fit. The table below shows some of our experiments with these changes.

--emsize  --nhid  --bptt  --epochs  #parameters(*)  Test ppl
300       2000    140     20        110,007,135     34.31
300       1500    140     20         98,152,635     36.46
300       1750    140     20        103,704,885     35.37
300       2000    200     20        110,007,135     34.14
300       1850    200     20        106,135,785     34.84
300       1800    300     20        104,905,335     34.71
(*) Excluding 903 parameters from the training criterion/loss function.

Lowering --nhid reduces parameters but raises perplexity, while a longer --bptt recovers part of the loss. Based on these experiments, we chose --emsize 300 --nhid 1800 --bptt 300 as the best configuration.
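As a quick sanity check, using only the parameter counts reported above, the chosen configuration reproduces the roughly 32% reduction stated in the Overview:

baseline_params = 153_886_638   # default QRNN: --emsize 400, --nhid 2500
tuned_params = 104_905_335      # chosen configuration: --emsize 300, --nhid 1800
reduction = 1 - tuned_params / baseline_params
print(f"parameter reduction: {reduction:.1%}")   # about 31.8%

The parameter storage and operation counts of this model break down as follows: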

op_name        params(MBytes)  mults(M)  adds(M)   MFLOPS
embedding      160.641           0.000     0.000     0.000
block_qrnn      48.634          12.176    24.334    36.510
block_decoder    0.535          40.294    80.053   120.347
total          209.811          52.470   104.387   156.857

We used the "freebie" quantization (16-bit parameters and multiplication inputs, 32-bit accumulators) when counting:

import counting  # counting.py from the MicroNet Challenge scoring code

# "ops" is the list of operations describing our model (built elsewhere in this repo).
counter = counting.MicroNetCounter(ops, add_bits_base=32, mul_bits_base=32)
INPUT_BITS = 16              # inputs to multiplications counted at 16 bits ("freebie")
ACCUMULATOR_BITS = 32        # additions/accumulators kept at 32 bits
PARAMETER_BITS = INPUT_BITS  # parameters stored at 16 bits
SUMMARIZE_BLOCKS = False
# First argument is the sparsity (0: no pruning claimed).
counter.print_summary(0, PARAMETER_BITS, ACCUMULATOR_BITS, INPUT_BITS, summarize_blocks=SUMMARIZE_BLOCKS)
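As a back-of-the-envelope check (using only values already reported above, not the counting script itself), 16-bit parameter storage and the mult/add totals reproduce the headline numbers:

PARAMETER_BITS = 16                  # "freebie" quantization
total_params = 104_905_335           # parameters of the chosen model (criterion excluded)
param_mbytes = total_params * PARAMETER_BITS / 8 / 1e6
print(f"params: {param_mbytes:.3f} MBytes")   # 209.811, matching the table

mults_m, adds_m = 52.470, 104.387    # totals from the table above
print(f"MFLOPS: {mults_m + adds_m:.3f}")      # 156.857, matching the table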

System Configuration

Software Requirements (codebase)

  • Python 3.7
  • PyTorch 0.4
  • pynvrtc (NVIDIA's Python Bindings to NVRTC) (pip install git+git://github.com/NVIDIA/pynvrtc/commit/6417a2896ff8a99f2c4d4195de657671a77c89a0)

Training

python -u main.py --epochs 20 --nlayers 4 --emsize 300 --nhid 1800 --alpha 0 --beta 0 --dropoute 0 --dropouth 0.1 --dropouti 0.1 --dropout 0.1 --wdrop 0 --wdecay 0 --bptt 300 --batch_size 40 --optimizer adam --lr 1e-3 --data data/wikitext-103 --save WT103.12hr.QRNN.pt --when 12 --model QRNN

Checkpoint Test

python test.py
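test.py itself is not reproduced here. As a rough illustration (a hypothetical sketch with illustrative names, omitting details such as the recurrent hidden state and the split softmax criterion), a checkpoint test accumulates the average next-word cross-entropy over the test stream and reports its exponential as perplexity, which must stay below 35:

import math
import torch

def evaluate(model, criterion, test_data, bptt):
    # Average next-word cross-entropy over the test stream;
    # perplexity = exp(mean loss).
    model.eval()
    total_loss, total_tokens = 0.0, 0
    with torch.no_grad():
        for i in range(0, test_data.size(0) - 1, bptt):
            data, targets = get_batch(test_data, i, bptt)  # see the earlier sketch
            log_probs = model(data)  # assumed shape: (tokens, vocab_size)
            loss = criterion(log_probs.view(-1, log_probs.size(-1)), targets)
            total_loss += loss.item() * targets.numel()
            total_tokens += targets.numel()
    return math.exp(total_loss / total_tokens)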

Members

  • Associate Professor Nguyen Le Minh (Jaist)
  • Professor Tomoko Matsui (ISM)
  • Ph.D. Tran Duc Vu (Jaist)
  • MS. Nguyen Ha Thanh (Jaist)
  • MS. Dang Tran Binh (Jaist)