Transformer-Translation

1. Description

  1. Transformer model for translation (Attention Is All You Need)
  2. This code was written with reference to the implementations by DongjunLee, Kyubyong, and the official TensorFlow code
  3. Trained on a single GPU (GeForce GTX 1080 Ti)


2. Data

Vietnamese-to-English translation data (IWSLT'15)

3. Train

1. Git clone

$ git clone https://github.com/MSWon/Transformer-Translation.git

2. Training with user settings

$ python train_transformer.py --num_layers 4 --num_heads 8 --batch_size 64 --training_epochs 25
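train_transformer.py presumably reads these options with argparse; a minimal sketch of such a parser (illustrative only, not the repo's actual code):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--num_layers", type=int, default=6)       # encoder/decoder blocks
parser.add_argument("--num_heads", type=int, default=8)        # attention heads per layer
parser.add_argument("--batch_size", type=int, default=64)
parser.add_argument("--training_epochs", type=int, default=25)
args = parser.parse_args()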

Training tips

  1. When training a Transformer, it is important to handle the PAD tokens introduced by padding sequences to max_length.
  2. To do this, the Transformer applies appropriate masking in the Encoder self-attention, the Decoder self-attention, and the Encoder-Decoder attention.

1. Encoder mask

In the Encoder, PAD positions are masked with -inf so that they receive a weight of 0 after the softmax.

model_utils.py

import tensorflow as tf
import numpy as np

_NEG_INF_FP32 = -1e9                        # effectively -inf for float32 masking
_NEG_INF_FP16 = np.finfo(np.float16).min    # effectively -inf for float16 masking

def get_padding(x, padding_value=0, dtype=tf.float32):
  with tf.name_scope("padding"):
    # 1.0 at PAD positions, 0.0 everywhere else
    return tf.cast(tf.equal(x, padding_value), dtype)

model_utils.py

def get_padding_bias(x):
  with tf.name_scope("attention_bias"):
    padding = get_padding(x)                    # [batch_size, length]
    attention_bias = padding * _NEG_INF_FP32    # -1e9 at PAD positions
    # Expand to [batch_size, 1, 1, length] so the bias broadcasts over heads and query positions
    attention_bias = tf.expand_dims(tf.expand_dims(attention_bias, axis=1), axis=1)
  return attention_bias
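A quick illustration with toy values (assuming token id 0 is PAD; not taken from the repo):

x = tf.constant([[5, 3, 0, 0]])    # one sentence with two PAD tokens
bias = get_padding_bias(x)         # shape [1, 1, 1, 4]
# bias == [[[[0., 0., -1e9, -1e9]]]]; added to the attention logits,
# the PAD keys get ~0 probability after the softmax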

2. Decoder mask

Since the Decoder must not see the future, future positions are masked with -inf so that they receive a weight of 0 after the softmax.

model_utils.py

def get_decoder_self_attention_bias(length, dtype=tf.float32):
  neg_inf = _NEG_INF_FP16 if dtype == tf.float16 else _NEG_INF_FP32
  with tf.name_scope("decoder_self_attention_bias"):
    # Lower-triangular matrix: position i may only attend to positions <= i
    valid_locs = tf.linalg.band_part(tf.ones([length, length], dtype=dtype), -1, 0)
    valid_locs = tf.reshape(valid_locs, [1, 1, length, length])
    # -inf above the diagonal masks out the future positions
    decoder_bias = neg_inf * (1.0 - valid_locs)
  return decoder_bias
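For example, with length = 3 the mask looks like this (values shown for illustration):

bias = get_decoder_self_attention_bias(3)   # shape [1, 1, 3, 3]
# valid_locs (lower triangle)      resulting decoder_bias
# [[1. 0. 0.]                      [[   0. -1e9 -1e9]
#  [1. 1. 0.]              ->       [   0.    0. -1e9]
#  [1. 1. 1.]]                      [   0.    0.    0.]]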

3. Encoder-Decoder attention mask

In the Encoder-Decoder attention, the PAD positions of the key and value vectors (which come from the encoder) are masked with -inf.

This reuses the same get_padding_bias from model_utils.py shown above, applied to the encoder inputs.
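For reference, a minimal sketch of where these biases enter scaled dot-product attention (names are illustrative, not taken from the repo):

def scaled_dot_product_attention(q, k, v, bias, depth):
    # q: [batch, heads, q_len, depth], k and v: [batch, heads, kv_len, depth]
    logits = tf.matmul(q, k, transpose_b=True) / (depth ** 0.5)
    logits += bias                    # -1e9 entries vanish in the softmax
    weights = tf.nn.softmax(logits)   # attention distribution over keys
    return tf.matmul(weights, v)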

4. Feed-forward network

  1. At the input of the Encoder's feed-forward network, PAD positions are removed with tf.gather_nd, as shown below.
  2. When producing the output, tf.scatter_nd fills the PAD positions back in with zero vectors.


layer.py

def dense_relu_dense(self, inputs, padding=None):
    if padding is not None:
        with tf.name_scope("remove_padding"):
            batch_size = tf.shape(padding)[0]
            length = tf.shape(padding)[1]
            # Flatten padding to [batch_size*length]
            pad_mask = tf.reshape(padding, [-1])
            nonpad_ids = tf.to_int32(tf.where(pad_mask < 1e-9))
            # Reshape inputs to [batch_size*length, hidden_size] and drop the PAD rows
            inputs = tf.reshape(inputs, [-1, self.w2_dim])
            inputs = tf.gather_nd(inputs, indices=nonpad_ids)
            # Restore a (dummy) batch dimension: [1, num_nonpad, hidden_size]
            inputs = tf.expand_dims(inputs, axis=0)

    output = tf.layers.dense(inputs, self.w1_dim, activation=tf.nn.relu)
    output = tf.nn.dropout(output, 1.0 - self.dropout)
    output = tf.layers.dense(output, self.w2_dim)
    if padding is not None:
        with tf.name_scope("re_add_padding"):
            output = tf.squeeze(output, axis=0)
            # Scatter the non-PAD rows back; PAD rows stay as zero vectors
            output = tf.scatter_nd(indices=nonpad_ids,
                                   updates=output,
                                   shape=[batch_size * length, self.w2_dim])
            output = tf.reshape(output, [batch_size, length, self.w2_dim])
    return output
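A hedged usage sketch, reusing get_padding from above (ffn_layer, input_ids, and encoder_hidden are hypothetical names):

padding = get_padding(input_ids)    # [batch, length], 1.0 at PAD positions
outputs = ffn_layer.dense_relu_dense(encoder_hidden, padding=padding)
# PAD rows skip the two dense layers entirely and come back as zero vectors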

5. Shared embedding matrix

The Transformer uses the same embedding matrix in the Encoder, the Decoder, and the output softmax layer.


transformer_model.py

def build_embed(self, inputs, encoder=True, reuse=False):
    with tf.variable_scope("Embeddings", reuse=reuse, initializer=tf.contrib.layers.xavier_initializer()):
        # Word embedding matrix, shared across encoder, decoder and output layer
        self.shared_weights = tf.get_variable('shared_weights', [self.vocab, self.emb_dim], dtype=tf.float32)

        if encoder:
            max_seq_length = self.max_enc_len
        else:
            max_seq_length = self.max_dec_len

        # Positional encoding
        with tf.variable_scope("positional-encoding"):
            positional_encoded = model_utils.get_position_encoding(max_seq_length,
                                                                   self.emb_dim)
        batch_size = tf.shape(inputs)[0]
        mask = tf.to_float(tf.not_equal(inputs, 0))

        # Add word embeddings and positional encodings
        word_emb = tf.nn.embedding_lookup(self.shared_weights, inputs)  # [batch_size, length, dim]
        word_emb *= tf.expand_dims(mask, -1)  # zero out PAD positions
        word_emb *= self.emb_dim ** 0.5       # scale embeddings by sqrt of the hidden size
        position_inputs = tf.tile(tf.range(0, max_seq_length), [batch_size])
        position_inputs = tf.reshape(position_inputs, [batch_size, max_seq_length])
        position_emb = tf.nn.embedding_lookup(positional_encoded, position_inputs)
        encoded_inputs = tf.add(word_emb, position_emb)
        return tf.nn.dropout(encoded_inputs, 1.0 - self.dropout)
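The third use of shared_weights, the output softmax projection, is typically a matmul against the transposed embedding matrix; a sketch under that assumption (not copied from the repo):

def build_output_logits(self, decoder_output):
    # decoder_output: [batch_size, length, emb_dim]
    batch_size = tf.shape(decoder_output)[0]
    length = tf.shape(decoder_output)[1]
    flat = tf.reshape(decoder_output, [-1, self.emb_dim])
    # Reuse the embedding matrix as the softmax weights
    logits = tf.matmul(flat, self.shared_weights, transpose_b=True)
    return tf.reshape(logits, [batch_size, length, self.vocab])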

6. Learning rate

The Transformer increases the learning rate linearly for the first warmup_steps (4000) updates and then decays it in proportion to the inverse square root of the step number.


transformer_model.py

def noam_scheme(self, d_model, global_step, warmup_steps=4000):
    step = tf.cast(global_step + 1, dtype=tf.float32)
    # Linear warmup up to warmup_steps, then decay proportional to step**-0.5
    return d_model ** (-0.5) * tf.minimum(step * warmup_steps ** -1.5, step ** -0.5)
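A minimal sketch of plugging the schedule into the optimizer, with the Adam hyperparameters from the paper (assumed here, not read from the repo):

global_step = tf.train.get_or_create_global_step()
learning_rate = self.noam_scheme(d_model=512, global_step=global_step)
optimizer = tf.train.AdamOptimizer(learning_rate, beta1=0.9, beta2=0.98, epsilon=1e-9)
train_op = optimizer.minimize(loss, global_step=global_step)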

7. Results

  1. Result with 4 layers, 8 heads, and a 512 hidden dimension (BLEU: 24.74)
  2. Result with 6 layers, 8 heads, a 512 hidden dimension, and 16000 warmup steps (BLEU: 27.57)
  3. Beam search is not implemented yet.

1. Test loss

(figure: test loss curve)

2. Test BLEU score

(figure: test BLEU score curve)

3. Example

Encoder Input ===> Tôi đã được lớn lên ở một quốc gia đã bị tiêu huỷ bởi bao thập niên chiến tranh .
Decoder True ===> I was raised in a country that has been destroyed by decades of war .
Decoder Pred ===> I grew up in a country that was <unk> by the war .
==========================================================================================
Encoder Input ===> Họ đều là nạn nhân của tổn thương , bệnh tật và bạo lực .
Decoder True ===> All of them are victim to injury , illness and violence .
Decoder Pred ===> They were all the victims of vulnerability , disease and violence .
==========================================================================================
Encoder Input ===> Và sau một vài tháng , tôi nhận ra rằng em khác biệt .
Decoder True ===> And after a few months went by , I realized that he was different .
Decoder Pred ===> And after a few months , I realized that you were different .