papago_deeplearning_test

Properties

Python 3.6

tensorflow 2.3.0

Sections

1. Experimental Design

2. Evaluation Metrics

3. Experimental Results

1. Experimental Design and Data Exploration Results

Data exploration

  • samples
    Train data set: 7260
    Test data set : 2000

  • sequence length distributions of each set
    Train input, Train target, Test input, Test target
    (figures omitted)

  • number of words in each set
    Train dataset : 55
    Test dataset : 609

  • max length of input
    Train dataset : 83
    Test dataset : 86

  • max length of outputs
    Train dataset : 56
    Test dataset : 56
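The target sequences are later zero padded to the max output length (56). A minimal sketch of that padding step in NumPy, assuming integer token ids (`pad_to_max` and the sample ids are hypothetical, not from the actual data set):

```python
import numpy as np

def pad_to_max(seqs, max_len, pad_id=0):
    """Zero-pad each token-id sequence on the right to max_len."""
    out = np.full((len(seqs), max_len), pad_id, dtype=np.int64)
    for i, s in enumerate(seqs):
        out[i, :len(s)] = s[:max_len]
    return out

# Hypothetical token-id sequences, padded to the target max length of 56.
batch = pad_to_max([[5, 9, 2], [7, 4]], max_len=56)
```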

For experiments, I used two basic models.

    1. BERT context vector with stacked GRU decoders
    2. Transformer

1-1. BERT context vector and stacked GRU decoders.

Because the data set is small, I used 6 multi-head attention layers instead of 12.

To extract vector-space representations of natural language, I compared two scenarios:

scenario 1) load pretrained weights

scenario 2) train from scratch

(figure omitted)

I used Hugging Face's TFBertModel for ease of implementation.

I used 3 stacked GRU layers as the decoder to generate text.

(figure omitted)
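The decoder stacks three GRU layers. For reference, one GRU recurrence step written out in NumPy (weight names are illustrative and biases are omitted; this is the textbook formulation, not the exact Keras variant used in the experiments):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU recurrence step: gates decide how much state to keep."""
    z = sigmoid(x @ Wz + h @ Uz)               # update gate
    r = sigmoid(x @ Wr + h @ Ur)               # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)   # candidate state
    return (1.0 - z) * h + z * h_tilde         # interpolate old/new state

rng = np.random.default_rng(0)
d_in, d_h = 8, 16                              # illustrative sizes
x = rng.normal(size=(1, d_in))
h = np.zeros((1, d_h))                         # initial hidden state
params = [rng.normal(size=s) for s in [(d_in, d_h), (d_h, d_h)] * 3]
h_next = gru_step(x, h, *params)
```

A stacked decoder simply feeds each layer's hidden states as the next layer's inputs, three times.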

1-2. Transformer model

The Transformer's strength is self-attention, which attends to different positions of the input sequence to compute its representation.

Stacked self-attention: scaled dot-product attention and multi-head attention

Scaled dot product attention

(figure omitted)

  • scaled by square root of the depth
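A NumPy sketch of scaled dot-product attention, including the square-root-of-depth scaling (the shapes and the additive mask convention are assumptions for illustration):

```python
import numpy as np

def scaled_dot_product_attention(q, k, v, mask=None):
    """softmax(q k^T / sqrt(depth)) v, optionally masked."""
    depth = q.shape[-1]
    logits = q @ k.swapaxes(-1, -2) / np.sqrt(depth)   # scale by sqrt(depth)
    if mask is not None:
        logits = logits + mask * -1e9                  # masked positions -> ~0 weight
    weights = np.exp(logits - logits.max(-1, keepdims=True))  # stable softmax
    weights = weights / weights.sum(-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(2, 4, 8))   # (batch, query positions, depth)
k = rng.normal(size=(2, 6, 8))   # (batch, key positions, depth)
v = rng.normal(size=(2, 6, 8))
out, w = scaled_dot_product_attention(q, k, v)
```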

Multi-head attention

(figure omitted)

Multi-head attention consists of three parts:

1. linear layers, 2. scaled dot-product attention, 3. a final linear layer

Query, key, and value are the inputs; each is put through a linear layer before the scaled dot-product attention is applied.
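Between the linear layers, each tensor is split into heads and merged back afterward. A small NumPy sketch of just that head bookkeeping (the learned projections are omitted; the sizes are illustrative):

```python
import numpy as np

def split_heads(x, num_heads):
    """(batch, seq, d_model) -> (batch, heads, seq, d_model // heads)."""
    b, s, d = x.shape
    return x.reshape(b, s, num_heads, d // num_heads).transpose(0, 2, 1, 3)

def merge_heads(x):
    """Inverse of split_heads: concatenate the heads back together."""
    b, h, s, dh = x.shape
    return x.transpose(0, 2, 1, 3).reshape(b, s, h * dh)

x = np.arange(2 * 5 * 12, dtype=float).reshape(2, 5, 12)
heads = split_heads(x, num_heads=6)   # attention runs per head on depth 2
```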

Encoder

Multi-head attention + pointwise feed forward network
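The point-wise feed-forward network is two dense layers applied identically at every position. A NumPy sketch with illustrative sizes (d_model = 12 and d_ff = 48 are assumptions, not the values used in the experiments):

```python
import numpy as np

def pointwise_ffn(x, W1, b1, W2, b2):
    """Two dense layers applied independently at each sequence position."""
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2   # ReLU between the layers

rng = np.random.default_rng(0)
d_model, d_ff = 12, 48
x = rng.normal(size=(2, 5, d_model))                # (batch, seq, d_model)
W1 = rng.normal(size=(d_model, d_ff)); b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)); b2 = np.zeros(d_model)
y = pointwise_ffn(x, W1, b1, W2, b2)                # same shape as x
```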

Decoder

Masked multi-head attention + multi-head attention + pointwise feed forward network
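The masked multi-head attention in the decoder uses a look-ahead mask so a position cannot attend to future tokens. A minimal sketch of that mask (1 marks a blocked position):

```python
import numpy as np

def look_ahead_mask(size):
    """Upper-triangular mask: 1 marks future positions to block."""
    return np.triu(np.ones((size, size)), k=1)

m = look_ahead_mask(4)   # row i may attend only to positions <= i
```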

2. Evaluation Metrics

Accuracy

This experiment used accuracy as the evaluation metric.

The target sequence is zero padded to match the max length.

Accuracy can be misleading here because the many zero-padding tokens make the token distribution unbalanced; it was still used because the model was trained without masking the zero-padding tokens in the target sequence.
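To illustrate the concern, a small hypothetical example comparing plain accuracy with an accuracy that masks out padding (the token ids are made up, not from the data set):

```python
import numpy as np

# Hypothetical 10-token target: 4 real tokens followed by 6 padding zeros.
target = np.array([5, 9, 2, 7, 0, 0, 0, 0, 0, 0])
pred   = np.array([5, 1, 2, 3, 0, 0, 0, 0, 0, 0])  # 2 of the 4 real tokens correct

plain_acc = np.mean(pred == target)            # padding matches inflate this
mask = target != 0
masked_acc = np.mean((pred == target)[mask])   # real tokens only
```

Here plain accuracy credits every padding position the model trivially predicts as zero, while the masked variant scores only the real tokens.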

3. Experimental Results

1st model experimental results.

scenario 1 evaluation results.

loss function : categorical crossentropy

loss : 1784.4952

accuracy : 0.2438

test loss : 1534.7498

test accuracy : 0.2537

scenario 2 evaluation results.

loss function : categorical crossentropy

loss : 1825.4198

accuracy : 0.2274

test loss : 1795.3251

test accuracy : 0.2537

2nd model experimental results.

Optimizer : Adam with beta_1 = 0.1, beta_2 = 0.1, and a learning rate decaying exponentially by 0.9 from an initial value of 0.00001
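For reference, the stated decay schedule can be sketched as follows (how often a decay "step" occurs, e.g. per epoch or per batch, is an assumption the report does not specify):

```python
# Exponential learning-rate decay: lr = 1e-5 * 0.9 ** step
def lr_at(step, initial=1e-5, rate=0.9):
    """Learning rate after `step` decay intervals."""
    return initial * rate ** step

lrs = [lr_at(s) for s in range(3)]   # first three decayed learning rates
```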

loss function : categorical crossentropy

loss : 1.2822

accuracy : 0.8250

test loss : 1.2836

test accuracy : 0.8249

plots of loss and accuracy

plot of loss

(figure omitted)

plot of accuracy

(figure omitted)

Test accuracy is slightly higher than train accuracy.

Additional Experiment

After training the model, I ran an additional experiment. As a result, the 2nd model's final results improved considerably.

loss : 0.9736

accuracy : 0.8507

test loss : 0.9210

test accuracy : 0.8490

These results suggest that the 2nd model has the potential to be a good translation model.

(figure omitted)

Improved 2nd model weights link

https://drive.google.com/drive/folders/1hTlrdRGp9zzNuo5SVNTD7ek9rAE5eStx?usp=sharing