tensorflow/text

[Transformer Tutorial] Why is the activation function of the output layer linear instead of softmax?

Hi,

Thank you very much for providing such a detailed and comprehensive tutorial (https://www.tensorflow.org/text/tutorials/transformer). I have reconstructed a similar model on my own and successfully trained it on my own machine.
However, I am curious why the activation function of the output layer is linear, when the original article uses softmax (https://www.tensorflow.org/text/tutorials/transformer#the_transformer).
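For context, this is roughly how I set up the output layer and loss when reproducing the tutorial (a sketch from my reimplementation; `target_vocab_size` is a placeholder, and `masked_loss` is paraphrased from the tutorial, which passes `from_logits=True`):

```python
import tensorflow as tf

target_vocab_size = 7010  # placeholder; use your target tokenizer's vocab size

# Final projection back to the vocabulary -- no activation argument, so the
# model outputs raw logits, as in the tutorial.
final_layer = tf.keras.layers.Dense(target_vocab_size)

# The tutorial's masked loss treats the model output as logits:
def masked_loss(label, pred):
    mask = label != 0
    loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True, reduction='none')
    loss = loss_object(label, pred)
    mask = tf.cast(mask, dtype=loss.dtype)
    loss *= mask
    return tf.reduce_sum(loss) / tf.reduce_sum(mask)
```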
Below are the training results for the first 3 epochs, comparing the different output-layer activation functions I tried:

3 epochs, output w/ linear activation
810/810 [==============================] - 151s 180ms/step - loss: 6.6023 - masked_accuracy: 0.1445 - val_loss: 5.0209 - val_masked_accuracy: 0.2561
810/810 [==============================] - 142s 175ms/step - loss: 4.5676 - masked_accuracy: 0.3001 - val_loss: 4.0427 - val_masked_accuracy: 0.3591
810/810 [==============================] - 140s 172ms/step - loss: 3.8247 - masked_accuracy: 0.3805 - val_loss: 3.4593 - val_masked_accuracy: 0.4286
3 epochs, output w/ softmax activation
810/810 [==============================] - 146s 175ms/step - loss: 6.2600 - masked_accuracy: 0.0500 - val_loss: 6.2228 - val_masked_accuracy: 0.0494
810/810 [==============================] - 140s 173ms/step - loss: 6.1739 - masked_accuracy: 0.0619 - val_loss: 6.6027 - val_masked_accuracy: 0.0181
810/810 [==============================] - 150s 185ms/step - loss: 6.0836 - masked_accuracy: 0.0801 - val_loss: 7.1618 - val_masked_accuracy: 0.0221
3 epochs, output w/ ReLU activation
810/810 [==============================] - 148s 177ms/step - loss: 6.5729 - masked_accuracy: 0.0555 - val_loss: 6.6944 - val_masked_accuracy: 0.0490
810/810 [==============================] - 153s 188ms/step - loss: 6.4000 - masked_accuracy: 0.0851 - val_loss: 6.5895 - val_masked_accuracy: 0.1009
810/810 [==============================] - 148s 183ms/step - loss: 6.1430 - masked_accuracy: 0.1430 - val_loss: 6.1474 - val_masked_accuracy: 0.1458
3 epochs, output w/ tanh activation
810/810 [==============================] - 158s 189ms/step - loss: 7.6309 - masked_accuracy: 0.0427 - val_loss: 7.6151 - val_masked_accuracy: 0.0429
810/810 [==============================] - 147s 181ms/step - loss: 7.6147 - masked_accuracy: 0.0443 - val_loss: 7.6135 - val_masked_accuracy: 0.0493
810/810 [==============================] - 153s 189ms/step - loss: 7.6141 - masked_accuracy: 0.0455 - val_loss: 7.6140 - val_masked_accuracy: 0.0490
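For reference, the only change between the four runs above was the `activation` argument on that final Dense layer (again a sketch, with `target_vocab_size` as a placeholder; `None` is Keras's linear default):

```python
import tensorflow as tf

target_vocab_size = 7010  # same placeholder as above

# One run per activation; activation=None gives the linear (logits) output.
for act in (None, 'softmax', 'relu', 'tanh'):
    final_layer = tf.keras.layers.Dense(target_vocab_size, activation=act)
    # ...rebuild the Transformer with this final layer and train 3 epochs...
```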

Sincerely,