The Learning Rate in 5-2.BERT must be reduced.
Cheng0829 opened this issue · 0 comments
Cheng0829 commented
In Line 209:
optimizer = optim.Adam(model.parameters(), lr=0.001)
In practice, with a learning rate of 0.001 this BERT model reliably gets stuck in a poor local minimum;
I think the learning rate should be reduced to 0.0001.
In my experiments, with a learning rate of 0.0001 the loss drops to about 0.1 after roughly 100 epochs, whereas with 0.001 the loss almost never falls below 2.0.
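For reference, a minimal sketch of the proposed change (assuming the rest of the script is unchanged and `model` is the BERT model built earlier in 5-2.BERT):

import torch.optim as optim

# Before (line 209): optimizer = optim.Adam(model.parameters(), lr=0.001)
# Proposed: reduce the Adam learning rate by 10x so the loss keeps decreasing
optimizer = optim.Adam(model.parameters(), lr=0.0001)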
when lr=0.001
Epoch: 0010 cost = 15.205759
Epoch: 0020 cost = 16.236261
Epoch: 0030 cost = 18.436878
Epoch: 0040 cost = 4.077913
Epoch: 0050 cost = 12.703120
Epoch: 0060 cost = 10.411244
Epoch: 0070 cost = 1.640913
Epoch: 0080 cost = 10.753708
Epoch: 0090 cost = 8.370532
Epoch: 0100 cost = 1.624577
Epoch: 0110 cost = 8.537676
Epoch: 0120 cost = 7.453298
Epoch: 0130 cost = 1.659591
Epoch: 0140 cost = 7.092763
Epoch: 0150 cost = 6.843360
Epoch: 0160 cost = 1.688111
Epoch: 0170 cost = 6.052425
Epoch: 0180 cost = 6.395712
Epoch: 0190 cost = 1.707749
Epoch: 0200 cost = 5.263054
...
Epoch: 5000 cost = 2.523541
when lr=0.0001
Epoch: 0010 cost = 13.998453
Epoch: 0020 cost = 6.168099
Epoch: 0030 cost = 3.504844
Epoch: 0040 cost = 2.312538
Epoch: 0050 cost = 1.723783
Epoch: 0060 cost = 1.412463
Epoch: 0070 cost = 0.930549
Epoch: 0080 cost = 0.671946
Epoch: 0090 cost = 0.745429
Epoch: 0100 cost = 0.139699
Epoch: 0110 cost = 0.187208
Epoch: 0120 cost = 0.075726