stanfordnlp/GloVe

cost jumps after some epochs

rpvelloso opened this issue · 7 comments

This happened on different machines, at different points during training. I couldn't find where it happens in the source code.
Can't share the data :(

TRAINING MODEL
Read 563738072 lines.
Initializing parameters...Using random seed 1635360331
done.
vector size: 600
vocab size: 85748
x_max: 100
alpha: 0.75

10/27/21 - 03:59.19PM, iter: 1, cost: 0.071133
10/27/21 - 04:13.09PM, iter: 2, cost: 0.0551224
10/27/21 - 04:27.28PM, iter: 3, cost: 0.0505688
10/27/21 - 04:41.43PM, iter: 4, cost: 0.0454685
10/27/21 - 04:55.38PM, iter: 5, cost: 0.0416584
10/27/21 - 05:09.45PM, iter: 6, cost: 0.0392608
10/27/21 - 05:23.49PM, iter: 7, cost: 0.0376295
10/27/21 - 05:37.49PM, iter: 8, cost: 0.0365679
10/27/21 - 05:51.59PM, iter: 9, cost: 0.0359734
10/27/21 - 06:05.49PM, iter: 10, cost: 34.4493

Do you have any idea what might be happening here? I suspect it's something to do with cost[], but I can't pinpoint it.

lowering eta to 0.01 helped:

10/27/21 - 07:38.53PM, iter: 1, cost: 0.0874979
10/27/21 - 07:53.09PM, iter: 2, cost: 0.069715
10/27/21 - 08:07.01PM, iter: 3, cost: 0.0608068
10/27/21 - 08:20.54PM, iter: 4, cost: 0.0554227
10/27/21 - 08:34.46PM, iter: 5, cost: 0.0514724
10/27/21 - 08:48.39PM, iter: 6, cost: 0.048066
10/27/21 - 09:02.32PM, iter: 7, cost: 0.0448908
10/27/21 - 09:16.53PM, iter: 8, cost: 0.0418435
10/27/21 - 09:31.28PM, iter: 9, cost: 0.0389835
10/27/21 - 09:46.21PM, iter: 10, cost: 0.0364648
10/27/21 - 10:01.12PM, iter: 11, cost: 0.0344151
10/27/21 - 10:16.08PM, iter: 12, cost: 0.0328401
10/27/21 - 10:31.01PM, iter: 13, cost: 0.0316272
10/27/21 - 10:45.54PM, iter: 14, cost: 0.0306751
10/27/21 - 11:00.39PM, iter: 15, cost: 0.0299081

Maybe that was it: the learning rate was too large (I was using the default value, 0.05).
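For reference, the per-co-occurrence update is roughly the following (my simplified sketch, not the actual glove.c source; names like `gradsq_i`/`gradsq_j` are illustrative). eta scales every parameter step, so a single pair with a large error at eta = 0.05 can push the weights far enough that the squared error blows up on a later epoch:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Simplified sketch of the kind of AdaGrad step GloVe performs per pair (i, j).
// The pair's loss contribution is roughly 0.5 * f(x_ij) * diff^2, and eta
// scales every parameter update below. Bias updates are omitted for brevity.
void update_pair(std::vector<double>& w_i, std::vector<double>& w_j,
                 std::vector<double>& gradsq_i, std::vector<double>& gradsq_j,
                 double b_i, double b_j,
                 double x_ij, double x_max, double alpha, double eta) {
    double dot = 0.0;
    for (std::size_t k = 0; k < w_i.size(); ++k) dot += w_i[k] * w_j[k];
    double diff  = dot + b_i + b_j - std::log(x_ij);   // prediction error
    double fdiff = diff * (x_ij > x_max ? 1.0 : std::pow(x_ij / x_max, alpha));
    for (std::size_t k = 0; k < w_i.size(); ++k) {
        double g_i = fdiff * w_j[k];                    // gradient w.r.t. w_i[k]
        double g_j = fdiff * w_i[k];                    // gradient w.r.t. w_j[k]
        w_i[k] -= eta * g_i / std::sqrt(gradsq_i[k]);   // AdaGrad-scaled step
        w_j[k] -= eta * g_j / std::sqrt(gradsq_j[k]);
        gradsq_i[k] += g_i * g_i;                       // accumulate squared gradients
        gradsq_j[k] += g_j * g_j;
    }
}
```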

I've made some changes to deal with this issue during training:

TRAINING MODEL
Read 455065004 lines.
Initializing parameters...Using random seed 1635978159
done.
vector size: 600
vocab size: 85748
x_max: 100
alpha: 0.75
epochs: 40
eta: 0.05
11/03/21 - 07:39.33PM, iter: 1, cost: 0.0823551
11/03/21 - 07:56.42PM, iter: 2, cost: 0.064717
11/03/21 - 08:13.47PM, iter: 3, cost: 0.0599455
11/03/21 - 08:30.48PM, iter: 4, cost: 0.0536976
11/03/21 - 08:47.54PM, iter: 5, cost: 0.0487482
11/03/21 - 09:05.05PM, iter: 6, cost: 0.0456012
11/03/21 - 09:22.10PM, iter: 7, cost: 0.0435627
11/03/21 - 09:39.50PM, iter: 8, cost: 0.042235
11/03/21 - 09:56.59PM, iter: 9, cost: 43.8812
11/03/21 - 09:56.59PM cost increased, restoring last training checkpoint and lowering eta from 0.05 to 0.025
11/03/21 - 10:14.01PM, iter: 9, cost: 0.0361162
11/03/21 - 10:31.02PM, iter: 10, cost: 0.033401
11/03/21 - 10:48.04PM, iter: 11, cost: 0.0325761
11/03/21 - 11:05.09PM, iter: 12, cost: 0.0320624

I'm saving the gradients and weights every epoch; whenever the cost increases, I restore the last checkpoint and decay the learning rate by a fixed factor. Seems to work.
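Roughly, the epoch loop now looks like this (a simplified sketch, not the exact code in my fork; `save_checkpoint`/`restore_checkpoint` are placeholders for copying the weight and gradsq buffers):

```cpp
#include <functional>
#include <limits>

// Sketch of the rollback-and-decay loop.
void train(int epochs, double eta,
           const std::function<double(double)>& run_epoch,       // one pass, returns cost
           const std::function<void()>& save_checkpoint,
           const std::function<void()>& restore_checkpoint) {
    const double decay = 0.5;                         // eta is halved on every rollback
    double prev_cost = std::numeric_limits<double>::max();
    save_checkpoint();                                // baseline before epoch 1
    for (int iter = 1; iter <= epochs; ) {
        double cost = run_epoch(eta);
        if (cost > prev_cost) {                       // divergence detected (e.g. 0.042 -> 43.9)
            restore_checkpoint();                     // roll back to the last good parameters
            eta *= decay;                             // retry the same epoch with a smaller step
            continue;                                 // iter is intentionally not advanced
        }
        save_checkpoint();                            // this epoch becomes the new checkpoint
        prev_cost = cost;
        ++iter;
    }
}
```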

Another change I made: switching from double to single precision gave a major speedup.
Each epoch took 19 min before; now it's about 9 min on the same machine.
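In my version the precision switch is essentially one type alias:

```cpp
// using real = double;   // original precision: ~19 min/epoch here
using real = float;       // single precision:   ~9 min/epoch on the same machine
```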

I also rewrote it in C++/STL to get rid of the mallocs and possible memory leaks.
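For example, the big malloc'd parameter blocks become containers that clean up after themselves; a sketch using the sizes from the run above, not a diff of my fork:

```cpp
#include <cstddef>
#include <vector>

using real = float;

int main() {
    const std::size_t vocab_size = 85748;    // values from the run above
    const std::size_t vector_size = 600;

    // Word + context vectors, each row with one extra slot for the bias term;
    // the container owns its memory, so there is no free() to forget.
    std::vector<real> W(2 * vocab_size * (vector_size + 1));

    // AdaGrad accumulators start at 1 so the first step uses the full eta.
    std::vector<real> gradsq(W.size(), real(1));
    return 0;
}
```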

pushed my changes to my own cloned repo https://github.com/rpvelloso/GloVe

  1. vocab count & glove rewritten in C++ (will rewrite other modules later);
  2. glove runs twice as fast with fp16 math;
  3. rollback epoch if cost increases and decays 'eta'.

closing.