stanfordnlp/GloVe

cost jumps after some epochs

rpvelloso opened this issue · 7 comments

This happened on different machines, at different points during training. I couldn't find where it happens in the source code.
Can't share the data :(

TRAINING MODEL
Read 563738072 lines.
Initializing parameters...Using random seed 1635360331
done.
vector size: 600
vocab size: 85748
x_max: 100
alpha: 0.75

10/27/21 - 03:59.19PM, iter: 1, cost: 0.071133
10/27/21 - 04:13.09PM, iter: 2, cost: 0.0551224
10/27/21 - 04:27.28PM, iter: 3, cost: 0.0505688
10/27/21 - 04:41.43PM, iter: 4, cost: 0.0454685
10/27/21 - 04:55.38PM, iter: 5, cost: 0.0416584
10/27/21 - 05:09.45PM, iter: 6, cost: 0.0392608
10/27/21 - 05:23.49PM, iter: 7, cost: 0.0376295
10/27/21 - 05:37.49PM, iter: 8, cost: 0.0365679
10/27/21 - 05:51.59PM, iter: 9, cost: 0.0359734
10/27/21 - 06:05.49PM, iter: 10, cost: 34.4493

Do you have any idea what might be happening here? I suspect it's something to do with cost[], but I can't pinpoint it.

lowering eta to 0.01 helped:

10/27/21 - 07:38.53PM, iter: 1, cost: 0.0874979
10/27/21 - 07:53.09PM, iter: 2, cost: 0.069715
10/27/21 - 08:07.01PM, iter: 3, cost: 0.0608068
10/27/21 - 08:20.54PM, iter: 4, cost: 0.0554227
10/27/21 - 08:34.46PM, iter: 5, cost: 0.0514724
10/27/21 - 08:48.39PM, iter: 6, cost: 0.048066
10/27/21 - 09:02.32PM, iter: 7, cost: 0.0448908
10/27/21 - 09:16.53PM, iter: 8, cost: 0.0418435
10/27/21 - 09:31.28PM, iter: 9, cost: 0.0389835
10/27/21 - 09:46.21PM, iter: 10, cost: 0.0364648
10/27/21 - 10:01.12PM, iter: 11, cost: 0.0344151
10/27/21 - 10:16.08PM, iter: 12, cost: 0.0328401
10/27/21 - 10:31.01PM, iter: 13, cost: 0.0316272
10/27/21 - 10:45.54PM, iter: 14, cost: 0.0306751
10/27/21 - 11:00.39PM, iter: 15, cost: 0.0299081

Maybe that was it: the learning rate was too large (I was using the default value, 0.05).
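For reference, the per-co-occurrence update is roughly the following (my simplified sketch, not the actual glove.c source; names like `gradsq_i`/`gradsq_j` are illustrative). eta scales every parameter step, so a single pair with a large error at eta = 0.05 can push the weights far enough that the squared error blows up on a later epoch:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Simplified sketch of the kind of AdaGrad step GloVe performs per pair (i, j).
// The pair's loss contribution is roughly 0.5 * f(x_ij) * diff^2, and eta
// scales every parameter update below. Bias updates are omitted for brevity.
void update_pair(std::vector<double>& w_i, std::vector<double>& w_j,
                 std::vector<double>& gradsq_i, std::vector<double>& gradsq_j,
                 double b_i, double b_j,
                 double x_ij, double x_max, double alpha, double eta) {
    double dot = 0.0;
    for (std::size_t k = 0; k < w_i.size(); ++k) dot += w_i[k] * w_j[k];
    double diff  = dot + b_i + b_j - std::log(x_ij);   // prediction error
    double fdiff = diff * (x_ij > x_max ? 1.0 : std::pow(x_ij / x_max, alpha));
    for (std::size_t k = 0; k < w_i.size(); ++k) {
        double g_i = fdiff * w_j[k];                    // gradient w.r.t. w_i[k]
        double g_j = fdiff * w_i[k];                    // gradient w.r.t. w_j[k]
        w_i[k] -= eta * g_i / std::sqrt(gradsq_i[k]);   // AdaGrad-scaled step
        w_j[k] -= eta * g_j / std::sqrt(gradsq_j[k]);
        gradsq_i[k] += g_i * g_i;                       // accumulate squared gradients
        gradsq_j[k] += g_j * g_j;
    }
}
```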

I've made some changes to deal with this issue during training:

TRAINING MODEL
Read 455065004 lines.
Initializing parameters...Using random seed 1635978159
done.
vector size: 600
vocab size: 85748
x_max: 100
alpha: 0.75
epochs: 40
eta: 0.05
11/03/21 - 07:39.33PM, iter: 1, cost: 0.0823551
11/03/21 - 07:56.42PM, iter: 2, cost: 0.064717
11/03/21 - 08:13.47PM, iter: 3, cost: 0.0599455
11/03/21 - 08:30.48PM, iter: 4, cost: 0.0536976
11/03/21 - 08:47.54PM, iter: 5, cost: 0.0487482
11/03/21 - 09:05.05PM, iter: 6, cost: 0.0456012
11/03/21 - 09:22.10PM, iter: 7, cost: 0.0435627
11/03/21 - 09:39.50PM, iter: 8, cost: 0.042235
11/03/21 - 09:56.59PM, iter: 9, cost: 43.8812
11/03/21 - 09:56.59PM cost increased, restoring last training checkpoint and lowering eta from 0.05 to 0.025
11/03/21 - 10:14.01PM, iter: 9, cost: 0.0361162
11/03/21 - 10:31.02PM, iter: 10, cost: 0.033401
11/03/21 - 10:48.04PM, iter: 11, cost: 0.0325761
11/03/21 - 11:05.09PM, iter: 12, cost: 0.0320624

I'm saving the gradients and weights every epoch; whenever the cost increases, I restore the last checkpoint and decay the learning rate by a fixed factor. Seems to work.
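Roughly, the epoch loop now looks like this (a simplified sketch, not the exact code in my fork; `save_checkpoint`/`restore_checkpoint` are placeholders for copying the weight and gradsq buffers):

```cpp
#include <functional>
#include <limits>

// Sketch of the rollback-and-decay loop.
void train(int epochs, double eta,
           const std::function<double(double)>& run_epoch,       // one pass, returns cost
           const std::function<void()>& save_checkpoint,
           const std::function<void()>& restore_checkpoint) {
    const double decay = 0.5;                         // eta is halved on every rollback
    double prev_cost = std::numeric_limits<double>::max();
    save_checkpoint();                                // baseline before epoch 1
    for (int iter = 1; iter <= epochs; ) {
        double cost = run_epoch(eta);
        if (cost > prev_cost) {                       // divergence detected (e.g. 0.042 -> 43.9)
            restore_checkpoint();                     // roll back to the last good parameters
            eta *= decay;                             // retry the same epoch with a smaller step
            continue;                                 // iter is intentionally not advanced
        }
        save_checkpoint();                            // this epoch becomes the new checkpoint
        prev_cost = cost;
        ++iter;
    }
}
```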

Another change I made: switching from double to single precision gave a major speedup.
Each epoch took 19 min before; now it's about 9 min on the same machine.
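In my version the precision switch is essentially one type alias:

```cpp
// using real = double;   // original precision: ~19 min/epoch here
using real = float;       // single precision:   ~9 min/epoch on the same machine
```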

I also rewrote it in C++/STL to get rid of the mallocs and possible memory leaks.
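For example, the big malloc'd parameter blocks become containers that clean up after themselves; a sketch using the sizes from the run above, not a diff of my fork:

```cpp
#include <cstddef>
#include <vector>

using real = float;

int main() {
    const std::size_t vocab_size = 85748;    // values from the run above
    const std::size_t vector_size = 600;

    // Word + context vectors, each row with one extra slot for the bias term;
    // the container owns its memory, so there is no free() to forget.
    std::vector<real> W(2 * vocab_size * (vector_size + 1));

    // AdaGrad accumulators start at 1 so the first step uses the full eta.
    std::vector<real> gradsq(W.size(), real(1));
    return 0;
}
```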

pushed my changes to my own cloned repo https://github.com/rpvelloso/GloVe

  1. vocab count & glove rewritten in C++ (will rewrite other modules later);
  2. glove runs twice as fast with fp16 math;
  3. rollback epoch if cost increases and decays 'eta'.

closing.