Changing epoch size and resulting number of epochs when running train.py
benliu961 opened this issue · 3 comments
Hello,
I am trying to fine-tune a Stockfish model with my training dataset of about 400 million positions and a validation dataset of about 50 million positions. When I ran it with "python train.py --batch-size 16384 --threads 2 --num-workers 2 --gpus 1 train_data.binpack val_data.binpack", I ended up with 999 epochs, which seems far too many.
How would I best run train.py? I assume I should set the epoch size and validation size to something other than the defaults, but I am not sure what to change them to. I also don't know how to control the resulting total number of epochs.
Thank you!
We defined an epoch to be 100M samples because the typical definition (the size of the dataset) is useless as a metric. Also, I believe the default validation step size is 1M; it doesn't really matter overall, since validation steps don't affect the network in any way. For the current network architecture it takes around 400-600 epochs (i.e. 40-60 billion samples) to reach saturation, so a 400M-position dataset would be cycled through roughly 100-150 times.
See https://github.com/glinscott/nnue-pytorch/wiki/Basic-training-procedure-(train.py) and official-stockfish/Stockfish@c079acc (and maybe https://github.com/glinscott/nnue-pytorch/wiki/Basic-training-procedure-(easy_train.py)).
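As a concrete starting point: if you just want to cap the total number of epochs rather than change what an epoch means, something like the following should work (this assumes your version of train.py passes --max_epochs through to the PyTorch Lightning trainer and still exposes the --epoch-size/--validation-size options; check python train.py --help to confirm):

python train.py --batch-size 16384 --threads 2 --num-workers 2 --gpus 1 --max_epochs 600 train_data.binpack val_data.binpack

This keeps the default 100M-sample epoch and 1M-sample validation size and simply stops training after 600 epochs instead of the 999 you saw.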
Is it really a good decision to change what an epoch is? The definition contradicts the one established by the majority of the ML community. Wouldn't it be better to use some other term: `iteration` or `data_step`, perhaps?
`iteration` has a well-established meaning, that is, the processing of one batch. `data_step` doesn't tell me anything. Whatever we choose will clash with PyTorch naming at the API level. I'd rather the ML community made actually useful definitions than have us trying to fix their mistakes.