Nota-NetsPresso/BK-SDM

About the training speed

Closed this issue · 3 comments

KeyKy commented

I found that the total number of iterations for the training is 400,000. May I ask how many days it took you to train a distilled model? I am using 8×V100 GPUs, and I can only complete around 3,800 iterations in one night (from 19:55 to 10:00 the next day).

KeyKy commented

With a batch size of 256 (=4×64), training BK-SDM-Base for 50K iterations takes about 300 hours and 53GB GPU memory. With a batch size of 64 (=4×16), it takes 60 hours and 28GB GPU memory.
Training BK-SDM-{Small, Tiny} results in a 5∼10% decrease in GPU memory usage.
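For intuition, these figures imply a per-iteration cost that can be checked with quick arithmetic. A rough sketch, assuming both timings refer to the same 50K-iteration run (the 4× factor presumably being gradient accumulation, which is an assumption here):

```python
# Back-of-envelope: per-iteration cost implied by the quoted figures.
# Assumption: both timings cover the same 50K-iteration run.
ITERS = 50_000

for batch, hours in [(256, 300), (64, 60)]:
    sec_per_iter = hours * 3600 / ITERS
    print(f"batch {batch}: {sec_per_iter:.1f} s/iter")
# batch 256: 21.6 s/iter
# batch 64:  4.3 s/iter
```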

KeyKy commented

It seems that BK-SDM-Base would take 300 h × (400K / 50K) = 2,400 h.

Hi, we would like to clarify our setup.

I found that the total number of iterations for the training is 400,000.

  • No. Although our script specifies --max_train_steps=400000, we released the checkpoints at exactly the 50,000-th step, as described in our paper (see the sketch after this list).
    • The reason for setting a longer max_train_steps was to inspect how the number of iterations affects model performance.
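A minimal sketch of this setup, assuming a training loop with periodic checkpointing; save_checkpoint and the 50K checkpoint interval are illustrative assumptions, not the repository's actual code:

```python
def save_checkpoint(step: int) -> None:
    # Hypothetical helper; stands in for the real checkpoint-saving logic.
    print(f"saved checkpoint at step {step}")

max_train_steps = 400_000     # value passed via --max_train_steps
checkpointing_steps = 50_000  # assumption: save a checkpoint every 50K steps

for step in range(1, max_train_steps + 1):
    ...  # one distillation training step would go here
    if step % checkpointing_steps == 0:
        save_checkpoint(step)  # the released weights correspond to step 50_000
```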

I can only complete around 3,800 iterations in one night (from 19:55 to 10:00 the next day).

  • One night, from 19:55 to 10:00 the next day, is about 14 h.
  • 50,000 iter / 3,800 iter × 14 h ≈ 184.21 h (see the check below).
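The same estimate as a quick script, using only the numbers reported in this thread:

```python
# Reproduce the ETA estimate above from the reported throughput.
iters_done = 3_800     # iterations completed overnight
elapsed_h = 14         # 19:55 to ~10:00 the next day
target_iters = 50_000  # released checkpoint step

eta_h = target_iters / iters_done * elapsed_h
print(f"{eta_h:.2f} h (~{eta_h / 24:.1f} days)")  # 184.21 h (~7.7 days)
```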

Though our models were trained on a single A100, using multiple GPUs with a smaller per-GPU batch size can speed up training.
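One way to think about this is keeping the effective batch size fixed while spreading work across GPUs. A hedged sketch; the specific splits below are illustrative, not the repository's settings:

```python
# Effective batch size = num_gpus * per_gpu_batch * grad_accum_steps.
# Both configurations below reach the same effective batch of 256.
def effective_batch(num_gpus: int, per_gpu_batch: int, grad_accum: int) -> int:
    return num_gpus * per_gpu_batch * grad_accum

print(effective_batch(1, 64, 4))  # single A100: 256
print(effective_batch(8, 16, 2))  # 8 GPUs, smaller per-GPU batch: 256
```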