microsoft/Semi-supervised-learning

Question about the acceleration of training

seekingup opened this issue · 2 comments

Thanks for the codebase. I have some questions about accelerating training.
I am currently running a CIFAR-based dataset with a configuration mostly matching the one provided in the repository, except that 'uratio' is increased to 7.
Training is quite slow (about 20 hours on a single GPU).
Here is what I have tried:

  1. I set num_workers to 4, but in htop the load average exceeds 20, so relative to the GPU memory in use, the CPU is the overloaded resource on my server, and it is hard for me to run multiple experiments at once. My impression is that training on CIFAR should not burden the CPU this heavily. Is there another configuration I have overlooked that would speed up training? (See the DataLoader sketch after this list.)
  2. I tried increasing the batch size (along with the learning rate) using the code below, but performance declined. Is there a problem with this way of modifying the configuration?
# mul: factor by which the batch size is scaled
if mul != 1:
    args.batch_size *= mul        # larger batches per step
    args.lr *= mul                # linear learning-rate scaling
    args.num_train_iter //= mul   # keep the total number of samples seen constant
    args.num_eval_iter //= mul    # evaluate at the same effective frequency
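
If the data workers are the bottleneck here, a few standard PyTorch DataLoader settings are worth checking; a minimal sketch, where train_dataset and the batch-size value are placeholders rather than the repository's actual loader construction:

from torch.utils.data import DataLoader

# Minimal sketch of loader settings that often reduce CPU pressure.
# train_dataset and batch_size are hypothetical placeholders.
loader = DataLoader(
    train_dataset,
    batch_size=448,            # e.g. labeled batch 64 * uratio 7
    num_workers=8,             # scale with the (large) unlabeled batch
    pin_memory=True,           # faster host-to-GPU copies
    persistent_workers=True,   # avoid re-forking workers every epoch
)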

Increasing 'uratio' to 7 means the unlabeled batch is 7 times the labeled batch, which is very large. And since the algorithms typically use "strong-weak" augmentation, every unlabeled image is loaded and augmented twice, doubling the data-loading work again. You can try increasing num_workers accordingly.
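
To make that concrete, here is a back-of-the-envelope count of images processed per iteration, assuming a labeled batch size of 64 (substitute the value from your config):

# Rough per-iteration data-loading cost; 64 is an assumed labeled batch size.
labeled_bs = 64
uratio = 7
unlabeled_bs = labeled_bs * uratio      # 448 unlabeled images per step
augmented_views = unlabeled_bs * 2      # weak + strong view of each image: 896
total_per_iter = labeled_bs + augmented_views
print(total_per_iter)                   # 960 images decoded and augmented per step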

Although linearly scaling the learning rate with the batch size is common practice in pre-training, I'm not sure it is suitable here, because some algorithms can be sensitive to the learning rate.
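
If linear scaling hurts accuracy, one commonly used and gentler alternative is square-root scaling; a sketch reusing the hypothetical args and mul names from the snippet above:

import math

# Variant of the earlier snippet with square-root LR scaling, which is
# often more stable than linear scaling at large batch-size multipliers.
if mul != 1:
    args.batch_size *= mul
    args.lr *= math.sqrt(mul)     # sqrt scaling instead of linear
    args.num_train_iter //= mul
    args.num_eval_iter //= mul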

Thanks for your reply. I will keep trying to improve training efficiency. It does seem that the overly large uratio is the main issue.