about training speed
LQchen1 opened this issue · 11 comments
Thank you for your excellent work. I am trying to reproduce your results. I ran MLLA-T on 8×A100 GPUs and found that one epoch takes about 2 hours. Is this training speed normal? It currently uses only 24GB of GPU memory, and even if I increase batch_size, I think this time would still be unacceptable.
Hi @LQchen1, thanks for your interest in our work.
I believe there is something wrong. When I train MLLA-T using 8*RTX3090, it takes about 10 minutes to finish one epoch. It may take less than 10 minutes to train MLLA-T for one epoch on 8*A100.
[2024-06-11 01:26:48 mlla_tiny] (main.py 185): INFO Start training
[2024-06-11 01:27:42 mlla_tiny] (main.py 296): INFO Train: [1/300][100/1251] eta 0:10:18 lr 0.000005 time 0.4640 (0.5371) loss 6.9216 (6.9214) grad_norm 0.4345 (0.4588) mem 14932MB
[2024-06-11 01:28:29 mlla_tiny] (main.py 296): INFO Train: [1/300][200/1251] eta 0:08:50 lr 0.000009 time 0.5026 (0.5040) loss 6.9210 (6.9169) grad_norm 0.4001 (0.4394) mem 14932MB
[2024-06-11 01:29:17 mlla_tiny] (main.py 296): INFO Train: [1/300][300/1251] eta 0:07:50 lr 0.000013 time 0.4652 (0.4941) loss 6.9226 (6.9148) grad_norm 0.4002 (0.4274) mem 14932MB
[2024-06-11 01:30:04 mlla_tiny] (main.py 296): INFO Train: [1/300][400/1251] eta 0:06:56 lr 0.000017 time 0.4729 (0.4890) loss 6.9135 (6.9124) grad_norm 0.3630 (0.4166) mem 14932MB
[2024-06-11 01:30:51 mlla_tiny] (main.py 296): INFO Train: [1/300][500/1251] eta 0:06:05 lr 0.000021 time 0.4781 (0.4857) loss 6.9038 (6.9103) grad_norm 0.4079 (0.4092) mem 14932MB
[2024-06-11 01:31:39 mlla_tiny] (main.py 296): INFO Train: [1/300][600/1251] eta 0:05:15 lr 0.000025 time 0.4626 (0.4837) loss 6.8826 (6.9072) grad_norm 0.4651 (0.4137) mem 14932MB
[2024-06-11 01:32:26 mlla_tiny] (main.py 296): INFO Train: [1/300][700/1251] eta 0:04:26 lr 0.000029 time 0.4630 (0.4820) loss 6.8112 (6.9009) grad_norm 0.6705 (0.4444) mem 14932MB
[2024-06-11 01:33:13 mlla_tiny] (main.py 296): INFO Train: [1/300][800/1251] eta 0:03:37 lr 0.000033 time 0.4755 (0.4810) loss 6.8075 (6.8910) grad_norm 1.0771 (0.5062) mem 14932MB
[2024-06-11 01:34:01 mlla_tiny] (main.py 296): INFO Train: [1/300][900/1251] eta 0:02:48 lr 0.000037 time 0.4675 (0.4801) loss 6.8490 (6.8783) grad_norm 1.4148 (0.5976) mem 14932MB
[2024-06-11 01:34:48 mlla_tiny] (main.py 296): INFO Train: [1/300][1000/1251] eta 0:02:00 lr 0.000041 time 0.4878 (0.4793) loss 6.7347 (6.8658) grad_norm 1.6200 (0.6977) mem 14932MB
[2024-06-11 01:35:36 mlla_tiny] (main.py 296): INFO Train: [1/300][1100/1251] eta 0:01:12 lr 0.000045 time 0.4628 (0.4792) loss 6.7812 (6.8526) grad_norm 2.6039 (0.7968) mem 14932MB
[2024-06-11 01:36:23 mlla_tiny] (main.py 296): INFO Train: [1/300][1200/1251] eta 0:00:24 lr 0.000049 time 0.4653 (0.4788) loss 6.7859 (6.8408) grad_norm 2.3491 (0.8946) mem 14932MB
[2024-06-11 01:36:48 mlla_tiny] (main.py 304): INFO EPOCH 1 training takes 0:09:59
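As a sanity check on the log above, the epoch time follows from the running-average batch time: 1251 iterations at about 0.4788 s each is roughly 599 s, which matches the reported 0:09:59. A minimal sketch that pulls these numbers out of a log line (the regular expression and the function name are illustrative, not part of main.py):

```python
import re
from datetime import timedelta

# Matches "[epoch/epochs][iter/iters] ... time <current> (<average>)" as it
# appears in the log lines above. Illustrative pattern, not taken from main.py.
LINE_RE = re.compile(r"\[(\d+)/(\d+)\]\[(\d+)/(\d+)\].*?time ([\d.]+) \(([\d.]+)\)")

def estimate_epoch_seconds(log_line):
    """Estimate seconds per epoch as iterations per epoch times average batch time."""
    m = LINE_RE.search(log_line)
    if m is None:
        return None
    iters_per_epoch = int(m.group(4))
    avg_batch_time = float(m.group(6))
    return iters_per_epoch * avg_batch_time

line = ("[2024-06-11 01:36:23 mlla_tiny] (main.py 296): INFO Train: "
        "[1/300][1200/1251] eta 0:00:24 lr 0.000049 time 0.4653 (0.4788) "
        "loss 6.7859 (6.8408) grad_norm 2.3491 (0.8946) mem 14932MB")
seconds = estimate_epoch_seconds(line)
print(timedelta(seconds=round(seconds)))  # 0:09:59
```

By the same arithmetic, and assuming the same 1251 iterations per epoch, a 2-hour epoch would correspond to an average of about 5.8 s per iteration, so comparing the `time` column of a slow run against the values above shows how far off each step is.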
When I ran other code (VMamba), I did not find anything wrong with the server. Sorry, I did not mean that there is something wrong with your code. I want to reinstall the conda environment and try again; to be honest, I did not install the environment exactly as your repository requires. If I get a result, I will report it. Of course, I would be grateful if you could provide a detailed configuration of the environment.
Here's part of my environment setup:
Hi @LQchen1, it seems that the initial batches are experiencing significant delays, resulting in a long estimated time. Perhaps you can wait until the first epoch is completed to see how long it really takes.
@tian-qing001 Thanks for your advice, I will keep it running. I guess the likely reason is that the dataset is not stored on the local server but on a shared server. However, I did not see such long initial batch times when running other code.
And I noticed that the batch size you used when training VMamba is 4x that of training MLLA. It might be helpful to increase the batch size, i.e. --amp --batch-size 512.
@tian-qing001 Yes, increasing the batch size will reduce the time, but the first epoch of VMamba takes only about 15 minutes, which is 8× faster than the current code. This is clearly not just a matter of batch size. I will run with the same batch size to troubleshoot further, and use --amp.
@LQchen1 Thank you for trying.
@tian-qing001
Hello, tian, I used two A100 cards to test their training speed with a batch size of 512. Here are the results:
Vmamba:
MLLA(use --amp):
I am very sad that I can't train MLLA fast. I will try changing the torch version next. Is it possible that this is the reason for the slow MLLA training?
Hi @LQchen1.
It seems there might be an issue with the data loading process because your dataset isn't stored locally. I do not believe the model could run this slowly with any version of PyTorch.
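One way to check this is to time how long each iteration waits for the next batch versus how long the training step itself takes. A minimal sketch in plain Python, where `batches` stands in for the data loader and `step` for the forward/backward pass (both are placeholders, not names from the MLLA code):

```python
import time

def profile_loop(batches, step, max_iters=50):
    """Return (seconds spent waiting for data, seconds spent in step)."""
    data_time = 0.0
    step_time = 0.0
    it = iter(batches)
    for _ in range(max_iters):
        t0 = time.perf_counter()
        try:
            batch = next(it)  # time spent here is data-loading wait
        except StopIteration:
            break
        t1 = time.perf_counter()
        step(batch)           # time spent here is the training step
        t2 = time.perf_counter()
        data_time += t1 - t0
        step_time += t2 - t1
    return data_time, step_time

# Toy stand-ins: a deliberately slow "loader" and a no-op "step".
def slow_loader(n, delay=0.01):
    for i in range(n):
        time.sleep(delay)  # simulates slow reads from shared storage
        yield i

data_s, step_s = profile_loop(slow_loader(5), lambda batch: None)
print(f"data wait: {data_s:.3f}s, step: {step_s:.3f}s")
```

If the data wait dominates, the bottleneck is storage or data loading rather than the model. With a real GPU step, call `torch.cuda.synchronize()` inside `step` before it returns, since CUDA kernels launch asynchronously and the step would otherwise appear faster than it is.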
I also noticed you had a similar problem when training VMamba a few days ago. How did you fix that issue?
@tian-qing001
When training was slow at first, I followed VMamba's advice: if training is slow, disable cuDNN acceleration, like this:
torch.backends.cudnn.enabled = False
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = False
But later I found that the program also ran normally with cuDNN enabled, so in the end I did not actually change anything.
Maybe I can solve this problem by copying the data to local storage now. I will try. Thank you for your suggestion.
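For reference, staging a dataset from a shared mount onto local disk can be sketched as below. The function name and the directory layout are hypothetical; temporary directories stand in for the shared server and the local disk so the example is self-contained.

```python
import shutil
import tempfile
from pathlib import Path

def stage_dataset(src, dst):
    """Copy src to dst once; later calls reuse the existing local copy."""
    src, dst = Path(src), Path(dst)
    if not dst.exists():
        shutil.copytree(src, dst)  # also creates missing parent directories
    return dst

# Demo: temporary directories stand in for the shared mount and local disk.
with tempfile.TemporaryDirectory() as tmp:
    shared = Path(tmp) / "shared" / "imagenet"
    (shared / "train").mkdir(parents=True)
    (shared / "train" / "sample.txt").write_text("placeholder")

    local = stage_dataset(shared, Path(tmp) / "local" / "imagenet")
    copied = sorted(p.relative_to(local).as_posix() for p in local.rglob("*"))
    print(copied)
```

The training script would then be pointed at the returned local path instead of the shared one.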