The issue of cudnn affecting speed
lp-094 opened this issue · 11 comments
Do not worry, this only happens on some machines (I have not actually found a pattern for why a machine will be slow; it may be related to the driver or library version).
Using torch.backends.cudnn.enabled=True in downstream tasks may be quite slow. If you find VMamba quite slow on your machine, disable it in vmamba.py; otherwise, ignore this.
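A minimal sketch of what "disable it" could look like; the exact place in vmamba.py where the flag is set may differ in the repository:

```python
import torch

# cudnn kernel selection can be slow on some driver/library combinations;
# turning it off falls back to the default CUDA kernels.
torch.backends.cudnn.enabled = False
```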
In fact, when I trained on 8xA100 GPUs with the batch size set to 512, it took me 2 hours to run one epoch.
That does not seem possible. What environment are you using?
Also, for 8xV100, the time is about 10 minutes per epoch.
I re-ran the program; the first epoch took a long time, but subsequent training was normal. However, I kept torch.backends.cudnn.enabled = True. Could you let me know whether this has a big impact on the model's performance?
It is still somewhat strange, and I do not know why the first epoch would be abnormal.
In my experiments, every epoch takes a similar amount of time, while the first iteration of each epoch is slow, as the program needs to load the data from the very beginning.
Enabling or disabling cudnn may influence the performance, but I think the difference is tolerable.
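If you want to check how large the difference is on your own machine, a rough timing sketch like the one below (not the repository's benchmark code; the small stand-in model and iteration count are arbitrary) can compare a few training iterations with cudnn enabled versus disabled:

```python
import time
import torch
import torch.nn as nn
import torch.nn.functional as F

def time_iters(cudnn_enabled: bool, n_iters: int = 20) -> float:
    """Average seconds per training iteration for a small stand-in model."""
    torch.backends.cudnn.enabled = cudnn_enabled
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = nn.Sequential(
        nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10),
    ).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    x = torch.randn(32, 3, 224, 224, device=device)
    y = torch.randint(0, 10, (32,), device=device)

    # Warm-up iteration so one-time kernel selection is not timed.
    F.cross_entropy(model(x), y).backward()
    opt.step()
    opt.zero_grad()
    if device == "cuda":
        torch.cuda.synchronize()

    start = time.time()
    for _ in range(n_iters):
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        opt.step()
        opt.zero_grad()
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.time() - start) / n_iters

print("cudnn on :", time_iters(True))
print("cudnn off:", time_iters(False))
```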
Looking at the logs, the slowdown seems to come from data loading, which took a lot of time because we keep the data on another server and make it available to each machine through data sharing.
@MzeroMiko
The configuration file I used is vmambav2_tiny_224.yaml. I compared my log with the author's log and found that the EMA accuracy is much lower than expected: my EMA accuracy is 0.29%, while the author's is 6.08% at the same epoch. My EMA accuracy updates very slowly. What could be the cause of this? I don't know much about how EMA works.
Oh, it is because the batch_size you use is much larger than mine. With EMA, the EMA parameters are updated toward the latest model parameters at every iteration. The smaller the batch size, the more iterations there are per epoch, so the more frequently the EMA parameters are updated and the higher the EMA accuracy will be at a given epoch.
But I cannot predict what will happen in the last 50 epochs, as training starts to converge. You may get higher performance with this batch size.
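For readers unfamiliar with EMA, here is an illustrative sketch of a per-iteration weight EMA (not necessarily the exact implementation used in this repository; the decay value 0.9999 is a common default, not taken from the config). Because the EMA copy moves only a small step toward the current weights on each optimizer step, a larger batch size means fewer iterations per epoch, so the EMA copy lags further behind early in training:

```python
import copy
import torch

class ModelEma:
    """Keeps an exponential moving average of a model's parameters."""

    def __init__(self, model: torch.nn.Module, decay: float = 0.9999):
        self.ema = copy.deepcopy(model).eval()
        self.decay = decay
        for p in self.ema.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        # Called once per training iteration:
        # ema_param <- decay * ema_param + (1 - decay) * param
        # (For simplicity this sketch only tracks parameters, not buffers.)
        for ema_p, p in zip(self.ema.parameters(), model.parameters()):
            ema_p.mul_(self.decay).add_(p, alpha=1.0 - self.decay)
```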
@MzeroMiko Thanks for your reply, it is now back to normal.