hkchengrex/Cutie

VRAM usage with amp

Zarxrax opened this issue · 8 comments

I have been testing the amp setting, and I am a little confused by the results I am seeing. With Cutie's default settings, I see less VRAM usage when amp: is disabled. Only when increasing max_internal_size do I get any VRAM benefit from enabling it.
Each test run was conducted following a fresh restart of the application.

For a short clip with only 79 frames, it used 2.5 GB with amp: True, and only 1.8 GB with amp: False.

For a clip that is 1888 frames long, I left all memory settings at their defaults except that I increased the long-term memory size, so that I could measure the memory usage without it getting purged.
With amp: True, the entire clip completed, but usage ended up right at 12 GB, which is the limit for my GPU.
With amp: False, the entire clip processed and ended up using 11 GB.

With the longer clip again, I increased max_internal_size to 720. This time I did see a huge benefit from amp: True.
With amp: True it was able to process 160 frames before coming to a stop due to running out of VRAM.
With amp: False, it was only able to process about 65 frames.

Taking max_internal_size back down a bit, to 540:
With amp: True I was able to process 1055 frames.
With amp: False I was able to process 755 frames.

So basically what I am seeing is that at a max_internal_size of 480 or lower, AMP is harmful to VRAM usage, and the more you increase max_internal_size, the more benefit it provides.
Can you confirm whether this result makes sense? I am not sure if it is something peculiar to my own system or if this is expected.
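For context, my understanding is that the amp: flag toggles PyTorch's autocast for the forward pass, roughly along these lines (the model and frame below are placeholders I made up, not Cutie's actual code):

```python
import torch
import torch.nn as nn

# Placeholder model and input, just to show the mechanism -- not Cutie's network.
model = nn.Conv2d(3, 16, kernel_size=3, padding=1).cuda().eval()
frame = torch.randn(1, 3, 480, 854, device="cuda")

use_amp = True  # what I assume the amp: config key maps to

with torch.inference_mode():
    with torch.cuda.amp.autocast(enabled=use_amp):
        # With autocast enabled, most activations are computed and stored in
        # float16, which usually lowers peak memory for large inputs.
        out = model(frame)

print(out.dtype)  # torch.float16 when use_amp is True, torch.float32 otherwise
```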

Where are you viewing the VRAM usage from?

The gauge on the right panel: GPU mem (all processes, w/ caching).

Yeah, that's not an accurate measure of how much memory the program "needs". PyTorch caches aggressively, i.e. it reserves more memory than it is actually using. The "torch, w/o caching" gauge is the more important one.
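If you want to see the distinction directly, these are the two counters PyTorch itself exposes (I'm not saying the GUI samples exactly these; the "all proc" figure likely comes from the driver):

```python
import torch

allocated = torch.cuda.memory_allocated()  # bytes held by live tensors -- what the program actually needs
reserved = torch.cuda.memory_reserved()    # bytes reserved by the caching allocator -- always >= allocated

print(f"allocated (w/o caching): {allocated / 1024**3:.2f} GiB")
print(f"reserved  (w/ caching):  {reserved / 1024**3:.2f} GiB")
```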

Alright thanks, I will review it some more.

I guess I am having trouble understanding why the one w/o caching is the one that matters.
When the one w/ caching fills up, processing slows to a crawl; I believe CPU mode would run faster at that point. The one w/o caching displays such a ridiculously small number that I assumed Cutie must really be using more VRAM than it shows.
Can this cache be cleared with something like torch.cuda.empty_cache(), or would that also clear useful data out of memory?

Hmm, I don't think I have seen that happen before. The program only actually uses the portion of GPU memory shown w/o caching. I cannot think of any reason for there to be a significant slowdown... In any case, it should either crash or continue running at normal speed (swapping shouldn't be possible).

You can try torch.cuda.empty_cache() -- it is not going to purge any useful data. However, I don't think it will help unless there is a bug in PyTorch.

The only thing I can think of is that I am on Windows, so maybe it handles the cache differently than Linux does. I am using PyTorch 2.2.

I tried adding a torch.cuda.empty_cache() call when the VRAM got full, and it seemed to work well. There was a short pause while it cleared the cache, and then it continued processing the next frames.
With max_internal_size at 720, it initially had to do this every couple hundred frames, but after clearing the cache a few times the VRAM usage stopped increasing and it processed the remainder of the video without stopping again.
The gauge displaying GPU mem w/o caching never went above 2 GB.
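For reference, what I added was something along these lines (the threshold value and calling it once per processed frame are just my own choices for this sketch, not Cutie settings):

```python
import torch

def maybe_empty_cache(threshold_gib: float = 10.0) -> None:
    # If PyTorch's caching allocator has grown past the threshold, hand the
    # cached-but-unused blocks back to the driver. Live tensors are untouched.
    if torch.cuda.memory_reserved() > threshold_gib * 1024**3:
        torch.cuda.empty_cache()
```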

Back to my original question about AMP: I had someone else who is also on Windows test it as well. They did not have the same findings that I did; they found that AMP consistently filled less of their total VRAM.
So I guess I will just leave AMP turned on. With the cache being emptied, I no longer have any concerns about VRAM usage.

Glad to see that you have a working solution. Unfortunately, I am still not sure what is causing this problem.
Thank you for the detailed report and description -- future users with the same problem should find this issue of great help 😄