kijai/ComfyUI-HunyuanVideoWrapper

Maximum frames/steps etc for 24GB card? Keep getting OOM

Opened this issue · 11 comments

As per the title, I'm just wondering what we should be looking at. Even at 720 I'm getting OOMs (sometimes it works if I restart Comfy), so maybe something isn't being released after generation.

edit 1:

I don't see how people are generating decent size/length videos; I'm only able to get to 624x832 with 45 frames.

edit 2:

Below is the best I can generate so far with a 3090 and sage with block swap. Is this the best to be expected?

  • Swapping 20 double blocks and 0 single blocks
  • Sampling 97 frames in 25 latents at 544x960 with 30 inference steps
kijai commented

I only have experience with a 4090: 129 frames at 960x544 uses about 22GB with torch.compile; without torch.compile it will OOM, in both Comfy native and this wrapper. Compile seems to have a huge effect on VRAM use and is about 30% faster, but from what I hear compiling isn't working at fp8 on a 3090 and requires a 40xx card.

With the wrapper, sage/flash have proper memory use; the sdpa implementation is highly inefficient here and much better in Comfy native.

You can additionally enable swapping for up to 40 single blocks.

As for releasing the VRAM: it's always done when force_offload is enabled in the node, but it is NOT done if you interrupt the process, so that can leave stuff in VRAM temporarily.
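For intuition, block swapping comes down to keeping some of the transformer blocks in system RAM and moving each one onto the GPU only for its own forward pass, which is why it trades speed for VRAM. A minimal sketch of the general technique, not the wrapper's actual code; `blocks`, `swap_count` and `x` are placeholder names:

```python
import torch

# Minimal sketch of the block-swap idea (not this wrapper's implementation):
# keep the first `swap_count` blocks offloaded to system RAM and move each one
# to the GPU only while it runs, then push it back out to free VRAM.
def forward_with_block_swap(blocks, x, swap_count, device="cuda"):
    for i, block in enumerate(blocks):
        offloaded = i < swap_count
        if offloaded:
            block.to(device)   # PCIe transfer in: this is where the speed cost comes from
        x = block(x)
        if offloaded:
            block.to("cpu")    # transfer back out so the next block fits
    return x
```

The PCIe transfers are the reason swapping more blocks gets slower even though peak VRAM keeps dropping.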

Hey,

I have a GTX 1080 Ti and I can run these settings. It takes a long time though because of the old GPU, but with the same settings, how high can you crank the resolution/frames? (The 1080 Ti only has 11GB of VRAM.)

[screenshot of the settings]

By the way, these settings take me 160s per iteration. I'm curious how long it takes on a 3090, as I may buy one soon.

Currently I've only been testing for a few hours, but the below takes 3.2 min:

Sampling 129 frames in 33 latents at 512x384 with 20 inference steps
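As a side note, both logs in this thread line up with a 4x temporal compression in the video VAE, which is why 97 frames land in 25 latents and 129 frames in 33. A quick sketch of that arithmetic; the compression factor is inferred from the numbers above, not taken from the model docs:

```python
# Frame -> latent-frame arithmetic implied by the logs in this thread,
# assuming a 4x temporal compression in the video VAE (inferred, not confirmed).
def latent_frames(video_frames: int, temporal_compression: int = 4) -> int:
    return (video_frames - 1) // temporal_compression + 1

print(latent_frames(97))   # 25, matches "97 frames in 25 latents"
print(latent_frames(129))  # 33, matches "129 frames in 33 latents"
```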

@Fredd4e If you are on Linux, installing Sage is pretty simple and apparently gives good time savings.

@kijai Block swap keeps me under 24GB, but then I get OOM on VideoDecode, which kind of defeats the point unless I'm missing something?

I also see that all of these methods are for low VRAM. I can't get anywhere near the sizes suggested by the model creators, and I wouldn't consider 24GB to be low VRAM for an fp8 model.

kijai commented

You can reduce the tile size on the decode node; it works fine with 128 spatial (which halves the VRAM use compared to the default 256), but keep the temporal at 64 to avoid stuttering/ghosting in the result. You have to disable auto_size for the adjustments to take effect.

The max resolution is very heavy; they did say it takes something like 60GB initially, after all, so we are very much in "low VRAM" territory with 24GB.
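For anyone wondering what the tile settings actually change: tiled decoding runs the VAE over pieces of the latent instead of the whole thing at once, so peak decode VRAM scales with the tile size rather than the full resolution. A rough sketch of plain spatial tiling under that assumption; `vae.decode`, `latents` and the tile size are placeholders, and real tiled decoders also overlap/blend tiles and tile over time to hide seams:

```python
import torch

# Rough sketch of spatial tiled VAE decoding (not the actual node code).
# `latents` is assumed to be (B, C, T, H, W); each tile is decoded on its own,
# so only one tile's activations live in VRAM at a time.
def tiled_decode(vae, latents, tile=128):
    _, _, _, height, width = latents.shape
    rows = []
    for y in range(0, height, tile):
        row = []
        for x in range(0, width, tile):
            tile_latent = latents[:, :, :, y:y + tile, x:x + tile]
            row.append(vae.decode(tile_latent))
        rows.append(torch.cat(row, dim=-1))   # stitch tiles along width
    return torch.cat(rows, dim=-2)            # stitch rows along height
```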

Makes sense, can't wait for a dual 5090 setup!

For me, for now, it seems I max out at 1024x1024 with 109 frames = 50 minutes! Not really worth it, but I'm sure improvements will come soon.

@Fredd4e If you are on Linux, installing Sage is pretty simple and apparently gives good time savings.

I wish. Currently I am on Windows 10. I did give SageAttention a try; if I understand correctly I need Triton to run it, and Triton does not seem to support my 1080 Ti.

However, if you still think it should work, I'd love to dig deeper.
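If it helps, you can at least check what the card reports before sinking more time into it. My assumption (not something from this repo) is that the Triton/SageAttention kernels want roughly Ampere-class hardware, i.e. compute capability 8.0 or newer, while a 1080 Ti is Pascal (6.1):

```python
import torch

# Print the GPU's CUDA compute capability; a 1080 Ti reports (6, 1).
# Treat the 8.0 threshold below as an assumption about Triton/Sage kernel
# requirements, not an official spec.
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")
print("Likely new enough for Triton/Sage kernels:", (major, minor) >= (8, 0))
```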

Sorry to hijack a bit, but I'm trying to run torch.compile and get this error. Any ideas?

kijai commented

It looks like a bug in torch on Windows; it's probably going to be fixed in 2.6.0. For now you can manually edit the code as this PR indicates: https://github.com/pytorch/pytorch/pull/138992/files

That file would be in your venv or python_embeded folder, for example:

\python_embeded\Lib\site-packages\torch\_inductor
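If you're not sure where that folder lives in your install, one way to print it is to ask the same Python that ComfyUI uses (e.g. the interpreter inside python_embeded on a portable install):

```python
# Print the on-disk location of torch._inductor for this interpreter,
# which is the folder the PR above patches files in.
import os
import torch._inductor

print(os.path.dirname(torch._inductor.__file__))
```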

That did it. Thank you.