chengzeyi/stable-fast

Perf regression on A100 in v1.0.0+torch2.1.2+cu121+xformers0.23.post1 vs. 0.0.13+torch2.0.0+cu121+xformers0.22.post7


LCM: 18.5 ms -> 25.0 ms.

The same regression appears with v1.0.0+torch2.1.1+cu121+xformers0.23, and the nightly release is even worse (30 ms).
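
For reference, these numbers are steady-state per-image latencies. A minimal sketch of how such timings can be measured (an assumption about methodology, not the exact benchmark used here; `pipe`, the prompt, and the step count are placeholders):

```python
import time
import torch

def bench(pipe, n_warmup=3, n_iters=10):
    # Warm up first so tracing / CUDA graph capture is excluded from timing.
    for _ in range(n_warmup):
        pipe(prompt='a photo of a cat', num_inference_steps=4)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_iters):
        pipe(prompt='a photo of a cat', num_inference_steps=4)
    torch.cuda.synchronize()  # drain queued GPU work before stopping the clock
    return (time.perf_counter() - start) / n_iters * 1e3  # mean ms per image

# Usage: print(f'{bench(compiled_pipe):.1f} ms / image')
```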

When I use v0.0.13 with torch 2.1.2 and xformers 0.23.post1, I do not observe this issue, so the regression comes from stable-fast itself rather than the torch/xformers upgrade.

Perf in v1.0.0 is worse even than 0.0.13 without vae.encode compiled (20.5 ms).

A similar regression is observed on H100.

This shouldn't happen. What's your script?

When I run `python3 examples/optimize_lcm_lora.py`, I still see a significant speedup, so I don't know what's wrong.
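
For what it's worth, a minimal repro along these lines would make the two setups directly comparable. This is a sketch, not the exact script: the LCM checkpoint is a placeholder, and the import path follows the v1.0.0 layout (0.0.x releases exposed the same API under `sfast.compilers.stable_diffusion_pipeline_compiler`, an assumption worth checking against the release being tested):

```python
import torch
from diffusers import DiffusionPipeline
# v1.0.0 module path; for 0.0.x, import from
# sfast.compilers.stable_diffusion_pipeline_compiler instead.
from sfast.compilers.diffusion_pipeline_compiler import compile, CompilationConfig

pipe = DiffusionPipeline.from_pretrained(
    'SimianLuo/LCM_Dreamshaper_v7',  # placeholder LCM checkpoint
    torch_dtype=torch.float16,
).to('cuda')

config = CompilationConfig.Default()
config.enable_xformers = True    # matches the xformers builds compared above
config.enable_triton = True
config.enable_cuda_graph = True
pipe = compile(pipe, config)

# Then time steady-state latency with a harness like bench() above.
```

Running the same file under 0.0.13 and v1.0.0 with identical torch/xformers builds would isolate the stable-fast change, per the reasoning earlier in the thread.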