Perf regression on A100 in v1.0.0+torch2.1.2+cu121+xformers0.23.post1 vs. 0.0.13+torch2.0.0+cu121+xformers0.22patch7
jon-chuang commented
LCM: 18.5ms -> 25.0ms.
The same regression occurs with v1.0.0+torch2.1.1+cu121+xformers0.23.
The nightly release is even worse (30ms).
When I use v1.0.0 with torch 2.1.2 and xformers 0.23.post1, I do not observe this issue, so the problem appears to be in stable-fast itself.
Perf is worse even with 0.0.13, which does not compile vae.encode (20.5ms).
A similar regression is observed on H100.
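For reference, per-image latencies like these are usually collected with a warm-up pass followed by a timed loop. Below is a minimal benchmark sketch, not the reporter's actual script: it assumes stable-fast v1.x's diffusion_pipeline_compiler module, README-style config flags, and the SimianLuo/LCM_Dreamshaper_v7 checkpoint, none of which are named in the thread.

```python
import time

import torch
from diffusers import DiffusionPipeline
from sfast.compilers.diffusion_pipeline_compiler import CompilationConfig, compile

# Load an LCM checkpoint (assumed; the thread does not name the model).
pipe = DiffusionPipeline.from_pretrained(
    "SimianLuo/LCM_Dreamshaper_v7", torch_dtype=torch.float16
).to("cuda")

# stable-fast v1.x compilation config; flags mirror the project README.
config = CompilationConfig.Default()
config.enable_xformers = True
config.enable_triton = True
config.enable_cuda_graph = True
pipe = compile(pipe, config)

kwargs = dict(
    prompt="a photo of a cat",
    num_inference_steps=4,  # LCM needs only a few steps
    guidance_scale=8.0,
)

# Warm up: tracing, Triton compilation, and CUDA-graph capture happen
# here, so they are excluded from the measurement below.
for _ in range(3):
    pipe(**kwargs)

torch.cuda.synchronize()
start = time.perf_counter()
n = 10
for _ in range(n):
    pipe(**kwargs)
torch.cuda.synchronize()
print(f"mean latency: {(time.perf_counter() - start) / n * 1000:.1f} ms per image")
```

Keeping compilation and graph capture inside the warm-up loop ensures only steady-state latency is compared between the two stable-fast versions.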
chengzeyi commented
This shouldn't happen. What's your script?
chengzeyi commented
When I run python3 examples/optimize_lcm_lora.py, I still see a significant speedup, so I don't know what's wrong.
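Since the two machines report different numbers, a quick sanity check is to print the installed versions on both sides before comparing further. A minimal sketch, assuming the distribution names torch, xformers, and stable-fast:

```python
# Print the installed versions of the packages being compared.
# Distribution names are assumptions ("stable-fast" in particular).
from importlib.metadata import PackageNotFoundError, version

import torch

print("torch", torch.__version__, "cuda", torch.version.cuda)
for dist in ("xformers", "stable-fast"):
    try:
        print(dist, version(dist))
    except PackageNotFoundError:
        print(dist, "not installed")
```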