unable to reproduce attn_causal speeds
152334H opened this issue · 3 comments
Hello. I followed this guide to compile causal attention bindings for PyTorch. That is to say, I ran:
~/ThunderKittens$ source env.src
~/ThunderKittens$ cd examples/attn_causal
~/ThunderKittens/examples/attn_causal$ sed -i 's,// #define TORCH_COMPILE,#define TORCH_COMPILE,' h100_train.cu
~/ThunderKittens/examples/attn_causal$ sed -i 's,// #define TORCH_COMPILE,#define TORCH_COMPILE,' h100_fwd.cu
~/ThunderKittens/examples/attn_causal$ python h100_fwd_setup.py build && python h100_train_setup.py build # no build errors here
~/ThunderKittens/examples/attn_causal$ export PYTHONPATH="$PYTHONPATH":build/lib.linux-x86_64-cpython-310
~/ThunderKittens/examples/attn_causal$ python h100_train_check.py # all differences < 0.001
~/ThunderKittens/examples/attn_causal$ python h100_train_atn.py
Using device: cuda
Measure Performance for Backward Pass Only
Head Dim = 64, Seq Len = 1024, Heads = 16, Batch = 32
Average time taken: 843.89 us
Efficiency: 203.58 TFLOPS
______________________________________________________
Using device: cuda
Measure Performance for Forward and Backward Pass
Head Dim = 64, Seq Len = 1024, Heads = 16, Batch = 32
Average time taken: 1141.82 us
Efficiency: 210.64 TFLOPS
______________________________________________________
Using device: cuda
Measure Performance for Backward Pass Only
Head Dim = 64, Seq Len = 2048, Heads = 16, Batch = 32
Average time taken: 2570.54 us
Efficiency: 267.33 TFLOPS
______________________________________________________
Using device: cuda
Measure Performance for Forward and Backward Pass
Head Dim = 64, Seq Len = 2048, Heads = 16, Batch = 32
Average time taken: 3535.34 us
Efficiency: 272.13 TFLOPS
______________________________________________________
Using device: cuda
Measure Performance for Backward Pass Only
Head Dim = 64, Seq Len = 4096, Heads = 16, Batch = 32
Average time taken: 9397.16 us
Efficiency: 292.51 TFLOPS
______________________________________________________
Using device: cuda
Measure Performance for Forward and Backward Pass
Head Dim = 64, Seq Len = 4096, Heads = 16, Batch = 32
Average time taken: 12822.46 us
Efficiency: 300.12 TFLOPS
______________________________________________________
Using device: cuda
Measure Performance for Backward Pass Only
Head Dim = 64, Seq Len = 8192, Heads = 16, Batch = 32
Average time taken: 36052.26 us
Efficiency: 304.98 TFLOPS
______________________________________________________
Using device: cuda
Measure Performance for Forward and Backward Pass
Head Dim = 64, Seq Len = 8192, Heads = 16, Batch = 32
Average time taken: 49443.63 us
Efficiency: 311.33 TFLOPS
______________________________________________________
Using device: cuda
Measure Performance for Backward Pass Only
Head Dim = 64, Seq Len = 16384, Heads = 16, Batch = 32
Average time taken: 142140.59 us
Efficiency: 309.42 TFLOPS
______________________________________________________
Using device: cuda
Measure Performance for Forward and Backward Pass
Head Dim = 64, Seq Len = 16384, Heads = 16, Batch = 32
Average time taken: 194657.64 us
Efficiency: 316.31 TFLOPS
______________________________________________________
These results are all significantly worse than what is reported in the graph.
Note:
- I added export PYTHONPATH="$PYTHONPATH":build/lib.linux-x86_64-cpython-310 because the injected sys.path.append('build/lib.linux-x86_64-3.10') in h100_train_atn.py does not appear to match the build path I got (see the sketch after this list).
- I use Ubuntu 22.04 on an HGX H100 node with 550.54.15 drivers and cuda_12.4.r12.4/compiler.34097967_0 nvcc.
- I have gcc 14.0.0-1ubuntu1.1 and clang 14.0.0-1ubuntu1.1.
- I tested building and running the Python bindings on both PyTorch 2.3.0+cu121 and 2.4.0.dev20240512+cu124.
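For reference, this is the kind of path adjustment I mean; a minimal sketch (the glob pattern is an assumption about the setuptools build layout on my machine, and the import afterwards is whatever module name h100_train_setup.py defines):

import glob
import os
import sys

# Point sys.path at whichever build/lib.* directory setuptools actually produced,
# instead of hard-coding 'build/lib.linux-x86_64-3.10'.
build_dirs = glob.glob(os.path.join("build", "lib.*"))
assert build_dirs, "run `python h100_train_setup.py build` first"
sys.path.append(os.path.abspath(build_dirs[0]))
# The compiled extension (named in h100_train_setup.py) can now be imported as usual.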
Hi @152334H,
So, we noticed the gap you are mentioning and traced it to power throttling on our H100s. It would be worth checking whether your H100s are power throttling too; more about this here: https://stackoverflow.com/questions/37419741/why-does-my-cuda-kernel-execution-time-increase-with-successive-launches
Our C++ harnesses for all kernels consistently achieve the TFLOP performance reported in the graphs, so it may be helpful to try those out if you're trying to reproduce performance. I also just pushed an updated PyTorch script that partially alleviates power throttling interfering with kernel timing.
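One quick way to check (a rough sketch, not part of the repo: it polls standard nvidia-smi query fields while the benchmark runs in another terminal; sustained SM clocks well below the maximum with power draw pinned at the limit usually indicate power throttling):

import subprocess
import time

# Poll clocks, power, and temperature once per second for ten seconds.
FIELDS = "clocks.sm,clocks.max.sm,power.draw,power.limit,temperature.gpu"
for _ in range(10):
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    print(out.stdout.strip())
    time.sleep(1)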
Hi @Aaryan0404,
I also encountered the same issue: the attention FLOPS reported by h100_train_atn.py only reach about 313 TFLOPS.
May I ask how you configured your H100s to keep them from power throttling?
I also tested the C++ version of the attention example directly with "./attn_causal randn_causal_4096N_128D.txt", which returned 318 TFLOPS. However, according to your graph, it should reach 413 TFLOPS at a sequence length of 4096 and head dim of 128.
The following is the frequency information of my GPU:
Best regards,
Francis
Hi @CCrainys,
Thanks for your post. To clarify, the 413 TFLOP number you see in the graph is for the forward pass of causal attention (head dim = 128); none of our attention backward kernels (causal or non-causal) surpassed 335 TFLOPS (up to sequence length 16K) in our benchmarking, as our graphs reflect. h100_train_atn.py is a benchmarking script for our backward pass kernels.
If you're looking to recreate the 413 TFLOP benchmark from the graph (forward pass of causal attention, head dim = 128), you should be able to reproduce it by running the following Python script (h100_fwd_atn.py): https://github.com/HazyResearch/ThunderKittens/blob/main/examples/attn_causal/h100_fwd_atn.py
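As a rough cross-check, a measured time can be converted into TFLOPS along these lines (a sketch using the common 4·B·H·N²·D count for the two attention matmuls, halved for the causal mask; the script's exact accounting may differ slightly, and the backward pass uses a larger FLOP count, so this only applies to forward-only timings such as those from h100_fwd_atn.py):

def causal_attn_fwd_tflops(batch, heads, seq_len, head_dim, time_us):
    # Two (seq_len x seq_len) matmuls per head (QK^T and PV) at
    # 2 * seq_len^2 * head_dim FLOPs each, halved because the causal
    # mask skips the upper triangle.
    flops = 4 * batch * heads * seq_len * seq_len * head_dim * 0.5
    return flops / (time_us * 1e-6) / 1e12

# Example: the configuration from the logs above (B=32, H=16, D=64, N=4096),
# with a placeholder forward-only time in microseconds.
print(causal_attn_fwd_tflops(32, 16, 4096, 64, 2750.0))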