HazyResearch/ThunderKittens

unable to reproduce attn_causal speeds

152334H opened this issue · 3 comments

Hello. I followed this guide to compile causal attention bindings for PyTorch. That is to say, I ran:

~/ThunderKittens$ source env.src
~/ThunderKittens$ cd examples/attn_causal
~/ThunderKittens/examples/attn_causal$ sed -i 's,// #define TORCH_COMPILE,#define TORCH_COMPILE,' h100_train.cu
~/ThunderKittens/examples/attn_causal$ sed -i 's,// #define TORCH_COMPILE,#define TORCH_COMPILE,' h100_fwd.cu
~/ThunderKittens/examples/attn_causal$ python h100_fwd_setup.py build && python h100_train_setup.py build # no build errors here
~/ThunderKittens/examples/attn_causal$ export PYTHONPATH="$PYTHONPATH":build/lib.linux-x86_64-cpython-310
~/ThunderKittens/examples/attn_causal$ python h100_train_check.py # all differences < 0.001
~/ThunderKittens/examples/attn_causal$ python h100_train_atn.py
Using device: cuda
Measure Performance for Backward Pass Only
Head Dim = 64, Seq Len = 1024, Heads = 16, Batch = 32
Average time taken: 843.89 us
Efficiency: 203.58 TFLOPS
______________________________________________________
Using device: cuda
Measure Performance for Forward and Backward Pass
Head Dim = 64, Seq Len = 1024, Heads = 16, Batch = 32
Average time taken: 1141.82 us
Efficiency: 210.64 TFLOPS
______________________________________________________
Using device: cuda
Measure Performance for Backward Pass Only
Head Dim = 64, Seq Len = 2048, Heads = 16, Batch = 32
Average time taken: 2570.54 us
Efficiency: 267.33 TFLOPS
______________________________________________________
Using device: cuda
Measure Performance for Forward and Backward Pass
Head Dim = 64, Seq Len = 2048, Heads = 16, Batch = 32
Average time taken: 3535.34 us
Efficiency: 272.13 TFLOPS
______________________________________________________
Using device: cuda
Measure Performance for Backward Pass Only
Head Dim = 64, Seq Len = 4096, Heads = 16, Batch = 32
Average time taken: 9397.16 us
Efficiency: 292.51 TFLOPS
______________________________________________________
Using device: cuda
Measure Performance for Forward and Backward Pass
Head Dim = 64, Seq Len = 4096, Heads = 16, Batch = 32
Average time taken: 12822.46 us
Efficiency: 300.12 TFLOPS
______________________________________________________
Using device: cuda
Measure Performance for Backward Pass Only
Head Dim = 64, Seq Len = 8192, Heads = 16, Batch = 32
Average time taken: 36052.26 us
Efficiency: 304.98 TFLOPS
______________________________________________________
Using device: cuda
Measure Performance for Forward and Backward Pass
Head Dim = 64, Seq Len = 8192, Heads = 16, Batch = 32
Average time taken: 49443.63 us
Efficiency: 311.33 TFLOPS
______________________________________________________
Using device: cuda
Measure Performance for Backward Pass Only
Head Dim = 64, Seq Len = 16384, Heads = 16, Batch = 32
Average time taken: 142140.59 us
Efficiency: 309.42 TFLOPS
______________________________________________________
Using device: cuda
Measure Performance for Forward and Backward Pass
Head Dim = 64, Seq Len = 16384, Heads = 16, Batch = 32
Average time taken: 194657.64 us
Efficiency: 316.31 TFLOPS
______________________________________________________

These results are all significantly worse than what is reported in the graph.


Note:

  • I added export PYTHONPATH="$PYTHONPATH":build/lib.linux-x86_64-cpython-310 because the sys.path.append('build/lib.linux-x86_64-3.10') injected in h100_train_atn.py does not match the build path I got (see the sketch after this list).
  • I am using Ubuntu 22.04 on an HGX H100 node with driver 550.54.15 and nvcc cuda_12.4.r12.4/compiler.34097967_0.
    I have gcc 14.0.0-1ubuntu1.1 and clang 14.0.0-1ubuntu1.1.
  • I tested building and running the Python bindings on both PyTorch 2.3.0+cu121 and 2.4.0.dev20240512+cu124.
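
An alternative to the export, as a rough sketch (the glob pattern is my assumption based on the default setuptools build layout, and the final import name is whatever the setup scripts register), is to patch the path inside the script itself:

# Rough sketch: prepend whichever setuptools build directory exists to sys.path,
# instead of hard-coding 'build/lib.linux-x86_64-3.10'.
import glob
import sys

build_dirs = glob.glob("build/lib.*")  # e.g. build/lib.linux-x86_64-cpython-310
assert build_dirs, "run the setup scripts (python h100_fwd_setup.py build / python h100_train_setup.py build) first"
sys.path[:0] = build_dirs
# After this, import the extension under whatever module name the setup script declares.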

Hi @152334H,

We noticed the gap you are mentioning as well and traced it down to power throttling on our H100s. It would be worth checking whether your H100s are power throttling too; there is more about this here: https://stackoverflow.com/questions/37419741/why-does-my-cuda-kernel-execution-time-increase-with-successive-launches
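
One rough way to check (a generic sketch, not something from the repo; it only assumes nvidia-smi is on PATH) is to poll clocks and power in a second terminal while the benchmark runs:

# Minimal throttling check: poll clock/power/temperature while a benchmark runs.
# Run this in a separate terminal; requires `nvidia-smi` on PATH.
import subprocess
import time

FIELDS = "clocks.sm,clocks.max.sm,power.draw,power.limit,temperature.gpu"

for _ in range(30):
    row = subprocess.run(
        ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    # An SM clock well below clocks.max.sm while power.draw is pinned at
    # power.limit is a strong sign of power throttling.
    print(row)
    time.sleep(1)

If it is throttling, enabling persistence mode and locking the clocks (sudo nvidia-smi -pm 1, then sudo nvidia-smi -lgc <min,max> with values from nvidia-smi -q -d SUPPORTED_CLOCKS) usually makes the timings much more stable.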

Our C++ harnesses for all kernels consistently achieve the TFLOPS figures reported in the graphs, so it may be helpful to try those out if you're trying to reproduce performance. I also just pushed an updated PyTorch script that helps partially alleviate power throttling interfering with kernel timing.
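
For reference, the general pattern such a script follows (a generic sketch under my own assumptions, not the repo's exact code) is to warm the kernel up and then average over many iterations timed with CUDA events, so that clock ramp-up and transient throttling are amortized:

# Generic benchmarking pattern: warm up, then time many iterations with CUDA
# events and report the mean. `fn` is any zero-argument callable that launches
# the kernel under test (a placeholder, not a ThunderKittens API).
import torch

def bench_us(fn, warmup: int = 10, iters: int = 100) -> float:
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) * 1e3 / iters  # microseconds per iteration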

Hi @Aaryan0404,
I also ran into the same issue: the attention throughput reported by h100_train_atn.py only reaches about 313 TFLOPS.
May I ask how you configured your H100s to keep them from power throttling?

I also tested the C++ version of the attention example directly via "./attn_causal randn_causal_4096N_128D.txt", which reported 318 TFLOPS. In your graph, however, the number is 413 TFLOPS at a sequence length of 4096 and head dim of 128.

The following is the frequency information of my GPU:
[screenshot of GPU frequency information]

Best regards,
Francis

Hi @CCrainys,

Thanks for your post. To clarify, the 413 TFLOPS number you see in the graph is for the forward pass of causal attention (head dim = 128). None of our attention backward kernels (causal or non-causal) surpassed 335 TFLOPS (up to sequence length 16K) in our benchmarking, as our graphs reflect. h100_train_atn.py is a benchmarking script for our backward pass kernels.
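
As a quick sanity check on the units (the FLOP convention here is my assumption of the usual FlashAttention-style counting, not code from the repo), the efficiency numbers printed in the log above line up with 4·B·H·N²·D/2 FLOPs for a causal forward pass and roughly 2.5× that for the backward pass:

# Back-of-the-envelope TFLOPS check for the numbers printed by h100_train_atn.py.
# FLOP convention assumed: causal forward ~= 4*B*H*N^2*D / 2, backward ~= 2.5x forward.
B, H, N, D = 32, 16, 1024, 64           # batch, heads, sequence length, head dim

fwd_flops = 4 * B * H * N * N * D / 2   # causal mask halves the matmul work
bwd_flops = 2.5 * fwd_flops

def tflops(flops: float, time_us: float) -> float:
    return flops / (time_us * 1e-6) / 1e12

print(tflops(bwd_flops, 843.89))               # ~203.6 TFLOPS (backward only, N=1024)
print(tflops(fwd_flops + bwd_flops, 1141.82))  # ~210.6 TFLOPS (forward + backward)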

If you're looking to reproduce the 413 TFLOPS result from the graph (forward pass of causal attention, head dim = 128), you should be able to do so by running the following Python script (h100_fwd_atn.py): https://github.com/HazyResearch/ThunderKittens/blob/main/examples/attn_causal/h100_fwd_atn.py.