Issues
A small typo and fix
#1390 opened by liguohao96 - 0
4 Failing `test_flash_attn_output_fp8` tests on H100
#1404 opened by BioGeek - 0
Does bar.sync Emit Semaphores Alongside bar.arrive?
#1403 opened by ziyuhuang123 - 0
Understanding sync and arrive in FA3 Store Function
#1401 opened by ziyuhuang123 - 1
Execution order between GEMM0 of the next iteration and GEMM1 of the current iteration in the pingpong scheduling pipeline for overlapping GEMMs and softmax between warpgroups
#1398 opened by tengdecheng - 0
When handling padding in seq_k, clear the g2s K tensor rather than keeping the default SMEM values
#1395 opened by NVIDIA-JerryChen - 2
FA-3 installation errors
#1387 opened by asahni04 - 2
Why does NamedBarrier in epilogue use NumMmaThreads(256) + NumThreadsPerWarp(32)?
#1389 opened by ziyuhuang123 - 0
Windows 11 Installation Error
#1388 opened by 404-xianjin - 2
Accuracy Drop with Flash-Attention Reimplementation in Encoder-Decoder Architecture (ViT)
#1376 opened by ImaGonEs - 1
seq_lens variable used in the attention kernel
#1378 opened by chakpongchung - 0
How to get actual col idx
#1385 opened by wenkechen - 6
Possible to install with just `torch` installed?
#1379 opened by davidmezzetti - 0
[ROCm] benchmark_flash_attention.py failing with Memory Access Fault
#1381 opened by nikhil-tensorwave - 6
Flash Attention 3 does not use dropout_p?
#1377 opened by nighting0le01 - 2
FA3 for CUDA 12.3, but torch only releases a CUDA 12.4 version
#1375 opened by wplf - 2
Headdim==96 in FA3
#1374 opened by wplf - 1
Why do we have a third barrier::QueryEmpty arrive?
#1372 opened by ziyuhuang123 - 2
Can wgmma.async and barrier.arrive Ensure GEMM Completion Before Moving Forward?
#1373 opened by ziyuhuang123 - 2
Question About Initial sync Behavior Without Prior arrive in Warpgroup Scheduling
#1371 opened by ziyuhuang123 - 2
Question about warp_scheduler_barrier_arrive in FA3 and cutlass::arch::NamedBarrier::arrive Usage
#1370 opened by ziyuhuang123 - 0
The byzantine copy of Tensor O
#1368 opened by phantaurus - 0
Add support for qk dim different from v dim in PR #1166
#1358 opened by YTianZHU - 4
Question about the equation in the Flash Attention 2 paper
#1349 opened by jeffrey-sunh1 - 0
Unable to cast Python instance of type <class 'torch._subclasses.fake_tensor.FakeTensor'> to C++ type
#1351 opened by zwhe99 - 2
How to specify the ROCm architecture during pip install
#1356 opened by deeptimhe - 0
Does flash-attn support FP8 inference on L40-48G?
#1355 opened by LinJianping - 0
Flashdecoding with appendKV might be incorrect
#1354 opened by DD-DuDa - 1
FP8 test failure on the latest 'decode' branch
#1352 opened by cscyuge - 5
RuntimeError: Error compiling objects for extension
#1346 opened by beyondguo - 0
[Q] Why is flash attention MFU over 100% on A800?
#1345 opened by wonderisland - 1
Breaking change for head size not divisible by 8
#1347 opened by felix-red-panda - 1
Issue with installing flash attention: `import flash_attn_2_cuda as flash_attn_cuda`
#1348 opened by hahmad2008 - 2
FA3 Failed to initialize the TMA descriptor
#1343 opened by li-yi-dong - 0
Assistance with implementing Flash Attention 2 for Turing
#1342 opened by samuelzxu