[Bug]: [FMHA] Only partial results align with eager attention when using specific mask pattern

Question

fenghuohuo2001 opened this issue 2 months ago · 3 comments

fenghuohuo2001 commented 2 months ago

System Info

CPU x86_64
RTX 4090 24 GB

TensorRT 10.8
Ubuntu 22.04
NVIDIA-SMI 570.86.15
Driver Version: 570.86.15
CUDA Version: 12.8

Python 3.12.3
PyTorch 2.7.0

Reproduction

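import math

import torch
from torch import nn

def FMHA_eager(qkv: torch.Tensor, mask: torch.Tensor, cu_seqlens: torch.Tensor, seqlen: int, fy_packed_mask: torch.Tensor) -> torch.Tensor:
    # Eager reference attention; cu_seqlens and fy_packed_mask are not used here.
    dim = 1280  # num_heads * head_dim, kept for reference only
    num_heads = 16
    head_dim = 80
    qkv = qkv.transpose(2, 1)
    # Split the packed tensor into q, k, v, each [seqlen, num_heads, head_dim].
    q, k, v = qkv.reshape(seqlen, 3, num_heads, -1).permute(1, 0, 2, 3).unbind(0)
    # Move heads to the front: [num_heads, seqlen, head_dim].
    q = q.transpose(0, 1)
    k = k.transpose(0, 1)
    v = v.transpose(0, 1)
    # Scaled dot-product scores: [num_heads, seqlen, seqlen].
    attn_weights = torch.matmul(q, k.transpose(1, 2)) / math.sqrt(head_dim)
    print("attn_weights.shape", attn_weights.shape)
    print("mask.shape", mask.shape)
    # Additive mask: 0 keeps a position, -inf removes it.
    attn_weights = attn_weights + mask
    # Softmax in float32 for stability, then cast back to the input dtype.
    attn_weights = nn.functional.softmax(
        attn_weights, dim=-1, dtype=torch.float32
    ).to(q.dtype)
    attn_output = torch.matmul(attn_weights, v)
    # Back to [seqlen, num_heads * head_dim].
    attn_output = attn_output.transpose(0, 1)
    attn_output = attn_output.reshape(seqlen, -1)
    return attn_output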

Expected behavior

When using TensorRT-LLM's fused multi-head attention (FMHA) with a custom mask, the output does not match PyTorch's eager-mode attention.

The mask is a 1024x1024 matrix with the following pattern:

Diagonal blocks of size 80x80 are filled with 0 (valid attention).
All other positions are filled with -inf (masked out).
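
The issue does not show how the mask was built. As a minimal, dependency-free sketch of the pattern described above, the following plain-Python helper (the name `block_diagonal_mask` is hypothetical) assumes that blocks start at multiples of 80 and that the last block is simply truncated, since 1024 is not a multiple of 80:

```python
import math


def block_diagonal_mask(size: int, block: int) -> list[list[float]]:
    """Additive mask: 0.0 inside diagonal blocks, -inf everywhere else."""
    neg_inf = -math.inf
    return [
        # Two positions may attend to each other only if they share a block index.
        [0.0 if row // block == col // block else neg_inf for col in range(size)]
        for row in range(size)
    ]


# Assumed layout: 12 full 80x80 blocks plus one truncated 64x64 block.
mask = block_diagonal_mask(1024, 80)
```

In a PyTorch script the nested list could be passed to `torch.tensor(mask, dtype=torch.float16)` before being added to the attention scores.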

Comparing the FMHA output with the eager-mode output:
The first 128 rows ([0:128, :]) of the FMHA output match eager mode to within an FP16 tolerance of 1e-3.
All rows beyond the first 128 ([128:1024, :]) show significant discrepancies (deviations far exceeding 1e-3).
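
One way to locate the row where the two outputs start to diverge is a row-wise check against the tolerance. The helper below is an illustrative sketch on nested lists, not the script used in this issue; the name `first_mismatched_row` and the toy data are assumptions. NaN entries are treated as mismatches.

```python
def first_mismatched_row(reference, candidate, atol=1e-3):
    """Return the index of the first row that differs by more than atol, or None."""
    for index, (ref_row, cand_row) in enumerate(zip(reference, candidate)):
        # "not (diff <= atol)" is also True for NaN, so NaN counts as a mismatch.
        if any(not (abs(r - c) <= atol) for r, c in zip(ref_row, cand_row)):
            return index
    return None


# Toy data: row 0 is within tolerance, row 1 is not.
reference = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
candidate = [[0.0, 1.0005], [2.0, 3.5], [4.0, 5.0]]
mismatch = first_mismatched_row(reference, candidate)
```

With PyTorch tensors, calling `.float().tolist()` on each output would produce inputs in this form.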

Actual behavior

eager_output tensor([[ 0.0676,  0.0859, -0.0989,  ..., -0.0645, -0.2720, -0.0265],
        [-0.1104, -0.2632, -0.3628,  ...,  0.0553, -0.2137, -0.1802],
        [-0.0919, -0.0333, -0.0382,  ..., -0.0506,  0.2101,  0.0857],
        ...,
        [ 0.1455,  0.0831,  0.2939,  ..., -0.1184,  0.2710, -0.1012],
        [-0.4053,  0.0419,  0.4592,  ..., -0.0754,  0.2896, -0.0017],
        [ 0.1105, -0.2498,  0.1514,  ...,  0.1589,  0.1858,  0.6118]],
       device='cuda:0', dtype=torch.float16)
fhma_output tensor([[ 6.7383e-02,  8.5938e-02, -9.8816e-02,  ..., -6.4819e-02,
         -2.7173e-01, -2.6459e-02],
        [-1.1023e-01, -2.6294e-01, -3.6255e-01,  ...,  5.5603e-02,
         -2.1423e-01, -1.8030e-01],
        [-9.1919e-02, -3.3295e-02, -3.8483e-02,  ..., -5.0812e-02,
          2.1045e-01,  8.5327e-02],
        ...,
        [-5.4718e-02, -3.0396e-02,  5.5878e-02,  ..., -1.2024e-02,
         -7.6065e-03, -6.2683e-02],
        [ 4.8065e-02,  1.2955e-02,  6.4278e-03,  ...,  3.2139e-03,
         -1.2108e-02,  3.1042e-04],
        [ 1.0460e-02,  6.2866e-02, -7.6027e-03,  ...,  8.1421e-02,
         -1.5430e-01, -4.2755e-02]], device='cuda:0', dtype=torch.float16)


Answer 1 · 2025-09-03T11:07:45.000Z

@kaiyux @laikhtewari @juney-nvidia
Could you help me look into this problem?

Answer 2 · 2025-09-04T07:26:57.000Z

I suspect the problem is in the update step of the chunked softmax calculation.
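
For readers unfamiliar with the term, chunked (online) softmax processes the keys in blocks and keeps a running maximum and a running sum that are rescaled whenever a new maximum appears. The sketch below is a generic plain-Python illustration for a single attention row with scalar values. It is not TensorRT-LLM's kernel code, and all names in it are hypothetical. It also shows one well-known pitfall with -inf masks: if every key seen so far is masked, the rescale factor becomes exp(-inf - (-inf)), which is NaN unless it is guarded. Whether that is the cause of the mismatch reported here is not confirmed by this thread.

```python
import math


def direct_softmax_weighted_sum(scores, values):
    """Reference: softmax over all scores at once, then a weighted sum of values."""
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def chunked_softmax_weighted_sum(scores, values, chunk):
    """Same result, computed chunk by chunk with running statistics."""
    running_max = -math.inf
    running_sum = 0.0
    accumulator = 0.0
    for start in range(0, len(scores), chunk):
        s = scores[start:start + chunk]
        v = values[start:start + chunk]
        new_max = max(running_max, max(s))
        if new_max == -math.inf:
            # All keys so far are masked; skipping avoids exp(-inf - -inf) = NaN.
            continue
        # Rescale earlier partial results to the new maximum (0.0 on first use).
        scale = math.exp(running_max - new_max)
        weights = [math.exp(x - new_max) for x in s]
        running_sum = running_sum * scale + sum(weights)
        accumulator = accumulator * scale + sum(w * x for w, x in zip(weights, v))
        running_max = new_max
    # A fully masked row has no valid keys; return NaN, as a plain softmax would.
    return accumulator / running_sum if running_sum > 0.0 else math.nan


# Toy row whose first chunk of four keys is entirely masked.
scores = [-math.inf] * 4 + [0.5, -1.0, 2.0, -math.inf]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
chunked = chunked_softmax_weighted_sum(scores, values, chunk=4)
direct = direct_softmax_weighted_sum(scores, values)
```

Removing the guard makes the first chunk of the toy row produce NaN, which then propagates through every later update.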

Answer 3 · 2025-09-08T03:21:58.000Z

I found that the problem occurs in this piece of code, and it was resolved after the modification.