microsoft/hlsl-specs

[SPIR-V] Expose maximal reconvergence

alan-baker opened this issue · 10 comments

Is your feature request related to a problem? Please describe.
SPIR-V released SPV_KHR_maximal_reconvergence. I would like to expose it in DXC.

Describe the solution you'd like
The GLSL extension uses an attribute on entry points, so that would work here too (in the vk namespace). Equally palatable would be a command line switch to enable it. This might better match HLSL's desired subgroup semantics (documented here). Looking forward, I would expect this to be enabled most of the time (e.g. automatically).

Describe alternatives you've considered
An attribute or a command line option makes the most sense.

Additional context

cc @s-perron

This seems like a great feature, although I'm a little confused about the use case, probably due to lacking understanding of SPIR-V's semantics.

In DXIL, compiler optimizations generally aren't allowed to move wave operations around within control flow. If a wave operation gets moved, that's a bug in DXC. Does that not apply to SPIR-V subgroup operations?

If SPIR-V has the same semantics, then this shouldn't be required for HLSL to have the same behavior between DXIL and SPIR-V, because they should already match. If SPIR-V doesn't have that requirement, it seems like we should apply this execution mode to all shaders that use Wave* intrinsics, not just cases where it is opted into.

Without this feature, nothing prevented compilers from moving subgroup operations in many cases. For example, when peeling loops:

uint my_value = ...;
for (;;) {
  uint value = subgroupBroadcastFirst(my_value);
  if (my_value == value) {
    // Perform subgroup operations under the assumption
    // that only invocations with matching values will participate.
    uint out_value = subgroupExclusiveAdd(...);
    break;
  }
}

In that example, the add might be moved out of the loop. The new extension requires that add be performed such that the invocations included are those that were in that loop iteration together.
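
To make the hazard concrete, here is a sketch (not taken from any real compiler output) of the invalid transformed code a hoisting optimization might produce:

```glsl
// Hypothetical result of hoisting the subgroup add out of the loop.
// The add now executes with whatever set of invocations is converged
// before the loop, not the set that reached that loop iteration
// together -- which is exactly what the extension forbids.
uint my_value = ...;
uint out_value = subgroupExclusiveAdd(...); // moved: wrong invocation set
for (;;) {
  uint value = subgroupBroadcastFirst(my_value);
  if (my_value == value) {
    break;
  }
}
```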

Empirically, we found that many devices did not support that behaviour by default. Some public testing results (done via WebGPU) can be found here. All 3 major APIs had failures.

I agree it feels like it would be beneficial to always enable this feature. Unfortunately, rollout of this feature will take time, so it can't be enabled unconditionally yet. That leans towards a command line option for now that gets defaulted to on in the future.

TL;DR: I believe this should be a DXC option that, for now, defaults to not using maximal reconvergence. Once most new Vulkan devices support maximal reconvergence, then we should flip the default to using it. No changes to HLSL should be necessary.

With regard to Alan's example, HLSL already has an informal specification of how the subgroupExclusiveAdd must behave. The problem is that the corresponding SPIR-V allows the possibility that all of the invocations that started the loop have merged before executing the add. Without this new feature, we cannot force the drivers to do what is expected at the HLSL level. Because we are not making a change to the semantics of HLSL, just fixing a bug in SPIR-V, I do not think we need a proposal in HLSL specs.
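
For reference, the HLSL analogue of Alan's GLSL example uses WaveReadLaneFirst and WavePrefixSum. This is a sketch; `ComputeValue` is a hypothetical placeholder for however the shader produces its per-invocation value:

```hlsl
// Sketch of the same loop-peeling pattern in HLSL.
uint my_value = ComputeValue(); // hypothetical helper
for (;;) {
  uint value = WaveReadLaneFirst(my_value);
  if (my_value == value) {
    // HLSL's informal expectation: only invocations that reached this
    // loop iteration together participate in the prefix sum.
    uint out_value = WavePrefixSum(my_value);
    break;
  }
}
```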

So as @llvm-beanz said, this should apply to all shaders in order to generate correct code for HLSL.

However, we cannot enable it for all shaders yet because the feature is so new. There are essentially no drivers that accept the new extension. If we were to emit it all of the time, we would break everyone.

This is why I propose a DXC option that defaults to the status quo. When enough drivers have implemented the extension, we can flip the default for the option, and give people the correct behaviour all of the time.

I do not like an attribute because

  1. An attribute makes it seem like it is changing the semantics of the HLSL when it is not.
  2. It will be harder to make the correct behaviour the default in the future. We would need to deprecate the old attribute and introduce a new one for those who are targeting older hardware.
  3. It is easier to build the same shader both ways with an option, for those who will want to target both. With an attribute, the user would have to use #ifdefs in their code.

The attribute could work, but seems like it will be a little bit harder for everyone.

Hello!

Talked a bit with Alan, and yes, I agree with Steven, we should NOT add an attribute for that.
It seems like we should always add this extension/capability, as long as we have support for it. Which, as Steven mentioned, is not possible yet, as the extension was just released and driver support will take time to come.

So agreeing with Steven: let's add an option to DXC which adds this extension, and in a few months when the driver support is widespread enough, change the default to true.

note: Alan told me about the -modify-maximal-reconvergence pass in SPIRV-Tools which can be used to strip/add the extension to any SPIR-V module.
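
For anyone who wants to experiment before DXC support lands, that pass can be driven from the spirv-opt command line. The flag spelling below is my recollection and may differ across SPIRV-Tools versions, so treat it as a sketch:

```shell
# Add the maximal-reconvergence execution mode (and the
# SPV_KHR_maximal_reconvergence extension) to an existing module.
spirv-opt --modify-maximal-reconvergence=add in.spv -o out.spv

# Strip it again.
spirv-opt --modify-maximal-reconvergence=remove in.spv -o out.spv
```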

I'll leave this issue open until Monday, and if nobody objects, I'll close it and open an issue on DXC to add this flag.

I agree this isn't really a language change, although I'm actually really concerned that this means the SPIR-V backend for DXC doesn't correctly implement HLSL Wave operations. That seems extremely scary to me.

I'm also concerned that the definition of this flag is heavy-handed and may have much wider impact than just resolving the issues with the Wave intrinsics.

Is there a reason that compilers hoisting subgroup operations isn't just disallowed? That seems like the correct fix here. The SPIR-V definitions of subgroup operations look precise enough that optimizations of this nature change program meaning and should generally be considered bugs. Am I missing something?

For specific context to HLSL: LLVM has function attributes that allow denoting convergence requirements and memory semantics that enable LLVM to do the right thing (see: LLVM Language Reference: Function Attributes).

If you don't have some really rigorous testing, I wouldn't expect vendor drivers to implement HLSL's semantics correctly either. The traditional convergent attribute in LLVM is insufficient: it only prevents extra control dependencies from getting added to a function. That's where the experimental attributes arose. The common cases work, if for no other reason than vendors have fixed the common bugs users complain about. Testing found that the more complicated cases are where reliability is lost. I didn't have a wide range of D3D hardware to test on, but we saw failures. You could try enabling the extension and running the Vulkan CTS on your layered implementation and see if these tests pass across many devices.

For SPIR-V, the requirements around reconvergence and active invocations were never well defined. The core spec only required reconvergence at a merge node if all invocations were converged at the divergence. SPIR-V didn't allow for spontaneous divergence, but it didn't require reconvergence in enough places (nor did it prevent reconvergence in the right places). The extension formalized those missing pieces to provide a more intuitive behaviour.

There are some remaining gaps still. For example, switches have many possible implementations so we couldn't require a single behaviour. What does HLSL say about switches where multiple selector values branch to the same case? Are all invocations converged or just those with the same selector value? Should a fall through in a switch converge with invocations that branch directly to that case?
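
As a concrete illustration of the switch questions (a sketch; `selector`, `x`, and `y` are placeholders):

```hlsl
// The open question is which invocations are converged at each Wave
// op below: all invocations that entered the switch, or only those
// with the same selector value?
switch (selector) {
  case 0:
  case 1:
    // Two selector values branch to the same case body.
    result = WaveActiveSum(x);
    break;
  case 2:
    // Empty case: falls through into case 3.
  case 3:
    // Do invocations that fell through from case 2 converge here
    // with those that branched directly to case 3?
    result = WaveActiveSum(y);
    break;
}
```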

Here's the blog about the extension in case you're interested. It might help explain why it's necessary.

I agree this isn't really a language change, ...

Ok, so the intent and judgement is that HLSL is already supposed to work this way, and D3D is already supposed to work this way.

But experiments show real devices in the field don't work this way. gpuweb/gpuweb#4306 (comment)

Assuming those devices and drivers passed D3D qualification, then there is a gap in the D3D qualification suite.

We want developers to be able to rely on the good and intuitive behaviour. We want to expose it in WebGPU. But right now we can't make that promise. It would help to have a forward-looking assurance that eventually the good and intuitive behaviour will be commonplace.

If you don't have some really rigorous testing, I wouldn't expect vendor drivers to implement HLSL's semantics correctly either.

Sure... bugs happen. We have testing but things slip through and with Turing machines being what they are... bugs are going to keep happening.

The traditional convergent attribute in LLVM is insufficient. It only prevents extra control dependencies from getting added to a function.

LLVM's convergent attribute is one attribute that helps solve these kinds of issues, but not the only one. DXC doesn't rely on the convergent attribute and still generally manages to not mess up Wave operations in most cases (we do have a few known bugs, but those are bugs).

For SPIR-V, the requirements around reconvergence and active invocations were never well defined. The core spec only required reconvergence at a merge node if all invocations were converged at the divergence. SPIR-V didn't allow for spontaneous divergence, but it didn't require reconvergence in enough places (nor did it prevent reconvergence in the right places). The extension formalized those missing pieces to provide a more intuitive behaviour.

I guess I'm still a little confused. Even if SPIR-V didn't require convergence in some places, obviously the spec shouldn't need to explicitly disallow optimizations that result in radical behavior changes. So if DXC generates SPIR-V in a form something like:

br <cond>, A, B
A:
  ...
  br Exit
B:
  ...
  br Exit
Exit:
  Wave/Subgroup Op

Surely it would be a bug for an implementation to duplicate the Exit block into A or B because of the presence of the wave/subgroup operation. (This is basically the first example in your blog post.)
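
The transform being described, written out in the same pseudo-IR (a sketch of what tail duplication would produce):

```
br <cond>, A, B
A:
  ...
  Wave/Subgroup Op   ; duplicated copy: executes with only A's invocations
  br Exit
B:
  ...
  Wave/Subgroup Op   ; duplicated copy: executes with only B's invocations
  br Exit
Exit:
```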

There are some remaining gaps still. For example, switches have many possible implementations so we couldn't require a single behaviour. What does HLSL say about switches where multiple selector values branch to the same case? Are all invocations converged or just those with the same selector value? Should a fall through in a switch converge with invocations that branch directly to that case?

In the presence of Wave operations, HLSL expects that control flow matches how the user wrote it. Anything other than that seems a bit insane. How do people write code that works if the compiler is allowed to just move things around, ignoring the impact on program semantics?

Here's the blog about the extension in case you're interested. It might help explain why it's necessary.

I've read the blog. It doesn't really help me understand why adding a new feature was necessary as opposed to fixing compiler bugs. Unless you're saying those aren't really compiler bugs, in which case I'm a little terrified. We have some known bugs in DXC where we move Wave operations when we shouldn't but those are bugs (microsoft/DirectXShaderCompiler#6109 & microsoft/DirectXShaderCompiler#5302), and we should fix them.

All of this is a big aside. I think for the purposes of what we need to do for DXC this doesn't seem to be a language feature. If this is how the SPIR-V and Vulkan working groups have decided to address the issue DXC should insert the required execution mode implicitly for any shader that uses Wave or Quad operations, and may whatever deity you believe in have mercy on the souls of anyone using Wave or Quad operations on a driver that doesn't support this feature.

Ok, so the intent and judgement is that HLSL is already supposed to work this way, and D3D is already supposed to work this way.

But experiments show real devices in the field don't work this way. gpuweb/gpuweb#4306 (comment)

If you have HLSL or DXIL sources that you can share (publicly or privately) for cases where D3D drivers fail unexpectedly, we would love to take a look at those and work with hardware partners to identify the issues.

Assuming those devices and drivers passed D3D qualification, then there is a gap in the D3D qualification suite.

I strongly suspect there are a great many gaps in our qualification suite, so this seems likely.

We want developers to be able to rely on the good and intuitive behaviour. We want to expose it in WebGPU. But right now we can't make that promise. It would help to have a forward-looking assurance that eventually the good and intuitive behaviour will be commonplace.

Our expectation is that the intuitive behavior is expected and should be commonplace right now for D3D. In places where the intuitive behavior isn't happening those are bugs, and we should track and address those bugs.

I opened microsoft/DirectXShaderCompiler#6222 to add the command line option.