ROCm/hipamd

Implement float/double atomicMin/Max in terms of integer atomics

FreddieWitherden opened this issue · 12 comments

Currently, HIP implements atomicMin/Max for single and double precision floating point values as CAS loops. However, in fast math scenarios, on architectures with hardware support for signed/unsigned integer atomicMin/Max a better implementation is possible. As per https://stackoverflow.com/a/72461459 for single precision:

__device__ __forceinline__ float atomicMinFloat(float* addr, float value) {
    float old;
    old = !signbit(value) ? __int_as_float(atomicMin((int*)addr, __float_as_int(value))) :
        __uint_as_float(atomicMax((unsigned int*)addr, __float_as_uint(value)));

    return old;
}

__device__ __forceinline__ float atomicMaxFloat(float* addr, float value) {
    float old;
    old = !signbit(value) ? __int_as_float(atomicMax((int*)addr, __float_as_int(value))) :
        __uint_as_float(atomicMin((unsigned int*)addr, __float_as_uint(value)));

    return old;
}

Better implementations still are possible on NVIDIA using Opportunistic Warp-level Programming wherein one first looks to see if any other active threads in the warp have the same addr, and if so first do the reduction at the warp level. This greatly cuts down the number of RMW operations which leave the core when there is contention. I suspect a similar idea can carry over to AMD GPUs.

Epliz commented

(not an AMD employee)
Your first proposal seems like a good idea for hardware that doesn't have instructions for them (some architectures do have instructions for these atomics).

The second one is maybe best left to the application developers who will know if there will be memory address conflicts?
It might be also good to consider that things seem to get complicated if the return value of the atomic function is actually used.
Eventually might be a good idea however to have a compiler optimization for the case where the address is known to be uniform, and in that case add the warp reduce first, maybe for the simple case where the return value is not used.

Epliz commented

Maybe interesting to @arsenm from a compiler perspective and who can maybe recommend someone to have a look at for the HIP implementation of the atomics?

The second one is maybe best left to the application developers who will know if there will be memory address conflicts?

This would be a reasonable position if HIP exposed the necessary primitives for application developers to implement the functionality efficiently themselves. For example, on NVIDIA hardware, one would do something along the lines of:

__device__ int atomicAggInc(int *ptr) {
    int mask = __match_any_sync(__activemask(), (unsigned long long)ptr);
    int leader = __ffs(mask) – 1;    // select a leader
    int res;
    if(lane_id() == leader)                  // leader does the update
        res = atomicAdd(ptr, __popc(mask));
    res = __shfl_sync(mask, res, leader);    // get leader’s old value
    return res + __popc(mask & ((1 << lane_id()) – 1)); //compute old value
}

Here, most of the heavy lifting is done by the SM7.0+ primitive __match_any_sync which partitions threads in the warp according to their address. This means in the case of no conflicts the overhead from aggregation is only a couple of extra instructions (insignificant compared with the latency associated with atomic operations). HIP, however, does not expose such functionality and so you're guaranteed a sub-optimal code path on recent NVIDIA hardware if you implement it yourself. Similarly, there may be a means of efficiently implementing such operations on AMD GPUs using device specific instructions which are not currently exposed through HIP.

Epliz commented

I don't know if for cuda there are really special SSA instructions for the __match_any builtins, but they can be implemented without as shown in an answer at https://stackoverflow.com/questions/59879285/whats-the-alternative-for-match-any-sync-on-compute-capability-6 .
On AMD GPUs, they could be implemented like that as well.

I don't know if for cuda there are really special SSA instructions for the __match_any builtins, but they can be implemented without as shown in an answer at https://stackoverflow.com/questions/59879285/whats-the-alternative-for-match-any-sync-on-compute-capability-6 . On AMD GPUs, they could be implemented like that as well.

It reduces to the PTX match.any.sync.b64 which, on SM70, is turned into the SASS MATCH.ANY.U64 <RA>, <RB>. No reason on a recent NVIDIA GPU that one should be emulating this with a for loop.

Tricks like turning FP atomics into integer atomics would be best done in the backend (though FP atomicrmw does not have fast math flags, so that's a bit annoying).

We do have a pass which tries to rewrite uniform atomics to do a single atomic op and a reduction in the wave. It's currently not enabled by default.

Epliz commented

Hi @FreddieWitherden ,

Would be interested in seeing what you think about results you can get with a benchmark like the following:

#include <hip/hip_runtime.h>
#include <hip/hip_runtime_api.h>

#include <iostream>
#include <chrono>

#define DIV_ROUND_UP(a, b) (((a) + (b) - 1) / (b))

inline void gpuAssert(hipError_t code, const char *file, int line, bool abort=true)
{
  if (code != hipSuccess) {
	  std::cerr<<"GPUassert: "<<hipGetErrorString(code)<<" "<<file<<" "<<line<<std::endl;
    if (abort) exit(code);
  }
}
#define hipCheckError(ans) { gpuAssert((ans), __FILE__, __LINE__); }


template<typename T>
__global__
void vecadd(T* a, T* counters, int N, int s) {
  int globalThreadId = blockIdx.x * blockDim.x + threadIdx.x;
  if (globalThreadId < N) {
    T retValue = atomicAdd(&counters[globalThreadId % s], T(1));
    a[globalThreadId] = retValue; // to force variants that return values
  }
}

template<typename T>
__global__
void vecmin(T* a, T* counters, int N, int s) {
  int globalThreadId = blockIdx.x * blockDim.x + threadIdx.x;
  if (globalThreadId < N) {
    T retValue = atomicMin(&counters[globalThreadId % s], -T(globalThreadId));
    a[globalThreadId] = retValue; // to force variants that return values
  }
}

__device__
float customAtomicMin(float* addr, float value) {
  float old;
  old = !signbit(value) ? __int_as_float(atomicMin((int*)addr, __float_as_int(value))) :
      __uint_as_float(atomicMax((unsigned int*)addr, __float_as_uint(value)));

  return old;
}

__device__
int customAtomicMin(int* addr, int value) {
  return atomicMin(addr, value);
}

template<typename T>
__global__
void vecminCustom(T* a, T* counters, int N, int s) {
  int globalThreadId = blockIdx.x * blockDim.x + threadIdx.x;
  if (globalThreadId < N) {
    T retValue = customAtomicMin(&counters[globalThreadId % s], -T(globalThreadId));
    a[globalThreadId] = retValue; // to force variants that return values
  }
}

using namespace std::chrono;

template<typename F, typename R>
void bench_it(size_t num_warmups, size_t num_repeats, F func, R report_func) {

    auto start_warmup = high_resolution_clock::now();  
    for (int r = 0; r < num_warmups; r++) {
      func();
    }
    
    hipCheckError(hipDeviceSynchronize());
    auto stop_warmup = high_resolution_clock::now();
    auto duration_warmupns = (double) duration_cast<nanoseconds>(stop_warmup - start_warmup).count();
    std::cout <<"time warmup: "<< duration_warmupns / 1000000.0 << "ms"<< std::endl;

    // measurement  
    auto start = high_resolution_clock::now();  
  
    for (int r = 0; r < num_repeats; r++) {
      func();
    }

    hipCheckError(hipDeviceSynchronize());
    auto stop = high_resolution_clock::now();
    auto durationns = duration_cast<nanoseconds>(stop - start);
    
    report_func((double) durationns.count() / num_repeats);
}

enum bench_data_type {
  FLOAT = 0,
  INT = 1
};

enum bench_op {
  OP_ADD = 0,
  OP_MIN = 1
};


template<typename T>
void run_benchmark(int N, int s, bench_op op, bool custom) {
  std::cout<<"stride: "<<s<<std::endl;
  
  if (op == OP_ADD) {
    std::cout<<"Testing add"<<std::endl;
  } else if (op == OP_MIN) {
    std::cout<<"Testing min"<<std::endl;
  }
  if (custom) {
    std::cout<<"custom approach if available"<<std::endl;
  }

  size_t size = N * sizeof(T);
  size_t size_counters = s * sizeof(T);

  T* a_device;
  hipCheckError(hipMalloc((void**) &a_device, size));
  hipCheckError(hipMemset((void*)a_device, 0, size));

  T* counters_device;
  hipCheckError(hipMalloc((void**) &counters_device, size_counters));
  hipCheckError(hipMemset((void*)counters_device, 0, size_counters));

  bench_it(
    /*warmups*/ 10,
    /*repetitions */ 100,
    /*function*/ [&]() {
      size_t threads_per_block = 64;
      size_t num_blocks = DIV_ROUND_UP(N, threads_per_block);
      dim3 block_dim(threads_per_block);
      dim3 grid_dim(num_blocks);
      
      if (op == OP_ADD) {
        vecadd<T><<<grid_dim, block_dim, 0, 0>>>(a_device, counters_device, N, s);
      } else if (op == OP_MIN) {
        if (custom) {
          vecminCustom<T><<<grid_dim, block_dim, 0, 0>>>(a_device, counters_device, N, s);
        } else {
          vecmin<T><<<grid_dim, block_dim, 0, 0>>>(a_device, counters_device, N, s);
        }
      }
      
      hipMemsetAsync((void*)counters_device, 0, size_counters, 0);
      
    },
    /*report function*/[=](double durationns){
      double bandwidth = (size) / durationns;
      std::cout <<"time: "<< durationns / 1000000.0 << "ms"<< std::endl;
      std::cout <<"bandwidth: "<< bandwidth << "GB/s"<< std::endl;
    });

  hipFree(counters_device);
  
  hipFree(a_device);
}

int main(int argc, char** argv) {

  // size, type
  if (argc < 6) {
    std::cout<<"Missing argumemts. Usage: <size> <stride> <op> <type> <custom>"<<std::endl;
    return -1;
  }
  
  size_t N = atoi(argv[1]);
  size_t s = atoi(argv[2]);
  bench_op op = (bench_op) atoi(argv[3]);
  bench_data_type type = (bench_data_type) atoi(argv[4]);
  bool custom = atoi(argv[5]);
  
  int num_devices;
  hipGetDeviceCount(&num_devices);
  
  std::cout<<"Num devices: "<<num_devices<<std::endl;
  
  for (int d = 0; d < num_devices; d++) {
    std::cout<<"Device "<<d<<std::endl;
    hipSetDevice(d);
    
    if (type == FLOAT) {
      std::cout<<"Testing float"<<std::endl;
      run_benchmark<float>(N, s, op, custom);
    } else if (type == INT) {
      std::cout<<"Testing int"<<std::endl;
      run_benchmark<int>(N, s, op, custom);
    } else {
      std::cout<<"unknown type"<<std::endl;
      return -1;
    }
  }
  return 0;
}

What I observe when running this on a gfx90c iGPU (Ryzen 5800U) is the following:

  • for the add operation, all operations need to be performed no matter what, and when a CAS loop is needed (for float) it is ridiculously slow. While when there is a dedicated instruction (for ints) it is OK. For these, a local reduction is probably a good idea in all cases as the L2 cache atomic units are the bottleneck and it would be OK to spend some instructions in the core to reduce requests to L2.
  • for the min operation, CAS can be actually better than playing the atomic int instructions trick allegedly because there is potentially much less atomic operations forwarded at the L2 cache level because the core did not issue the CAS for the threads that cannot lower the value: the CAS loop already does the local reduction to some extent!

These observations seem to correspond to the ones at section 5 of https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p0493r3.pdf : atomic adds are "read-modify-write", but atomic min/max are "read-and-conditional-store".

So for atomic min/max, it might not be better to just re-implement them with the approach in your first message: CAS might be better. But maybe if combining with the local reduction in your next message, it might be better than CAS?
I will try that next, when time permits.

EDIT: fixed code to reset counters between runs so that work happens

So for atomic min/max, it might not be better to just re-implement them with the approach in your first message: CAS might be better.

I think the issue is more subtle than that. Everything you said about a CAS loop for float min/max (avoiding the write if our local value is larger/smaller than the one we just read in from global memory) is also true for int/uint min/max. Thus, if you repeated your experiments with an array of integers I think you would find a CAS based approach outperforming HIP's native functions.

Here, there are two possibilities (i) that the HIP min/max functions for integers are implemented poorly on your platform, or (ii) that they've been optimised for a different level of contention than your test program exhibits. The first case is certainly possible given most HIP development is focused on CDNA rather than GCN and RDNA.

Epliz commented

I had a bug in my benchmark where actually no work was happening for later iterations due to the counters being changed not being reset.

After fixing that, if I run my benchmark and vary the contention by changing the stride parameter, we can see that the CAS is only better for the highest level of contention (stride == 1).
Here are some results for MI100 (CDNA1 architecture), RX6700XT (RDNA2 architecture):

  array size 4194304            
                 
  GPU stride            
Method   1 4 16 64 256 1024 2048
  MI100              
CAS   8.95GB/s 9.33GB/s 10.20GB/s 16.19GB/s 34.50GB/s 77.62GB/s 148.0GB/s
Integer atomics   5.797GB/s 22.80GB/s 27.67GB/s 87.19GB/s 166.9GB/s 187.9GB/s 187.7GB/s
  RX 6700XT              
CAS   10.52GB/s 31.63GB/s 65.37GB/s 55.45GB/s 91.57GB/s 96.86GB/s 100.0GB/s
Integer atomic   10.07GB/s 39.06GB/s 95.43GB/s 95.42GB/s 123.8GB/s 119.6GB/s 119.7GB/s
  Ryzen 5800U              
CAS   2.75GB/s 11.51GB/s 12.79GB/s 20.73GB/s 25.86GB/s 27.90GB/s 30.73GB/s
Integer atomic   7.42GB/s 23.80GB/s 28.25GB/s 40.45GB/s 41.24GB/s 40.95GB/s 35.27GB/s

So indeed, might be better to implement atomic min/max that way.

I'll note that these ideas also extend to double precision, with one catch. HIP does not provide long long versions of min/max. We get unsigned long long, but not the signed versions. Which is strange, as CUDA provides them, as do AMD themselves under OpenCL.

Epliz commented

Given that it is available in CUDA, if you create a ticket just for this, they will probably fix it.

Epliz commented

@FreddieWitherden , if you are interested, here is an example implementation of __match_any that should correspond to what you need to implement the atomicAddAgg:

// the compiler miscompiles without this in many cases as it doesn't understand
// when it can or cannot hoist calls
// cf. https://github.com/llvm/llvm-project/issues/62477
static inline __device__ int  __optimization_barrier_hack(int in_val)
{
    int out_val;
    __asm__ volatile ("" : "=v"(out_val) : "0"(in_val));
    return out_val;
}

static inline __device__ uint64_t __match_any(int value) {
  value = __optimization_barrier_hack(value);
  bool active = true;
  uint64_t result_mask = 0;

  while (active) {
    // determine what threads have the same value as the currently first active thread
    int first_active_value = __builtin_amdgcn_readfirstlane(value);
    int predicate = (value == first_active_value);
    uint64_t mask = __ballot(predicate);

    // if the current thread has the same value, set its result mask to the current one
    if (predicate) {
      result_mask |= mask;
      active = false;
    }
  }

  return result_mask;
}

Such intrinsic corresponds to "WaveMatch" in HLSL shader model 6.5 (https://microsoft.github.io/DirectX-Specs/d3d/HLSL_ShaderModel6_5.html#wavematch-function) supported by both Nvidia and AMD GPUs (Vega+ as far as I can tell).
The implementation above is what I could reconstruct from looking at some compiled HLSL code using WaveMatch.

I think all the warp intrinsics supported on Nvidia GPUs could be supported as well on AMD GPUs. It is quite surprising to me that they were not all implemented yet.

Best regards,
Epliz