ProjectPhysX/OpenCL-Wrapper

Kernel Dimensions?

PJaramilloV opened this issue · 1 comments

I love what you did with this wrapper; it makes OpenCL's syntax so approachable. However, I'm struggling to understand how to set up the Memory and Kernel objects as in plain OpenCL, with global and local sizes given as {x, y} for 2D data or {x, y, z} for 3D.

I know we can work around this by operating on the 1D index from get_global_id(0). I've also noticed that there are read and write methods like write_to_device_2d(), but I can't figure out how to use them.

I would appreciate any clarification or help, many thanks again for this wrapper!

Hi @PJaramilloV,

it's best to use a 1D range in OpenCL even for 2D/3D data. Use linear indexing then:

  • 2D
    • kernel global range is Nx*Ny
    • uint n = x+y*Nx;
    • uint x = n%Nx;
      uint y = n/Nx;
  • 3D
    • kernel global range is Nx*Ny*Nz
    • uint n = x+(y+z*Ny)*Nx;
    • uint x = (n%(Nx*Ny))%Nx;
      uint y = (n%(Nx*Ny))/Nx;
      uint z = n/(Nx*Ny);

With this you can go from the linear index n in the OpenCL kernel to x/y/(z) coordinates and be sure to have coalesced memory access. Note: keep in mind the maximum value of uint, 4294967295u; if you have a larger global range, use ulong instead for n.
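As a host-side sanity check of the formulas above, here is a minimal C++ sketch of the round trip between coordinates and the linear index. The struct and function names are hypothetical and not part of the wrapper; a 64-bit index is used throughout so the 3D case stays valid when Nx*Ny*Nz exceeds the uint range.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper types for illustration only, not part of the wrapper.
struct Coord2 { uint32_t x, y; };
struct Coord3 { uint32_t x, y, z; };

// 2D: n = x + y*Nx, with x running fastest in memory.
inline uint64_t linear_index_2d(uint32_t x, uint32_t y, uint32_t Nx) {
    return (uint64_t)x + (uint64_t)y * Nx;
}

// 2D inverse: x = n % Nx, y = n / Nx.
inline Coord2 coordinates_2d(uint64_t n, uint32_t Nx) {
    assert(Nx > 0u);
    return { (uint32_t)(n % Nx), (uint32_t)(n / Nx) };
}

// 3D: n = x + (y + z*Ny)*Nx.
inline uint64_t linear_index_3d(uint32_t x, uint32_t y, uint32_t z, uint32_t Nx, uint32_t Ny) {
    return (uint64_t)x + ((uint64_t)y + (uint64_t)z * Ny) * Nx;
}

// 3D inverse: split off the z slice first, then treat the remainder as 2D.
inline Coord3 coordinates_3d(uint64_t n, uint32_t Nx, uint32_t Ny) {
    assert(Nx > 0u && Ny > 0u);
    const uint64_t slice = (uint64_t)Nx * Ny; // number of elements per z slice
    const uint64_t r = n % slice;             // index within the slice
    return { (uint32_t)(r % Nx), (uint32_t)(r / Nx), (uint32_t)(n / slice) };
}
```

The same arithmetic carries over to OpenCL C, with uint or ulong in place of the fixed-width types.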


To pass the 2D/3D sizes Nx/Ny/(Nz) to the OpenCL kernel, you can embed them as macro constants:

const uint Nx=100u, Ny=100u, Nz=100u;
const string defines =
	"#define Nx "+to_string(Nx)+"\n"
	"#define Ny "+to_string(Ny)+"\n"
	"#define Nz "+to_string(Nz)+"\n"
;
Device device(select_device_with_most_flops(), defines+get_opencl_c_code());
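If the list of constants grows, the string assembly can be moved into a small helper. The following is only a sketch: define_line and make_defines are hypothetical names, not wrapper functions, and the output format simply mirrors the snippet above.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper, not part of the wrapper: builds one "#define NAME VALUE" line.
inline std::string define_line(const std::string& name, uint32_t value) {
    return "#define " + name + " " + std::to_string(value) + "\n";
}

// Hypothetical helper: builds the block of defines that is prepended to the OpenCL C code.
inline std::string make_defines(uint32_t Nx, uint32_t Ny, uint32_t Nz) {
    return define_line("Nx", Nx) + define_line("Ny", Ny) + define_line("Nz", Nz);
}
```

The returned string would then take the place of defines in the Device constructor call.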

and in the OpenCL code you can then use them anywhere, for example:

kernel void add_kernel(global float* A, global float* B, global float* C) { // equivalent to "for(uint n=0u; n<N; n++) {", but executed in parallel
	const uint n = get_global_id(0);
	C[n] = A[n]*(float)Nx+B[n]+(float)Ny;
}

Alternatively, if Nx/Ny/(Nz) are not constants and should change at runtime, pass them as Kernel parameters:

Kernel add_kernel(device, N, "add_kernel", A, B, C, Nx, Ny, Nz);

To modify at runtime:

add_kernel.set_parameters(3u, Nx, Ny, Nz); // kernel parameter positions: A at 0, B at 1, C at 2, Nx at 3 --> starting position to set Nx, Ny, Nz is 3

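And add them in OpenCL as additional kernel parameters (in that case, drop the corresponding #define lines, since the preprocessor would otherwise replace the parameter names):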

kernel void add_kernel(global float* A, global float* B, global float* C, const uint Nx, const uint Ny, const uint Nz) { // equivalent to "for(uint n=0u; n<N; n++) {", but executed in parallel
	const uint n = get_global_id(0);
	C[n] = A[n]*(float)Nx+B[n]+(float)Ny;
}

The write_to_device_2d() etc. functions are there if you have a 1D array formatted in 2D through linear indexing, and you want to copy not the entire thing from CPU to GPU but only a small clipped 2D region. For copying the entire thing, write_to_device() is much faster.
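To picture what "a small clipped 2D region" of a linearly indexed array means, here is a plain C++ sketch that copies only a sub-rectangle between two host arrays. It does not call the wrapper: copy_region_2d is a hypothetical helper, and the half-open ranges [x0, x1) and [y0, y1) are an assumption of this sketch, so the actual parameters of write_to_device_2d() may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper, not part of the wrapper: copies columns [x0, x1) of rows [y0, y1)
// from src to dst. Both arrays hold Nx*Ny floats, laid out linearly as n = x + y*Nx.
inline void copy_region_2d(const std::vector<float>& src, std::vector<float>& dst,
                           std::size_t Nx, std::size_t Ny,
                           std::size_t x0, std::size_t x1,
                           std::size_t y0, std::size_t y1) {
    assert(src.size() == Nx * Ny && dst.size() == Nx * Ny);
    assert(x0 <= x1 && x1 <= Nx && y0 <= y1 && y1 <= Ny);
    if (x0 == x1) return; // empty region, nothing to copy
    for (std::size_t y = y0; y < y1; y++) {
        const std::size_t n = x0 + y * Nx; // linear index of the first element in this row
        // Only the elements within one row are contiguous, so the copy goes row by row.
        std::memcpy(dst.data() + n, src.data() + n, (x1 - x0) * sizeof(float));
    }
}
```

Because the region is contiguous only within each row, the copy breaks up into one piece per row, whereas copying the whole array is a single contiguous block.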

Kind regards,
Moritz