Kernel Dimensions?
PJaramilloV opened this issue · 1 comments
I love what you did here in this wrapper; it makes OpenCL's syntax so approachable. However, I find myself struggling to understand how to set up the Memory and Kernel as in plain OpenCL, with global and local sizes given as {x, x} to interpret 2D data or {y, y, y} for 3D.
I know we can work around this by operating over the 1D index from get_global_id(0). I've also noticed that there are read and write methods like write_to_device_2d(), but I can't understand how to use them.
I would appreciate any clarification or help, many thanks again for this wrapper!
Hi @PJaramilloV,
it's best to use a 1D range in OpenCL even for 2D/3D data. Use linear indexing then:
- 2D
  - kernel global range is `Nx*Ny`
  - `uint n = x+y*Nx;`
  - `uint x = n%Nx; uint y = n/Nx;`
- 3D
  - kernel global range is `Nx*Ny*Nz`
  - `uint n = x+(y+z*Ny)*Nx;`
  - `uint x = (n%(Nx*Ny))%Nx; uint y = (n%(Nx*Ny))/Nx; uint z = n/(Nx*Ny);`
With this you can go from the linear index `n` in the OpenCL kernel to `x`/`y`/(`z`) coordinates and be sure to have coalesced memory access. Note: keep in mind the maximum value of `uint`, `4294967295u`; if you have a larger global range, use `ulong` instead for `n`.
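The two mappings can be checked on the CPU before they go into a kernel. The following is a minimal sketch in plain C++, not part of the wrapper: `Coord3`, `linear_index`, `coordinates` and `round_trip_ok` are hypothetical names chosen for illustration. It uses 64-bit indices throughout, which corresponds to switching from `uint` to `ulong` in the kernel once `Nx*Ny*Nz` exceeds `4294967295u`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type for illustration; not part of the wrapper.
struct Coord3 { uint32_t x, y, z; };

// coordinates -> linear index: n = x+(y+z*Ny)*Nx, computed in 64 bit to avoid overflow
inline uint64_t linear_index(const Coord3 c, const uint32_t Nx, const uint32_t Ny) {
    return (uint64_t)c.x + ((uint64_t)c.y + (uint64_t)c.z * (uint64_t)Ny) * (uint64_t)Nx;
}

// linear index -> coordinates: inverse of linear_index()
inline Coord3 coordinates(const uint64_t n, const uint32_t Nx, const uint32_t Ny) {
    assert(Nx > 0u && Ny > 0u); // guards the divisions below
    const uint64_t NxNy = (uint64_t)Nx * (uint64_t)Ny;
    const uint64_t t = n % NxNy; // linear index within one xy-plane
    return Coord3{ (uint32_t)(t % Nx), (uint32_t)(t / Nx), (uint32_t)(n / NxNy) };
}

// emulates the 1D kernel range with a loop and checks that both mappings are inverse to each other
inline bool round_trip_ok(const uint32_t Nx, const uint32_t Ny, const uint32_t Nz) {
    const uint64_t N = (uint64_t)Nx * (uint64_t)Ny * (uint64_t)Nz;
    for(uint64_t n = 0ull; n < N; n++) {
        if(linear_index(coordinates(n, Nx, Ny), Nx, Ny) != n) return false;
    }
    return true;
}
```

For 2D data, the same functions apply with `Nz = 1`, in which case `z` is always 0. Since `x` is the fastest-varying coordinate, neighboring values of `n` map to neighboring `x` positions in the same row.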
To pass the 2D/3D sizes `Nx`/`Ny`/(`Nz`) to the OpenCL kernel, you can embed them as macro constants:
const uint Nx=100u, Ny=100u, Nz=100u;
const string defines =
"#define Nx "+to_string(Nx)+"\n"
"#define Ny "+to_string(Ny)+"\n"
"#define Nz "+to_string(Nz)+"\n"
;
Device device(select_device_with_most_flops(), defines+get_opencl_c_code());
and in the OpenCL code you can then use them anywhere, for example:
kernel void add_kernel(global float* A, global float* B, global float* C) { // equivalent to "for(uint n=0u; n<N; n++) {", but executed in parallel
const uint n = get_global_id(0);
C[n] = A[n]*(float)Nx+B[n]+(float)Ny;
}
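To see what the OpenCL compiler actually receives with this approach, here is a minimal sketch in plain C++ that only assembles the source string. The helpers `grid_defines`, `example_kernel_source` and `build_source`, as well as the kernel `coordinates_kernel`, are hypothetical and exist only for this illustration. The range guard in the kernel is a defensive assumption for the case where the launched global range is padded beyond `Nx*Ny*Nz`.

```cpp
#include <cassert>
#include <string>
using std::string;
using std::to_string;

// Hypothetical helper, not part of the wrapper: builds the "#define" prefix for the grid sizes.
inline string grid_defines(const unsigned int Nx, const unsigned int Ny, const unsigned int Nz) {
    assert(Nx > 0u && Ny > 0u && Nz > 0u); // zero sizes would lead to division by zero in the kernel
    return
        "#define Nx "+to_string(Nx)+"\n"
        "#define Ny "+to_string(Ny)+"\n"
        "#define Nz "+to_string(Nz)+"\n"
    ;
}

// Hypothetical OpenCL C kernel that uses the macros to recover x/y/z from the linear index.
inline string example_kernel_source() {
    return R"(
kernel void coordinates_kernel(global float* A) {
    const uint n = get_global_id(0); // linear index in the 1D global range
    if(n>=Nx*Ny*Nz) return; // defensive guard in case the global range is padded
    const uint x = n%Nx, y = (n%(Nx*Ny))/Nx, z = n/(Nx*Ny);
    A[n] = (float)(x+y+z);
}
)";
}

// The macros must come first, so that they are defined wherever the kernel code uses them.
inline string build_source(const unsigned int Nx, const unsigned int Ny, const unsigned int Nz) {
    return grid_defines(Nx, Ny, Nz)+example_kernel_source();
}
```

The string returned by `build_source(100u, 100u, 100u)` plays the role of `defines+get_opencl_c_code()` in the `Device` constructor call above. Because the sizes are baked into the source text, changing them means compiling the OpenCL code again.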
Alternatively, if `Nx`/`Ny`/(`Nz`) are not constants and should change at runtime, pass them as Kernel parameters:
Kernel add_kernel(device, N, "add_kernel", A, B, C, Nx, Ny, Nz);
To modify at runtime:
add_kernel.set_parameters(3u, Nx, Ny, Nz); // kernel parameter positions: A at 0, B at 1, C at 2, Nx at 3 --> starting position to set Nx, Ny, Nz is 3
Then add them in the OpenCL code as additional kernel parameters:
kernel void add_kernel(global float* A, global float* B, global float* C, const uint Nx, const uint Ny, const uint Nz) { // equivalent to "for(uint n=0u; n<N; n++) {", but executed in parallel
const uint n = get_global_id(0);
C[n] = A[n]*(float)Nx+B[n]+(float)Ny;
}
The `write_to_device_2d()` etc. functions are there if you have a 1D array formatted in 2D through linear indexing, and you want to copy not the entire thing from CPU to GPU but only a small clipped 2D region. For copying the entire thing, `write_to_device()` is much faster.
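To illustrate what a clipped 2D region means for a linearly indexed array, here is a minimal CPU-only sketch in plain C++. It is neither the wrapper's implementation nor its signature: `copy_region_2d` and its half-open bounds `[x0, x1)` and `[y0, y1)` are assumptions made for this example. It only shows which elements such a region covers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration: copies the region [x0, x1) x [y0, y1) of a linearly indexed
// Nx*Ny array from src to the same positions in dst; elements outside the region stay untouched.
inline void copy_region_2d(const std::vector<float>& src, std::vector<float>& dst,
                           const size_t Nx, const size_t Ny,
                           const size_t x0, const size_t x1,
                           const size_t y0, const size_t y1) {
    assert(src.size() == Nx*Ny && dst.size() == Nx*Ny);
    assert(x0 <= x1 && x1 <= Nx && y0 <= y1 && y1 <= Ny);
    const size_t width = x1 - x0; // number of elements per row of the region
    for(size_t y = y0; y < y1; y++) {
        const size_t n = x0 + y*Nx; // linear index of the first region element in row y
        // within a row the region is contiguous, but consecutive rows start Nx elements apart
        std::copy(src.data()+n, src.data()+n+width, dst.data()+n);
    }
}
```

In this sketch the region breaks down into `y1-y0` separate row segments, whereas the whole array is a single contiguous block of `Nx*Ny` elements.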
Kind regards,
Moritz