ddemidov/vexcl

AMD SI Cards weird results

skn123 opened this issue · 20 comments

A bug with AMD SI cards was reported in this thread
https://community.amd.com/message/2869393#comment-2869393

A solution was also proposed on that thread. If I were to implement that solution within vexcl, how would I go about doing that? Essentially, with every kernel, a null kernel would need to be passed. It can safely be ignored with an #ifdef declaration for other cards. I can show the results of an example that illustrates this case:

naths@naths-HP-Laptop-15-bs1xx:~/build/vexcl/examples$ ./complex_simple

  1. Hainan (AMD Accelerated Parallel Processing)

X * Y = (0,16) * (16,0) = (-5.62355e+303,7.1998e-304)i
X * Y = (1,15) * (15,1) = (1.76736e+186,-2.529e-186)i
X * Y = (2,14) * (14,2) = (-5.43986e-256,-7.4239e-199)i
X * Y = (3,13) * (13,3) = (-1.15801e-125,-1.42689e-184)i
X * Y = (4,12) * (12,4) = (-1.5799e-103,-2.30511e+235)i
X * Y = (5,11) * (11,5) = (2.83923e+103,-2.783e+307)i
X * Y = (6,10) * (10,6) = (-4.95723e+305,1.35145e+188)i
X * Y = (7,9) * (9,7) = (-2.73677e-48,-0.00275755)i
X * Y = (8,8) * (8,8) = (-2.26843e-106,-1.45955e-201)i
X * Y = (9,7) * (7,9) = (1.79762e+106,2.97045e+201)i
X * Y = (10,6) * (6,10) = (-nan,2.14326e-308)i
X * Y = (11,5) * (5,11) = (-8.49166e-200,5.26253e+199)i
X * Y = (12,4) * (4,12) = (-3.88897e+306,2.78449e+188)i
X * Y = (13,3) * (3,13) = (1.69952e+184,1.77589e-234)i
X * Y = (14,2) * (2,14) = (-1.94762e-104,-4.35661e+232)i
X * Y = (15,1) * (1,15) = (-1.82884e-128,1.00333e-232)i
X / Y = (0,16) / (16,0) = (0,256)
X / Y = (1,15) / (15,1) = (0,226)
X / Y = (2,14) / (14,2) = (0,200)
X / Y = (3,13) / (13,3) = (0,178)
X / Y = (4,12) / (12,4) = (0,160)
X / Y = (5,11) / (11,5) = (0,146)
X / Y = (6,10) / (10,6) = (0,136)
X / Y = (7,9) / (9,7) = (0,130)
X / Y = (8,8) / (8,8) = (0,128)
X / Y = (9,7) / (7,9) = (0,130)
X / Y = (10,6) / (6,10) = (0,136)
X / Y = (11,5) / (5,11) = (0,146)
X / Y = (12,4) / (4,12) = (0,160)
X / Y = (13,3) / (3,13) = (0,178)
X / Y = (14,2) / (2,14) = (0,200)
X / Y = (15,1) / (1,15) = (0,226)

If I run it a second time:
naths@naths-HP-Laptop-15-bs1xx:~/build/vexcl/examples$ ./complex_simple

  1. Hainan (AMD Accelerated Parallel Processing)

X * Y = (0,16) * (16,0) = (0,1)i
X * Y = (1,15) * (15,1) = (0.132743,0.99115)i
X * Y = (2,14) * (14,2) = (0.28,0.96)i
X * Y = (3,13) * (13,3) = (0.438202,0.898876)i
X * Y = (4,12) * (12,4) = (0.6,0.8)i
X * Y = (5,11) * (11,5) = (0.753425,0.657534)i
X * Y = (6,10) * (10,6) = (0.882353,0.470588)i
X * Y = (7,9) * (9,7) = (0.969231,0.246154)i
X * Y = (8,8) * (8,8) = (1,0)i
X * Y = (9,7) * (7,9) = (0.969231,-0.246154)i
X * Y = (10,6) * (6,10) = (0.882353,-0.470588)i
X * Y = (11,5) * (5,11) = (0.753425,-0.657534)i
X * Y = (12,4) * (4,12) = (0.6,-0.8)i
X * Y = (13,3) * (3,13) = (0.438202,-0.898876)i
X * Y = (14,2) * (2,14) = (0.28,-0.96)i
X * Y = (15,1) * (1,15) = (0.132743,-0.99115)i
X / Y = (0,16) / (16,0) = (0,256)
X / Y = (1,15) / (15,1) = (0,226)
X / Y = (2,14) / (14,2) = (0,200)
X / Y = (3,13) / (13,3) = (0,178)
X / Y = (4,12) / (12,4) = (0,160)
X / Y = (5,11) / (11,5) = (0,146)
X / Y = (6,10) / (10,6) = (0,136)
X / Y = (7,9) / (9,7) = (0,130)
X / Y = (8,8) / (8,8) = (0,128)
X / Y = (9,7) / (7,9) = (0,130)
X / Y = (10,6) / (6,10) = (0,136)
X / Y = (11,5) / (5,11) = (0,146)
X / Y = (12,4) / (4,12) = (0,160)
X / Y = (13,3) / (3,13) = (0,178)
X / Y = (14,2) / (2,14) = (0,200)
X / Y = (15,1) / (1,15) = (0,226)

The results look fine now.
The output of clinfo is:
naths@naths-HP-Laptop-15-bs1xx:~/build/vexcl/examples$ clinfo
Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.1 AMD-APP (2639.3)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices

Platform Name: AMD Accelerated Parallel Processing
Number of devices: 1
Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1002h
Board name: AMD Radeon (TM) R5 M330
Device Topology: PCI[ B#1, D#0, F#0 ]
Max compute units: 5
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 1024
Max work group size: 256
Preferred vector width char: 4
Preferred vector width short: 2
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 4
Native vector width short: 2
Native vector width int: 1
Native vector width long: 1
Native vector width float: 1
Native vector width double: 1
Max clock frequency: 750Mhz
Address bits: 64
Max memory allocation: 1596905472
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 8
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 1024
Alignment (bits) of base address: 2048
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: No
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 2146349056
Constant buffer size: 65536
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 32768
Max pipe arguments: 0
Max pipe active reservations: 0
Max pipe packet size: 0
Max global variable size: 0
Max global variable preferred total size: 0
Max read/write image args: 0
Max on device events: 0
Queue on device max size: 0
Max on device queues: 0
Queue on device preferred size: 0
SVM capabilities:
Coarse grain buffer: No
Fine grain buffer: No
Fine grain system: No
Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 0
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: No
Profiling : Yes
Queue on Device properties:
Out-of-Order: No
Profiling : No
Platform ID: 0x7f5a050f49f0
Name: Hainan
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 1.2
Driver version: 2639.3
Profile: FULL_PROFILE
Version: OpenCL 1.2 AMD-APP (2639.3)
Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_image2d_from_buffer cl_khr_spir cl_khr_gl_event

Can you please clarify what exactly the workaround is here? Is it enough to just declare a dummy kernel before or after the source of the real one? Or should the dummy kernel be called after the real one?

The workaround is given here (lines 96-98):
https://github.com/rdemaria/sixtracklib_gsoc18/blob/master/studies/study0/hello_workaround.cpp

queue.enqueueNDRangeKernel( null, cl::NullRange, 1, cl::NullRange); // null kernel seems to solve the issue

I think this is done after the main ones.
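For context, the pattern boils down to something like the following minimal sketch (plain OpenCL C++ wrapper API; the kernel and variable names are purely illustrative, and the empty kernel is built in the same context as the real one):

// Build a do-nothing kernel once, in the same context and for the same
// devices as the real kernels.
cl::Program nullProg(context, "kernel void null_kernel() {}", true);
cl::Kernel  nullKernel(nullProg, "null_kernel");

// ... enqueue the real kernel(s) on 'queue' as usual, then follow up with
// the empty kernel:
queue.enqueueNDRangeKernel(nullKernel, cl::NullRange, 1, cl::NullRange);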

Ok, you can try to add the workaround after line 130 here:

/// Enqueue the kernel to the specified command queue.
void operator()(const cl::CommandQueue &q) {
q.enqueueNDRangeKernel(K, cl::NullRange, g_size, w_size);
argpos = 0;
}

If this works for you, you'll probably need to guard it with a runtime condition (because you can have several cards in a vexcl context, and only some of those might need the workaround). You can use the command queue object to check if the workaround is necessary, something along these lines (you can use the get_device() function to get the compute device associated with the queue):

if (vex::Filter::Platform("Portable Computing Language")(ctx.device(0)))
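A rough sketch of what such a guard could look like inside operator() (the name-based check and the cached dummy kernel are hypothetical here, purely to illustrate the idea):

/// Enqueue the kernel to the specified command queue.
void operator()(const cl::CommandQueue &q) {
    q.enqueueNDRangeKernel(K, cl::NullRange, g_size, w_size);

    // Only apply the workaround on devices that need it (name check is a guess).
    cl::Device dev = q.getInfo<CL_QUEUE_DEVICE>();
    if (vex::Filter::Name("Hainan")(dev)) {
        // 'dummy' would be an empty kernel built once for the same context as K.
        q.enqueueNDRangeKernel(dummy, cl::NullRange, 1, cl::NullRange);
    }

    argpos = 0;
}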

I tried this:

/// Enqueue the kernel to the specified command queue.
void operator()(const cl::CommandQueue &q) {
    q.enqueueNDRangeKernel(K, cl::NullRange, g_size, w_size);

    // Pick platform
    std::vector<cl::Platform> platformList;
    cl::Platform::get(&platformList);

    // Source of the null kernel
    static const char sourceNull[] =
        "#if defined(cl_khr_fp64)\n"
        "#  pragma OPENCL EXTENSION cl_khr_fp64: enable\n"
        "#elif defined(cl_amd_fp64)\n"
        "#  pragma OPENCL EXTENSION cl_amd_fp64: enable\n"
        "#else\n"
        "#  error double precision is not supported\n"
        "#endif\n"
        "kernel void null(\n"
        "       )\n"
        "{\n"
        "}\n"
        ;

    // Create a context on the first platform
    cl_context_properties cprops[] = {
        CL_CONTEXT_PLATFORM, (cl_context_properties)(platformList[0])(), 0};
    cl::Context context(CL_DEVICE_TYPE_GPU, cprops);

    // Query the set of devices attached to the context
    std::vector<cl::Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();

    // Create program from source
    cl::Program::Sources sources(1, std::make_pair(sourceNull, 0));
    cl::Program program(context, sources);

    // Build program
    program.build(devices);

    // Create kernel object and enqueue it after the real kernel
    cl::Kernel kNull(program, "null");
    q.enqueueNDRangeKernel(kNull, cl::NullRange, g_size, w_size);
    argpos = 0;
}

This is the only way I can insert that null kernel here, given the limitations of the current API.
If I do this, then when I run the program I get this:

naths@naths-HP-Laptop-15-bs1xx:~/build/vexcl/examples$ ./complex_simple 
1. Hainan (AMD Accelerated Parallel Processing)

terminate called after throwing an instance of 'cl::Error'
  what():  clEnqueueNDRangeKernel
Aborted (core dumped)

Am I doing something wrong here?
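(For reference, a likely cause: the dummy kernel above is built in a freshly created cl::Context, while q belongs to the context vexcl already set up, and enqueuing a kernel on a queue from a different context makes clEnqueueNDRangeKernel fail. A minimal sketch of a context-matched variant, assuming the standard C++ wrapper API and one device per queue:)

/// Enqueue the kernel to the specified command queue.
void operator()(const cl::CommandQueue &q) {
    q.enqueueNDRangeKernel(K, cl::NullRange, g_size, w_size);

    // Build the empty kernel in the queue's own context and for its own
    // device (in real code this should be built once and cached).
    cl::Context ctx = q.getInfo<CL_QUEUE_CONTEXT>();
    cl::Device  dev = q.getInfo<CL_QUEUE_DEVICE>();

    cl::Program program(ctx, "kernel void null() {}");
    program.build(std::vector<cl::Device>(1, dev));

    cl::Kernel kNull(program, "null");
    q.enqueueNDRangeKernel(kNull, cl::NullRange, 1, cl::NullRange);

    argpos = 0;
}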

Something like 01550e2 should be enough to see if the workaround actually helps. Can you check that? Then we can think about limiting it to AMD SI GPUs.

And bingo!

  1. Hainan (AMD Accelerated Parallel Processing)

X * Y = (0,16) * (16,0) = (0,256)i
X * Y = (1,15) * (15,1) = (0,226)i
X * Y = (2,14) * (14,2) = (0,200)i
X * Y = (3,13) * (13,3) = (0,178)i
X * Y = (4,12) * (12,4) = (0,160)i
X * Y = (5,11) * (11,5) = (0,146)i
X * Y = (6,10) * (10,6) = (0,136)i
X * Y = (7,9) * (9,7) = (0,130)i
X * Y = (8,8) * (8,8) = (0,128)i
X * Y = (9,7) * (7,9) = (0,130)i
X * Y = (10,6) * (6,10) = (0,136)i
X * Y = (11,5) * (5,11) = (0,146)i
X * Y = (12,4) * (4,12) = (0,160)i
X * Y = (13,3) * (3,13) = (0,178)i
X * Y = (14,2) * (2,14) = (0,200)i
X * Y = (15,1) * (1,15) = (0,226)i
X / Y = (0,16) / (16,0) = (0,1)
X / Y = (1,15) / (15,1) = (0.132743,0.99115)
X / Y = (2,14) / (14,2) = (0.28,0.96)
X / Y = (3,13) / (13,3) = (0.438202,0.898876)
X / Y = (4,12) / (12,4) = (0.6,0.8)
X / Y = (5,11) / (11,5) = (0.753425,0.657534)
X / Y = (6,10) / (10,6) = (0.882353,0.470588)
X / Y = (7,9) / (9,7) = (0.969231,0.246154)
X / Y = (8,8) / (8,8) = (1,0)
X / Y = (9,7) / (7,9) = (0.969231,-0.246154)
X / Y = (10,6) / (6,10) = (0.882353,-0.470588)
X / Y = (11,5) / (5,11) = (0.753425,-0.657534)
X / Y = (12,4) / (4,12) = (0.6,-0.8)
X / Y = (13,3) / (3,13) = (0.438202,-0.898876)
X / Y = (14,2) / (2,14) = (0.28,-0.96)
X / Y = (15,1) / (1,15) = (0.132743,-0.99115)

So, how do you check if the workaround is required? Is 'Hainan' as the GPU name enough?

Actually, yes. I would suggest putting it within an #ifdef statement. By default it is set to false in CMake and it only gets enabled if the user wants it. Very localized. Hainan is one such name. The actual name as given by Ubuntu is:
01:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Sun XT [Radeon HD 8670A/8670M/8690M / R5 M330] (rev 83)

Would you be able to implement it?

I can implement the CMake part and send you a PR. Another topic: can such a workaround also be implemented in ViennaCL, given that we have a lot of examples in vexcl too?

Unfortunately, the WIP does not show up in the forked code. I can only see it when I pull it from master and do a reset. I think the changes are very minimal:
#ifdef AMD_SI_WORKAROUND
....
#endif

and in CMake a cache variable that prompts the user to enable it if their card is an SI GPU (plus a reference to this issue) would suffice.
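Concretely, the guarded enqueue on the C++ side might look something like this (a sketch only; the macro name follows the one mentioned later in this thread, and the dummy kernel object is illustrative):

q.enqueueNDRangeKernel(K, cl::NullRange, g_size, w_size);

#ifdef VEXCL_AMD_SI_WORKAROUND
// Empty kernel submitted after every real launch; compiled away entirely
// unless the user opts in at configure time.
q.enqueueNDRangeKernel(dummy_kernel, cl::NullRange, 1, cl::NullRange);
#endif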

Please see if #255 works for you. Re ViennaCL, you would have to open a separate issue there. I am not sure what the best way to implement a similar workaround in ViennaCL would be.

I have posted a comment there. If you must go with naming, then there are other SI card names as well
https://wiki.gentoo.org/wiki/AMDGPU
that should be included. Otherwise, you can drop the name check and direct the user to that web page to check whether their card is a Southern Islands card. Apart from that, the workaround is fine.
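If the check stays name-based, it could enumerate the Southern Islands code names from that page; a sketch (the helper is hypothetical, and the exact strings reported in CL_DEVICE_NAME may vary by driver):

#include <string>
#include <CL/cl.hpp>

// Hypothetical helper: true if the device looks like a Southern Islands GPU.
inline bool is_southern_islands(const cl::Device &d) {
    static const char *si_names[] = {
        "Tahiti", "Pitcairn", "Capeverde", "Oland", "Hainan"
    };
    const std::string name = d.getInfo<CL_DEVICE_NAME>();
    for (const char *n : si_names)
        if (name.find(n) != std::string::npos) return true;
    return false;
}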

There is another comment that I have. Suppose one wants to enqueue multiple OpenCL kernels; then the original code is the correct way to do it. The workaround will fail there, as we would be creating a null kernel after every kernel. I think a new function would need to be defined (and the API for the current code changed) to call the null kernel only after all other kernels have been passed to the device. Of course, the VEXCL_AMD_SI_WORKAROUND guard would be present everywhere.

I don't see how the workaround can alter the outcome of any computation. It just puts an empty kernel after each normal one into the command queue. Also, you can just disable the workaround and submit an empty kernel yourself at any time you like.

Precisely. We would need something in the API to enable/disable placing the empty kernel. Currently it is enabled by default.

You either

  1. enable the workaround, which means the empty kernel is submitted after each operation, or
  2. disable the workaround, create an empty kernel yourself (using the vexcl custom kernel API), and submit it whenever you feel you need to.

I am talking about 2. Currently, the only way I can enable the workaround is via an #ifdef. A cleaner way to implement this would be to add an auxiliary function that adds the empty kernel (again protected by the #ifdef). Then whatever you say holds true and it does not affect anyone who does not need the workaround.

Currently adding an empty kernel is as simple as

vex::backend::kernel dummy(ctx.queue(0), "kernel void dummy() {}", "dummy");

and then you can submit it with

dummy(ctx.queue(0));

I would say it's easy enough to not invent a special API for this.
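Pulling those two lines into a complete usage sketch (the surrounding vexcl setup is illustrative):

#include <vexcl/vexcl.hpp>

int main() {
    vex::Context ctx(vex::Filter::GPU && vex::Filter::DoublePrecision);

    vex::vector<double> x(ctx, 1024), y(ctx, 1024);
    x = 1.0;
    y = 2 * x; // normal vexcl operation

    // Empty kernel built through the custom kernel API, submitted manually
    // whenever the Southern Islands workaround is needed.
    vex::backend::kernel dummy(ctx.queue(0), "kernel void dummy() {}", "dummy");
    dummy(ctx.queue(0));
}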

Great... that should help :) I think this can be closed then.