kenba/opencl3

Include example showing power of set_local_work_size/set_local_work_sizes/set_global_work_size/set_global_work_sizes

Issue closed · 1 comment

I'm trying to better understand how to "benchmark" a GPU.

For example, from a modified clinfo.rs example:

    println!("\tCL_DEVICE_MAX_WORK_ITEM_SIZES = {:?}", device.max_work_item_sizes());
    println!("\tCL_DEVICE_MAX_WORK_GROUP_SIZE = {:?}", device.max_work_group_size());
    println!("\tCL_DEVICE_MAX_COMPUTE_UNITS = {:?}", device.max_compute_units());
    println!("\tCL_DEVICE_MAX_WORK_ITEM_DIMENSIONS = {:?}", device.max_work_item_dimensions());

which prints:

    CL_DEVICE_MAX_WORK_ITEM_SIZES = Ok([256, 256, 256])
    CL_DEVICE_MAX_WORK_GROUP_SIZE = Ok(256)
    CL_DEVICE_MAX_COMPUTE_UNITS = Ok(10)
    CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS = Ok(3)

Ok, how do I now turn around and spawn "as many kernel events as possible for maximum parallelization + performance"?

    // execute
    let start = Instant::now();
    let kernel_event = unsafe {
        ExecuteKernel::new(&kernel)
            .set_arg(&device_input_buffer)
            .set_arg(&device_input_size_buffer)
            .set_arg(&device_output_buffer)
            .set_global_work_size(2560) // TODO: set_global_work_size or set_global_work_sizes?
            .set_local_work_size(256) // TODO: set_local_work_size or set_local_work_sizes?
            .enqueue_nd_range(&command_queue)?
    };
    kernel_event.wait()?;
    command_queue.finish()?;

I get the concept of [x, y, z], but I do not get how to "calculate" what the global work size should be in relation to the local work size.

Brandon, this is not the correct forum to ask OpenCL questions.
Post a question on Stack Overflow or read a book on OpenCL.
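
For anyone landing here with the same question, here is a rough sketch of the usual relationship, not an official opencl3 example. The global work size is the total number of work-items per dimension, and the local work size is the number of work-items per work-group. In OpenCL 1.x the global size must be a multiple of the local size in each dimension. Each local dimension must not exceed the matching entry of CL_DEVICE_MAX_WORK_ITEM_SIZES, and the product of the local dimensions must not exceed CL_DEVICE_MAX_WORK_GROUP_SIZE. The global size is normally driven by the problem size rather than by CL_DEVICE_MAX_COMPUTE_UNITS. The helper names below (round_up_to_multiple, local_size_fits, global_sizes) are made up for illustration and are not part of opencl3; the code is plain Rust and does not call OpenCL.

    /// Round `n` up to the nearest multiple of `local` (one dimension).
    /// Hypothetical helper, not part of opencl3.
    fn round_up_to_multiple(n: usize, local: usize) -> usize {
        assert!(local > 0, "local work size must be non-zero");
        (n + local - 1) / local * local
    }

    /// Check a candidate local work size against the device limits
    /// (the values reported by max_work_item_sizes and max_work_group_size).
    fn local_size_fits(local: &[usize], max_item_sizes: &[usize], max_group_size: usize) -> bool {
        local.len() <= max_item_sizes.len()
            // each dimension must be at least 1 and within its per-dimension limit
            && local.iter().zip(max_item_sizes).all(|(l, max)| *l >= 1 && l <= max)
            // the whole work-group must fit within the device's work-group limit
            && local.iter().product::<usize>() <= max_group_size
    }

    /// Global work sizes for `problem` items per dimension, padded so that
    /// each dimension is a multiple of the matching local size.
    fn global_sizes(problem: &[usize], local: &[usize]) -> Vec<usize> {
        problem
            .iter()
            .zip(local)
            .map(|(&n, &l)| round_up_to_multiple(n, l))
            .collect()
    }

With the limits printed above ([256, 256, 256] and 256), a 1D local size of 256 fits and a 2D local size of [16, 16] fits, but [256, 256] does not, because its product exceeds 256. A 1920 x 1080 problem with a [16, 16] local size would be padded to global sizes [1920, 1088], which presumably could be passed to set_global_work_sizes and set_local_work_sizes. Because padding launches extra work-items, the kernel then needs a bounds check against the real input size.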