intel/clDNN

Why does clDNN conv2d barely use any GPU shared memory (__local)

Laurawly opened this issue · 3 comments

Hi clDNN team! I recently looked into your convolution code and found that, except for the Winograd algorithm, the conv2d primitive doesn't use any __local cache, which should be the fastest GPU cache. I ran on an Intel Gen9 GPU and the convolution is still pretty fast. I'm still studying the story behind the performance boost, and it would be great if you could give any insights.

Hi Laurawly,
We did a lot of experiments, and the fastest implementation we were able to provide was the one without __local memory.
Regards,
Tomek

Hi Laurawly,

Many of our convolution kernels use subgroup shuffle and broadcast functions to share data instead of shared local memory. On our devices, this lets us share data through the register file, which is even faster than going through shared local memory.
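Here's a minimal sketch of the idea (not actual clDNN code; the kernel name and the toy 3-tap sum are made up for illustration). Each work-item keeps its loaded value in a register, and `intel_sub_group_shuffle` from the `cl_intel_subgroups` extension lets neighboring lanes read that register directly, with no __local staging buffer and no barrier:

```c
#pragma OPENCL EXTENSION cl_intel_subgroups : enable

// Toy example: 3-tap sliding sum over a 1D buffer, sharing data
// between lanes through the register file instead of __local memory.
__kernel void sliding_sum3(__global const float* in, __global float* out)
{
    const size_t gid  = get_global_id(0);
    const uint   lane = get_sub_group_local_id();
    const uint   sgsz = get_sub_group_size();

    // One global load per work-item; the value stays in a register.
    float v = in[gid];

    // Read the neighboring lanes' registers directly. The __local
    // equivalent would need a store, a barrier, and a re-load.
    float left  = intel_sub_group_shuffle(v, (lane == 0) ? 0u : lane - 1);
    float right = intel_sub_group_shuffle(v, min(lane + 1, sgsz - 1));

    out[gid] = left + v + right;
}
```

Note that the sharing happens only within a subgroup, so in this sketch the 3-tap window clamps at subgroup boundaries; a real kernel would handle the halo explicitly.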

Here's an article that describes how to use subgroup functions to accelerate SGEMM using a similar technique:

https://software.intel.com/en-us/articles/sgemm-for-intel-processor-graphics
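For reference, the core trick looks roughly like this (a simplified sketch in the spirit of the article, not its actual kernel; it assumes K is a multiple of the subgroup size, subgroups run along dimension 0, and a global NDRange of (N, M)). Each lane loads one element of a row of A, and `sub_group_broadcast` hands that value to every lane, so A is fetched once per subgroup instead of once per work-item:

```c
#pragma OPENCL EXTENSION cl_khr_subgroups : enable

// Naive SGEMM with subgroup data sharing: C = A * B,
// A is M x K, B is K x N, C is M x N, row-major.
__kernel void sgemm_subgroup(__global const float* A,
                             __global const float* B,
                             __global float* C,
                             int K, int N)
{
    const size_t col  = get_global_id(0);   // one column of C per lane
    const size_t row  = get_global_id(1);
    const uint   lane = get_sub_group_local_id();
    const uint   sgsz = get_sub_group_size();

    float acc = 0.0f;
    for (int k = 0; k < K; k += sgsz) {
        // Each lane loads one element of A's row; the broadcasts below
        // then hand it to every lane, so A is read once per subgroup
        // and never touches __local memory.
        float a = A[row * K + k + lane];
        for (uint i = 0; i < sgsz; ++i)
            acc += sub_group_broadcast(a, i) * B[(k + i) * N + col];
    }
    C[row * N + col] = acc;
}
```

The article adds blocking, vector loads, and image-based variants on top of this, but the register-file sharing shown here is the part that replaces __local tiling.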

@bashbaug Thanks a lot for your reply! The article helps me a lot in understanding intel_sub_groups. I wonder whether the extension package optimizes the OpenCL shared-memory usage patterns or whether it directly optimizes the base code underneath OpenCL.