pytorch/extension-cpp

How can I use CUDA block shared memory in a self-defined operator?

CoinCheung opened this issue · 6 comments

I am working on Ubuntu 16.04 with PyTorch 1.3 installed from conda.
My CUDA version is 10.1.243 and my cuDNN version is 7.
I have 8 T4 GPUs on my server, and the GCC version is the default 5.4.

A simplified version of my code, main.cu, is like this:

#include <torch/extension.h>
#include <ATen/ATen.h>
#include <ATen/cuda/CUDAContext.h>

#include <THC/THC.h>
#include <THC/THCAtomics.cuh>
#include <THC/THCDeviceUtils.cuh>

#include <cuda.h>
#include <cuda_runtime.h>
#include <cfloat>



// kernel function for forward and backward
template<typename scalar_t>
__global__ void TestForward() {
    int tid = threadIdx.x + (blockIdx.x * gridDim.y + blockIdx.y) * blockDim.x;
    int stride = blockDim.x * blockDim.y * gridDim.x;
    extern __shared__ scalar_t sdata[]; 
    sdata[threadIdx.x] = 0; // numer
    sdata[threadIdx.x + blockDim.x] = 0; // denor
    __syncthreads();
}


// cuda forward and backward
at::Tensor test_forward_cuda() {
    dim3 grid(4096);
    dim3 block(512);

    auto ret = at::empty(1024);
    AT_DISPATCH_FLOATING_TYPES_AND_HALF(ret.scalar_type(), "forward", [&] {
        TestForward<scalar_t><<<grid, block, 4096, at::cuda::getCurrentCUDAStream()>>>();
    });
    return ret;
}


// python interface
at::Tensor test_forward() {
    return test_forward_cuda();
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("test_forward", &test_forward, "forward");
}

and the setup.py is like this:

from setuptools import setup, Extension
from torch.utils import cpp_extension
setup(
    name='pytorch',
    ext_modules=[
        cpp_extension.CUDAExtension(
            'main_cpp',
            ['main.cu']),
    ],
    cmdclass={'build_ext': cpp_extension.BuildExtension}
)

I compiled it with the command python setup.py install, and the error message is:

main.cu(21): error: declaration is incompatible with previous "sdata"
(21): here
          detected during instantiation of "void TestForward<scalar_t>() [with scalar_t=float]" 
(34): here

main.cu(21): warning: a host variable("sdata") redeclared with __shared__
          detected during instantiation of "void TestForward<scalar_t>() [with scalar_t=float]" 
(34): here

main.cu(21): warning: a host variable("sdata") redeclared with __shared__
          detected during instantiation of "void TestForward<scalar_t>() [with scalar_t=c10::Half]" 
(34): here

1 error detected in the compilation of "/tmp/tmpxft_00022124_00000000-6_main.cpp1.ii".

What is the cause of this, and how can I deal with this problem?

I'm pretty sure you need to specify a size for the shared array sdata.
Here, if I understood correctly, it's 2 * 512?

You should probably define these numbers at the beginning of your file so that the size is set at compile time, but you still have some flexibility. See an example here: https://github.com/ClementPinard/Pytorch-Correlation-extension/blob/master/Correlation_Module/correlation_cuda_kernel.cu#L47
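
For instance, a rough sketch of what I mean with a statically sized shared array (the kernel name and BLOCK_SIZE constant here are just illustrative, not from your code):

template<typename scalar_t>
__global__ void TestForwardStatic() {
    // size is fixed at compile time, so no third launch parameter is needed
    constexpr int BLOCK_SIZE = 512;
    __shared__ scalar_t sdata[2 * BLOCK_SIZE];
    sdata[threadIdx.x] = 0;               // numer
    sdata[threadIdx.x + BLOCK_SIZE] = 0;  // denor
    __syncthreads();
}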

@ClementPinard Thanks for replying!! As far as I know, CUDA seems to support dynamically allocated shared memory within a block, declared like extern __shared__ scalar_t sdata[]. Doesn't PyTorch support that? What if the size of the shared memory is not known at compile time? Do we have an option to assign the size dynamically?

Ah, actually you are right, you can use dynamic shared arrays.
Apparently, you need to specify the shared memory size as an extra option in the kernel launch.
See here: https://devblogs.nvidia.com/using-shared-memory-cuda-cc/

But I did call the kernel function with the shared memory size assigned; I called it like this:

TestForward<scalar_t><<<grid, block, 4096, at::cuda::getCurrentCUDAStream()>>>();

I assigned 4 KB of shared memory to each block in this way. Would you please tell me why this does not work?

OK, sorry about misleading you, your code is mostly fine. I tried your code, and the problem seems to come from the template and the fact that you use three different specializations of it (float, double and half), because CUDA doesn't allow two differently typed dynamic shared arrays with the same name ¯\_(ツ)_/¯

See here for more info: https://stackoverflow.com/questions/27570552/templated-cuda-kernel-with-dynamic-shared-memory

In the end, you need to replace the line

extern __shared__ scalar_t sdata[];

with the two lines

extern __shared__ __align__(sizeof(scalar_t)) unsigned char sdata_uchar[];
scalar_t *sdata = reinterpret_cast<scalar_t *>(sdata_uchar);
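
Putting it together with your original snippet, the whole thing would look roughly like this (same includes as your main.cu; as a sketch, I also compute the shared-memory byte count per dtype instead of hard-coding 4096, which only happens to match 2 * 512 floats):

// kernel: dynamic shared memory via an unsigned char buffer, cast per dtype
template<typename scalar_t>
__global__ void TestForward() {
    extern __shared__ __align__(sizeof(scalar_t)) unsigned char sdata_uchar[];
    scalar_t *sdata = reinterpret_cast<scalar_t *>(sdata_uchar);
    sdata[threadIdx.x] = 0;               // numer
    sdata[threadIdx.x + blockDim.x] = 0;  // denor
    __syncthreads();
}

at::Tensor test_forward_cuda() {
    dim3 grid(4096);
    dim3 block(512);

    auto ret = at::empty({1024});
    AT_DISPATCH_FLOATING_TYPES_AND_HALF(ret.scalar_type(), "forward", [&] {
        // two arrays of block.x elements each -> 2 * block.x * sizeof(scalar_t) bytes
        const size_t smem = 2 * block.x * sizeof(scalar_t);
        TestForward<scalar_t><<<grid, block, smem, at::cuda::getCurrentCUDAStream()>>>();
    });
    return ret;
}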

Thanks!!! It works now, but I have two more warnings:

/miniconda/envs/py36/lib/python3.6/site-packages/torch/include/torch/csrc/autograd/profiler.h(97): warning: attribute "__visibility__" does not apply here
warning: specified alignment (4) is different from alignment (8) specified on a previous declaration
          detected during instantiation of "void compute_numer_denor(int, const scalar_t *, const int64_t *, scalar_t *, scalar_t *, float, float) [with scalar_t=float]" 

Will it be fine if my code goes with these two warnings?