dmlc/mshadow

Potential CUDA kernel launching problem with the `reduce_except_dim<0,..>` operator

sxjscience opened this issue · 11 comments

I find that if we use `reduce_except_dim<0,..>`, we ultimately call MapReduceKeepDim1 (https://github.com/dmlc/mshadow/blob/master/mshadow/tensor_gpu-inl.h#L153-L155), which may have problems for large matrices. In the implementation of MapReduceKeepDim1, dimGrid is set directly to p[1] (https://github.com/dmlc/mshadow/blob/master/mshadow/cuda/tensor_gpu-inl.cuh#L183-L184), which may exceed the boundary of 65536 (https://github.com/dmlc/mshadow/blob/master/mshadow/cuda/tensor_gpu-inl.cuh#L45).
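For reference, the hardware grid limits can be queried from the CUDA runtime. This is just a standalone sketch using the plain runtime API (device index 0 is arbitrary), independent of mshadow:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);  // query device 0
  // maxGridSize holds the hard per-dimension limits on the grid size;
  // mshadow's own launch check (the "too large launch parameter" error
  // shown below) rejects grids beyond its configured bound as well.
  std::printf("maxGridSize = [%d, %d, %d]\n",
              prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
  return 0;
}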

This problem does not exist for MapReduceKeepLowest, which uses MemUnits to set the kernel launch parameters (https://github.com/dmlc/mshadow/blob/master/mshadow/cuda/tensor_gpu-inl.cuh#L142-L143). Should we change the implementation of MapReduceKeepDim1 to be similar to MapReduceKeepLowest in the future?
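(Sketch of the general idea only, not the actual MapReduceKeepLowest or MapReduceKeepDim1 code: cap the number of blocks and let each thread walk over extra rows with a grid-stride loop, so the grid size is decoupled from p[1]. `kMaxGrid` and `RowSumKernel` are made-up names.)

#include <cuda_runtime.h>

const int kMaxGrid = 65535;  // stay under the grid-dimension bound

// Row-wise sum of a row-major (n_rows x n_cols) matrix. Rows are visited
// with a grid-stride loop, so n_rows can be arbitrarily large while the
// grid never needs more than kMaxGrid blocks.
__global__ void RowSumKernel(const float *in, float *out,
                             int n_rows, int n_cols) {
  for (int row = blockIdx.x * blockDim.x + threadIdx.x; row < n_rows;
       row += gridDim.x * blockDim.x) {
    float sum = 0.0f;
    for (int col = 0; col < n_cols; ++col) {
      sum += in[row * n_cols + col];
    }
    out[row] = sum;
  }
}

// Launch with a clamped grid, independent of n_rows:
//   int grid = (n_rows + 255) / 256;
//   if (grid > kMaxGrid) grid = kMaxGrid;
//   RowSumKernel<<<grid, 256>>>(d_in, d_out, n_rows, n_cols);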

Yes. At the very least we should ensure the kernel fails with an error message when the bound gets hit.
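(A minimal sketch of such a guard, not mshadow's actual check; the helper name and the hard-coded bound are illustrative.)

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Refuse to launch instead of silently submitting an invalid grid.
inline void GuardLaunch(dim3 grid, const char *kernel_name) {
  const unsigned kMaxGridDim = 65535U;
  if (grid.x > kMaxGridDim || grid.y > kMaxGridDim) {
    std::fprintf(stderr, "too large launch parameter: %s [%u, %u, %u]\n",
                 kernel_name, grid.x, grid.y, grid.z);
    std::abort();
  }
}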

@tqchen For example, the code below hits the problem, since 100 * 100 * 100 > 65536. The output is as follows:

[14:27:59] d:\hkust\mshadow\mshadow\./cuda/tensor_gpu-inl.cuh:46: too large launch parameter: MapReduceKeepDim1[256,1,1]

#include <iostream>
#include "mshadow/tensor.h"

using namespace mshadow;
using namespace mshadow::expr;

int main() {
  InitTensorEngine<gpu>();
  Stream<gpu> *s1 = NewStream<gpu>();
  Tensor<gpu, 4, float> in = NewTensor<gpu, float>(Shape4(100, 100, 100, 2), 1.0f, 0, s1);
  Tensor<gpu, 1, float> out = NewTensor<gpu, float>(Shape1(100 * 100 * 100), 0.0f, 0, s1);
  // Keep dim 0 (100 * 100 * 100 = 10^6 entries) and reduce over the rest:
  // the kernel grid is set to 10^6, which exceeds the 65536 bound (see the error above).
  out = reduce_except_dim<0, red::sum>(reshape(in, Shape2(100 * 100 * 100, 2)));
  Tensor<cpu, 1, float> out_cpu = NewTensor<cpu, float>(Shape1(100 * 100 * 100), 0.0f);
  Copy(out_cpu, out, s1);
  s1->Wait();  // make sure the device-to-host copy has finished before reading
  std::cout << out_cpu[0] << std::endl;
  FreeSpace(&in);
  FreeSpace(&out);
  FreeSpace(&out_cpu);
  DeleteStream<gpu>(s1);
  ShutdownTensorEngine<gpu>();
  return 0;
}

However, the following code, which uses `reduce_except_dim<1,...>`, runs without any problem:

#include <iostream>
#include "mshadow/tensor.h"

using namespace mshadow;
using namespace mshadow::expr;

int main() {
  InitTensorEngine<gpu>();
  Stream<gpu> *s1 = NewStream<gpu>();
  Tensor<gpu, 4, float> in = NewTensor<gpu, float>(Shape4(100, 100, 100, 2), 1.0f, 0, s1);
  Tensor<gpu, 1, float> out = NewTensor<gpu, float>(Shape1(100 * 100 * 100), 0.0f, 0, s1);
  // Keep dim 1 (10^6 entries) and reduce over dim 0 (size 2): this takes the
  // MapReduceKeepLowest path, which sizes its launch parameters safely, so it runs fine.
  out = reduce_except_dim<1, red::sum>(reshape(in, Shape2(2, 100 * 100 * 100)));
  Tensor<cpu, 1, float> out_cpu = NewTensor<cpu, float>(Shape1(100 * 100 * 100), 0.0f);
  Copy(out_cpu, out, s1);
  s1->Wait();  // make sure the device-to-host copy has finished before reading
  std::cout << out_cpu[0] << std::endl;
  FreeSpace(&in);
  FreeSpace(&out);
  FreeSpace(&out_cpu);
  DeleteStream<gpu>(s1);
  ShutdownTensorEngine<gpu>();
  return 0;
}

Has this been fixed?

Not yet.

We recently encountered this ourselves. Is there any way to work around this at the user level?
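One possible user-level workaround is to slice the kept dimension into chunks so that each launch stays under the grid limit, and reduce chunk by chunk. This is only a sketch against the mshadow API used in the examples above (same includes and using-directives assumed); the chunk size and the `RowReduceInChunks` name are illustrative, and it assumes the input is already a 2-D GPU tensor:

// Reduce a (n, k) GPU matrix row-wise in chunks of at most kChunk rows,
// so every underlying kernel launch stays below the grid-dimension bound.
const index_t kChunk = 65535;

inline void RowReduceInChunks(Tensor<gpu, 1, float> out,
                              const Tensor<gpu, 2, float> &in) {
  for (index_t begin = 0; begin < in.size(0); begin += kChunk) {
    index_t end = begin + kChunk < in.size(0) ? begin + kChunk : in.size(0);
    // Slice() is a view on the same device memory, no copy involved.
    out.Slice(begin, end) =
        reduce_except_dim<0, red::sum>(in.Slice(begin, end));
  }
}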

Any update?

I'm away right now and will look into it when I'm back at the university. Basically, we could enlarge the grid by also using the y dimension: for example, if p[1] is larger than 65536, we can set the grid to grid(ceil(p[1] / max), max).
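A rough sketch of that idea in plain CUDA (not the actual change that went into #285; the kernel and constant names are made up): the row index is reassembled from blockIdx.x and blockIdx.y, and the padding blocks introduced by the ceiling division are skipped with a bounds check.

#include <cuda_runtime.h>

const int kMaxGridDim = 65535;

// One block per row, one thread per block for clarity; rows are spread over
// a 2-D grid so that neither grid dimension exceeds kMaxGridDim.
__global__ void RowReduceKernel2D(const float *in, float *out,
                                  int n_rows, int n_cols) {
  int row = blockIdx.y * gridDim.x + blockIdx.x;
  if (row >= n_rows) return;  // padding block from the ceiling division
  float sum = 0.0f;
  for (int col = 0; col < n_cols; ++col) {
    sum += in[row * n_cols + col];
  }
  out[row] = sum;
}

// Launch: split n_rows over x and y instead of one oversized dimension.
//   int gx = n_rows < kMaxGridDim ? n_rows : kMaxGridDim;
//   int gy = (n_rows + gx - 1) / gx;
//   RowReduceKernel2D<<<dim3(gx, gy, 1), 1>>>(d_in, d_out, n_rows, n_cols);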

@tornadomeet @taliesinb This should be solved by #285.

@sxjscience Thanks, I'll test it today.

@sxjscience I just tested it; #285 does not fix apache/mxnet#7523.

When training on a single GPU the bug is fixed, but when training on multiple GPUs it still reports an error:
[screenshot of the error output]

I think we can close this issue, because the multi-GPU bug is not related to this fix. @sxjscience I'll open a new issue in MXNet.