dmlc/mshadow

Potential CUDA kernel launching problem with the `reduce_except_dim<0,..>` operator

sxjscience opened this issue · 11 comments

I find that if we use `reduce_except_dim<0,..>`, we ultimately call MapReduceKeepDim1 (https://github.com/dmlc/mshadow/blob/master/mshadow/tensor_gpu-inl.h#L153-L155), which may have problems for large matrices. In the implementation of MapReduceKeepDim1, dimGrid is set directly to p[1] (https://github.com/dmlc/mshadow/blob/master/mshadow/cuda/tensor_gpu-inl.cuh#L183-L184), which may exceed the boundary of 65536 (https://github.com/dmlc/mshadow/blob/master/mshadow/cuda/tensor_gpu-inl.cuh#L45).
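For reference, the hardware grid limits can be queried from the CUDA runtime. This is just a standalone sketch using the plain runtime API (device index 0 is arbitrary), independent of mshadow:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);  // query device 0
  // maxGridSize holds the hard per-dimension limits on the grid size;
  // mshadow's own launch check (the "too large launch parameter" error
  // shown below) rejects grids beyond its configured bound as well.
  std::printf("maxGridSize = [%d, %d, %d]\n",
              prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
  return 0;
}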

This problem does not exist for MapReduceKeepLowest, which uses MemUnits to set the kernel launch parameters (https://github.com/dmlc/mshadow/blob/master/mshadow/cuda/tensor_gpu-inl.cuh#L142-L143). Should we change the implementation of MapReduceKeepDim1 to be similar to MapReduceKeepLowest in the future?
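(Sketch of the general idea only, not the actual MapReduceKeepLowest or MapReduceKeepDim1 code: cap the number of blocks and let each thread walk over extra rows with a grid-stride loop, so the grid size is decoupled from p[1]. `kMaxGrid` and `RowSumKernel` are made-up names.)

#include <cuda_runtime.h>

const int kMaxGrid = 65535;  // stay under the grid-dimension bound

// Row-wise sum of a row-major (n_rows x n_cols) matrix. Rows are visited
// with a grid-stride loop, so n_rows can be arbitrarily large while the
// grid never needs more than kMaxGrid blocks.
__global__ void RowSumKernel(const float *in, float *out,
                             int n_rows, int n_cols) {
  for (int row = blockIdx.x * blockDim.x + threadIdx.x; row < n_rows;
       row += gridDim.x * blockDim.x) {
    float sum = 0.0f;
    for (int col = 0; col < n_cols; ++col) {
      sum += in[row * n_cols + col];
    }
    out[row] = sum;
  }
}

// Launch with a clamped grid, independent of n_rows:
//   int grid = (n_rows + 255) / 256;
//   if (grid > kMaxGrid) grid = kMaxGrid;
//   RowSumKernel<<<grid, 256>>>(d_in, d_out, n_rows, n_cols);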

Yes. At the very least we should ensure the kernel fails with an error message when the bound gets hit.
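(A minimal sketch of such a guard, not mshadow's actual check; the helper name and the hard-coded bound are illustrative.)

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Refuse to launch instead of silently submitting an invalid grid.
inline void GuardLaunch(dim3 grid, const char *kernel_name) {
  const unsigned kMaxGridDim = 65535U;
  if (grid.x > kMaxGridDim || grid.y > kMaxGridDim) {
    std::fprintf(stderr, "too large launch parameter: %s [%u, %u, %u]\n",
                 kernel_name, grid.x, grid.y, grid.z);
    std::abort();
  }
}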

@tqchen For example, the code below hits the problem, since 100 * 100 * 100 > 65536. The output is as follows:

[14:27:59] d:\hkust\mshadow\mshadow\./cuda/tensor_gpu-inl.cuh:46: too large launch parameter: MapReduceKeepDim1[256,1,1]

#include <iostream>
#include "mshadow/tensor.h"

using namespace mshadow;
using namespace mshadow::expr;

int main() {
  InitTensorEngine<gpu>();
  Stream<gpu> *s1 = NewStream<gpu>();
  Tensor<gpu, 4, float> in = NewTensor<gpu, float>(Shape4(100, 100, 100, 2), 1.0f, 0, s1);
  Tensor<gpu, 1, float> out = NewTensor<gpu, float>(Shape1(100 * 100 * 100), 0.0f, 0, s1);
  // Keep dim 0 (100 * 100 * 100 = 10^6 entries) and reduce over the rest:
  // the kernel grid is set to 10^6, which exceeds the 65536 bound (see the error above).
  out = reduce_except_dim<0, red::sum>(reshape(in, Shape2(100 * 100 * 100, 2)));
  Tensor<cpu, 1, float> out_cpu = NewTensor<cpu, float>(Shape1(100 * 100 * 100), 0.0f);
  Copy(out_cpu, out, s1);
  s1->Wait();  // make sure the device-to-host copy has finished before reading
  std::cout << out_cpu[0] << std::endl;
  FreeSpace(&in);
  FreeSpace(&out);
  FreeSpace(&out_cpu);
  DeleteStream<gpu>(s1);
  ShutdownTensorEngine<gpu>();
  return 0;
}

However, the following code, which uses `reduce_except_dim<1,...>`, runs without any problem:

#include <iostream>
#include "mshadow/tensor.h"

using namespace mshadow;
using namespace mshadow::expr;

int main() {
  InitTensorEngine<gpu>();
  Stream<gpu> *s1 = NewStream<gpu>();
  Tensor<gpu, 4, float> in = NewTensor<gpu, float>(Shape4(100, 100, 100, 2), 1.0f, 0, s1);
  Tensor<gpu, 1, float> out = NewTensor<gpu, float>(Shape1(100 * 100 * 100), 0.0f, 0, s1);
  // Keep dim 1 (10^6 entries) and reduce over dim 0 (size 2): this takes the
  // MapReduceKeepLowest path, which sizes its launch parameters safely, so it runs fine.
  out = reduce_except_dim<1, red::sum>(reshape(in, Shape2(2, 100 * 100 * 100)));
  Tensor<cpu, 1, float> out_cpu = NewTensor<cpu, float>(Shape1(100 * 100 * 100), 0.0f);
  Copy(out_cpu, out, s1);
  s1->Wait();  // make sure the device-to-host copy has finished before reading
  std::cout << out_cpu[0] << std::endl;
  FreeSpace(&in);
  FreeSpace(&out);
  FreeSpace(&out_cpu);
  DeleteStream<gpu>(s1);
  ShutdownTensorEngine<gpu>();
  return 0;
}

Has this been fixed?

Not yet.

We recently encountered this ourselves. Is there any way to work around this at the user level?
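One possible user-level workaround is to slice the kept dimension into chunks so that each launch stays under the grid limit, and reduce chunk by chunk. This is only a sketch against the mshadow API used in the examples above (same includes and using-directives assumed); the chunk size and the `RowReduceInChunks` name are illustrative, and it assumes the input is already a 2-D GPU tensor:

// Reduce a (n, k) GPU matrix row-wise in chunks of at most kChunk rows,
// so every underlying kernel launch stays below the grid-dimension bound.
const index_t kChunk = 65535;

inline void RowReduceInChunks(Tensor<gpu, 1, float> out,
                              const Tensor<gpu, 2, float> &in) {
  for (index_t begin = 0; begin < in.size(0); begin += kChunk) {
    index_t end = begin + kChunk < in.size(0) ? begin + kChunk : in.size(0);
    // Slice() is a view on the same device memory, no copy involved.
    out.Slice(begin, end) =
        reduce_except_dim<0, red::sum>(in.Slice(begin, end));
  }
}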

Any update?

I'm away right now and will look into it when I'm back at the university. Basically, we could enlarge the grid by also using the y dimension: for example, if p[1] is larger than 65536, we can set the grid to grid(ceil(p[1] / max), max).
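A rough sketch of that idea in plain CUDA (not the actual change that went into #285; the kernel and constant names are made up): the row index is reassembled from blockIdx.x and blockIdx.y, and the padding blocks introduced by the ceiling division are skipped with a bounds check.

#include <cuda_runtime.h>

const int kMaxGridDim = 65535;

// One block per row, one thread per block for clarity; rows are spread over
// a 2-D grid so that neither grid dimension exceeds kMaxGridDim.
__global__ void RowReduceKernel2D(const float *in, float *out,
                                  int n_rows, int n_cols) {
  int row = blockIdx.y * gridDim.x + blockIdx.x;
  if (row >= n_rows) return;  // padding block from the ceiling division
  float sum = 0.0f;
  for (int col = 0; col < n_cols; ++col) {
    sum += in[row * n_cols + col];
  }
  out[row] = sum;
}

// Launch: split n_rows over x and y instead of one oversized dimension.
//   int gx = n_rows < kMaxGridDim ? n_rows : kMaxGridDim;
//   int gy = (n_rows + gx - 1) / gx;
//   RowReduceKernel2D<<<dim3(gx, gy, 1), 1>>>(d_in, d_out, n_rows, n_cols);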

@tornadomeet @taliesinb This should be solved by #285.

@sxjscience Thanks, I'll test it today.

@sxjscience I just tested it; #285 does not fix apache/mxnet#7523.

When training on a single GPU the bug is fixed, but when training on multiple GPUs it still reports an error:
[screenshot of the error output]

I think we can close this issue, because the multi-GPU bug is not related to this fix. @sxjscience I'll open a new issue in MXNet.