OpenPPL/ppl.nn

CUDA convolution kernel input question.

leiwen83 opened this issue · 12 comments

Hi,

I see that the currently implemented CUDA conv kernels are either fp16 or int8, and their data layout is NHWC, as required by NVIDIA's tensor cores.
So when running something like ./tools/pplnn.py, where does the layout transpose happen? On the CPU side? From the nvprof results I only see the conv kernel.

If I want to do the transpose on the GPU side, how should I change the command? Or do I need to add an additional transpose node in the ONNX file?

Si-XU commented

PPLNN automatically adds transpose nodes (called Bridge nodes in PPLNN) before and after the conv kernel.
The input bridge node changes the data type from fp32 to fp16/int8 and the layout from NCHW to NHWC.
The output bridge node changes the data type and layout back.
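
Conceptually, the input bridge does something like the following (just a numpy sketch of the type/layout change with an assumed example shape, not PPLNN's actual CUDA implementation):

import numpy as np

# An fp32 NCHW tensor, as handed to the runtime (example shape).
x_nchw = np.random.rand(1, 3, 224, 224).astype(np.float32)

# Input bridge, conceptually: cast fp32 -> fp16 and permute NCHW -> NHWC
# so the tensor-core conv kernel can consume it.
x_nhwc = np.ascontiguousarray(x_nchw.transpose(0, 2, 3, 1)).astype(np.float16)

# Output bridge, conceptually: permute NHWC -> NCHW and cast fp16 -> fp32.
y_nchw = x_nhwc.transpose(0, 3, 1, 2).astype(np.float32)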

Thanks for your reply. But when I profile with nvprof, I fail to see the bridge kernel... Am I missing anything? Or could you point me to where the transpose code lives?

I also notice that when playing with ./tools/pplnn.py, it creates several CUDA kernels whose names start with nvIdxnConv_hmma1688_nhwc*. Does that mean pplnn.py tries to find all available kernels for the specified ONNX model?

If I want the ONNX model to be fed with fp16 input, how should I change the command? It seems that it currently only supports float32 input?

Si-XU commented

Q1: A debug build shows details of each kernel. You can compile in debug mode with "./build.sh -DPPLNN_USE_CUDA=ON -DCMAKE_BUILD_TYPE=Debug".
Q2: Yes. During the compiling stage we choose about 20 candidate kernels for each conv op, and the quickest kernel is used in the forward stage.
Q3: If you want the ONNX model to be fed with fp16, you need to change your ONNX model. If the input type is float16, PPLNN reads the data as float16 directly.
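
For example, a minimal sketch with the onnx Python package that re-declares the graph input as float16 (file names are placeholders; the initializers/weights and the rest of the graph also have to be fp16-compatible, this only changes the declared input type):

import onnx
from onnx import TensorProto

model = onnx.load("model.onnx")  # placeholder file name

# Declare the first graph input as FLOAT16 so the runtime reads the data as fp16.
model.graph.input[0].type.tensor_type.elem_type = TensorProto.FLOAT16

onnx.save(model, "model_fp16.onnx")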

Hi,

with "DCMAKE_BUILD_TYPE=Debug" enabled, I could see, it would print some info during execution. I could see it seem executing Bridge_Node_Conv_0_input before the conv, but with nvprof I still cannot find this kernel reside in the gpu. So for "Bridge_Node_Conv_0_input", is it executing in the cpu side?

[DEBUG][2022-06-23 19:42:04.779][opt_graph.cc:206] Create 4 TensorImpl
[INFO][2022-06-23 19:42:04.779][opt_graph.cc:324] added 4 new bridge kernels
[INFO][2022-06-23 19:42:30.049][algo_conv_hmma.cc:134] Compiling Conv_0
[INFO][2022-06-23 19:42:35.356][opt_graph.cc:578] deleted 2 bridge kernels
[INFO][2022-06-23 19:42:35.540][validate_graph.h:143] validating graph ...
[DEBUG][2022-06-23 19:42:35.841][py_tensor.cc:60] data type of input for tensor[input] is [FLOAT32].
[INFO][2022-06-23 19:42:35.905][kernel.cc:131] Before execute kernel[Bridge_Node_Conv_0_input]

As for the fp16 ONNX, below is the error message it prints. It seems to me this is not supported?

INFO: PPLNN version: [0.8.0], commit: [fb10534b1d20ca88afeb8236834c0b5f9bbadcb8]
[ERROR][2022-06-23 19:47:02.500][utils.cc:204] unsupported onnx data type[FLOAT16] of tensor[weight]
[ERROR][2022-06-23 19:47:02.500][graph_parser.cc:45] ParseTensorProto failed: unsupported
[ERROR][2022-06-23 19:47:02.500][graph_parser.cc:274] ParseGraphInitializer failed.
[ERROR][2022-06-23 19:47:02.500][model_parser.cc:78] parse graph failed: unsupported
[ERROR][2022-06-23 19:47:02.500][runtime_builder_impl.cc:48] parse graph failed: unsupported

After adding some debug logging, I located Bridge_Node_Conv_0_input as "cuda_kernel_cvtformat_type", which is currently an empty implementation. That explains why I cannot see it in nvprof: it is never a real CUDA kernel.

What interests me most is that PPL chooses nvIdxnConv_hmma1688_nhwc_b64x32_w16x32_k32_s32_nosmem for this ONNX, but when I check the dims before its execution, they are the same as the input's. My input is [1,32,832,1920], and in CudaDataConverter::Convert the dst_desc is still [1,32,832,1920].

So am I missing something? Does it mean that nvIdxnConv_hmma1688_nhwc_b64x32_w16x32_k32_s32_nosmem takes NCHW as its input?

ouonline commented

As for the fp16 ONNX, below is the error message it prints. It seems to me this is not supported?

you can pull the latest master and try again.

Si-XU commented

My input is [1,32,832,1920], and in CudaDataConverter::Convert the dst_desc is still [1,32,832,1920]. Does it mean that nvIdxnConv_hmma1688_nhwc_b64x32_w16x32_k32_s32_nosmem takes NCHW as its input?

The shape dims are always displayed in NCHW, but the data format is NHWC. You can check the input data format by printing
input->GetShape()->GetDataFormat();
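
As a rough numpy analogy (not PPLNN code): the shape descriptor keeps reporting NCHW dims even though the elements in the buffer are ordered N,H,W,C:

import numpy as np

logical_dims = (1, 32, 832, 1920)   # what the shape descriptor prints (NCHW order)

x_nchw = np.zeros(logical_dims, dtype=np.float16)
nhwc_buffer = np.ascontiguousarray(x_nchw.transpose(0, 2, 3, 1))

print(logical_dims)        # (1, 32, 832, 1920) -- reported dims stay the same
print(nhwc_buffer.shape)   # (1, 832, 1920, 32) -- actual memory ordering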

you can pull the latest master and try again.

@ouonline with the latest commit, it still has an error:

INFO: PPLNN version: [0.8.0], commit: [32657fce0fae99bae9fff9468080d92809e79bfe]
[INFO][2022-06-28 15:49:00.920][utils.cc:456] total partition(s) of graph[torch-jit-export]: 1.
[DEBUG][2022-06-28 15:49:00.921][opt_graph.cc:206] Create 4 TensorImpl
[INFO][2022-06-28 15:49:00.921][opt_graph.cc:324] added 4 new bridge kernels
[INFO][2022-06-28 15:49:26.376][algo_conv_hmma.cc:134] Compiling Conv_0
[INFO][2022-06-28 15:49:31.730][opt_graph.cc:581] deleted 2 bridge kernels
[DEBUG][2022-06-28 15:49:31.916][buffered_cuda_device.cc:68] buffer manager[StackBufferManager] allocates [2044760064] bytes.
[INFO][2022-06-28 15:49:31.922][validate_graph.h:143] validating graph ...
Traceback (most recent call last):
  File "./tools/pplnn.py", line 619, in <module>
    SetRandomInputs(in_shapes, runtime)
  File "./tools/pplnn.py", line 400, in SetRandomInputs
    in_data = (upper_bound - lower_bound) * rng.random(dims, dtype = np_data_type) * lower_bound
  File "_generator.pyx", line 292, in numpy.random._generator.Generator.random
TypeError: Unsupported dtype dtype('float16') for random

The shape dims are always displayed in NCHW, but the data format is NHWC. You can check the input data format by printing input->GetShape()->GetDataFormat();

@Si-XU I checked the data format: DATAFORMAT_NHWC8/DATATYPE_FLOAT16. So it is NHWC.
But why is there no real conversion?...

Another question: how do I perform real inference with an ONNX model and input data? I see there are --input and --in-shapes options in pplnn.py; are both needed?

I further dumped ctx->GetInput(0)->GetBufferPtr() in ConvHmmaKernel::DoExecute and found its content is the same as what the --input image contains, which confirms there is no conversion at all. But I also see a CUDA error 209 (cudaErrorNoKernelImageForDevice) at the beginning of ConvHmmaKernel::DoExecute via cudaGetLastError(). Does this mean some kernel is not available for my platform, so the conversion is not performed? I am using an RTX 3090 for these tests.

Si-XU commented

"I am using nv3090 to do those tests."
We are not support GeForce RTX 3090 Ti right now.

If you just want to test correctness, you can change 75 to 86 on this line:
https://github.com/openppl-public/ppl.nn/blob/master/src/ppl/nn/engines/cuda/impls/CMakeLists.txt#L6

We will release Ampere kernels soon. If you want to test performance, you can wait until then.

@leiwen83 SetRandomInputs in pplnn.py doesn't support float16 currently. Generate float16 data yourself and pass it to pplnn.py by specifying --inputs.
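
For example (a sketch; numpy's Generator.random only supports float32/float64, so sample in float32 and downcast, then dump the raw buffer to a file for --inputs):

import numpy as np

rng = np.random.default_rng()

# Generator.random() rejects float16, so generate float32 and cast afterwards.
data = rng.random((1, 32, 832, 1920), dtype=np.float32).astype(np.float16)

# Dump the raw buffer; the file name here is just a placeholder.
data.tofile("conv_input.dat")

Then pass conv_input.dat to pplnn.py via --inputs.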