Benchmark timing does not change when kernel is changed
Closed this issue · 6 comments
System information
- OS Platform and Distribution: Linux Ubuntu 16.04.7 LTS
- NDK version: 19c
- MACE version: v0.13.0 (9a06864)
- Python version: 3.6.3
- CMake version: 3.19.2
Describe the problem
Benchmark timing does not change when kernel is changed; which also does not match the timing when an excerpt of the kernel is run in another OpenCL test-bench.
To Reproduce
The following experiments are done on a Pixel 3 XL phone, which integrated a Snapdragon 845 SoC with an Adreno 630 GPU, with Android 9 installed (build number is PD1A.180720.030
). The reported OpenCL driver version is OpenCL 2.0 QUALCOMM build: Perforce changelist #3929146 Compiler E031.36.01.00 (Android)
. And I'm using the official docker environment for build. The model used for testing is converted from ResNet-50 v2 from MACE model zoo.
When I changed this line
mace/mace/ops/opencl/cl/conv_2d_1x1.cl
Line 66 in 9a06864
to
for (short in_ch_blk = 0; in_ch_blk < in_ch_blks * 100; ++in_ch_blk) {
to loop for 100-times the iterations. This should effectively increase kernel time, but the benchmarked result does not change. E.g., for layer resnet_v2_50/block2/unit_1/bottleneck_v2/shortcut/BiasAdd
in the model, before changes it takes approximately 2.07ms to run but with modified kernel I still get a similar number:
I /root/mace/mace/utils/statistics.cc:349] | Conv2D | 47.890 | 2.024 | 2.024 | 2.107 | 20.632 | 93.512 | [1,1] | SAME | [512,256,1,1] | [1,38,38,512] | [1,1] | resnet_v2_50/block2/unit_1/bottleneck_v2/shortcut/BiasAdd |
I have force-captured the event timing and the kernel execution time for convolution 2d 1x1 is correctly reported. Concretely, I changed these lines
mace/mace/core/runtime/opencl/opencl_runtime.cc
Lines 755 to 762 in 9a06864
to
void OpenCLRuntime::GetCallStats(const cl::Event &event, CallStats *stats) {
if (stats != nullptr) {
std::cout << "~~ caught call time " << event.getProfilingInfo<CL_PROFILING_COMMAND_END>()
<< "-" << event.getProfilingInfo<CL_PROFILING_COMMAND_START>()
<< "=" << event.getProfilingInfo<CL_PROFILING_COMMAND_END>() - event.getProfilingInfo<CL_PROFILING_COMMAND_START>() << std::endl;
stats->start_micros =
event.getProfilingInfo<CL_PROFILING_COMMAND_START>() / 1000;
stats->end_micros =
event.getProfilingInfo<CL_PROFILING_COMMAND_END>() / 1000;
}
}
and the reported time matches the benchmark result.
~~ caught call time 1608608358349690112-1608608358347666944=2023168
But the kernel, when executed in another test-bench, should have spent around 2.60ms to execute, using also event timing.
Steps to reproduce the problem:
Copy the kernel and feed in with the following arguments and work-sizing to OpenCL kernel (collected from the implementation and the schedule helper):
cl::NDRange globalSize = cl::NDRange(128, 10, 38);
cl::NDRange localSize = cl::NDRange(8, 10, 12);
uint32_t idx = 0;
kernel.setArg(idx++, int32_t(128));
kernel.setArg(idx++, int32_t(10));
kernel.setArg(idx++, int32_t(38));
kernel.setArg(idx++, input);
kernel.setArg(idx++, filter);
kernel.setArg(idx++, bias);
kernel.setArg(idx++, output);
kernel.setArg(idx++, int32_t(0)); // relux_max_limit
kernel.setArg(idx++, int32_t(0)); // leakyrelu_coefficient
kernel.setArg(idx++, int32_t(38)); // Height
kernel.setArg(idx++, int32_t(38)); // Width
kernel.setArg(idx++, int32_t(64));
kernel.setArg(idx++, int32_t(38)); // Height
kernel.setArg(idx++, int32_t(38)); // Width
kernel.setArg(idx++, int32_t(1));
kernel.setArg(idx++, int32_t(1));
The following PowerShell scripts are used to run the benchmark.
SetUp-Mace.ps1
param(
[switch] $Clear,
[switch] $Rebuild
)
$BuildDir = "./build/cmake-build/arm64-v8a"
$Base = "/data/local/tmp/mace-dist"
if ($Rebuild) {
if ($Clear) {
docker exec -ti mace-dev sh -c "cd ~/mace/ && bash ./tools/clear_workspace.sh"
}
docker exec -ti mace-dev sh -c "export PATH=~/cmake-3.19.2-Linux-x86_64/bin:`$PATH && cd ~/mace/ && RUNTIME=GPU QUANTIZE=OFF bash ./tools/cmake/cmake-build-arm64-v8a.sh"
} else {
Write-Host "NOTE: MACE is not rebuilt"
}
docker cp mace-dev:/root/mace/build/cmake-build/arm64-v8a/install .
adb shell rm $Base -r
adb shell mkdir $Base
adb push ./install/bin $Base
adb push "$env:ANDROID_NDK_HOME/sources/cxx-stl/llvm-libc++/libs/arm64-v8a/libc++_shared.so" $Base/lib/libc++_shared.so
adb shell chmod -R 777 $Base
Run-ModelBenchmark.ps1
param(
[switch] $Rebuild
)
$Base = "/data/local/tmp/mace-dist"
if ($Rebuild) {
./SetUp-Mace.ps1 -Rebuild
}
$RunCmd = "
MACE_OPENCL_PROFILING=1
LD_LIBRARY_PATH=$Base/lib
$Base/bin/mace_run
--device='GPU'
--input_data_format='NHWC'
--output_node='resnet_v2_50/predictions/Reshape_1'
--output_shape='1,1001'
--model_data_file='/data/local/tmp/mace_run/resnet_v2_50/resnet_v2_50.data'
--input_node='input'
--model_file='/data/local/tmp/mace_run/resnet_v2_50/resnet_v2_50.pb'
--model_name='resnet_v2_50'
--output_data_format='NHWC'
--input_shape='1,299,299,3'
--input_file=/data/local/tmp/mace_run/resnet_v2_50/validate_in/resnet_v2_50
--output_file=/data/local/tmp/mace_run/resnet_v2_50/validate_out/resnet_v2_50
--benchmark
--round=1
--gpu_perf_hint=0
--gpu_priority_hint=0
--cpu_affinity_policy=0
"
$RunCmd = $RunCmd.Split() -join ' '
adb shell $RunCmd
Run: .\Run-ModelBenchmark.ps1 -Rebuild > set.log
Error information / logs
@PENGUINLIONG What is your shell command for benchmark?
@lu229 It's been attached at the end of the text. It's passed in via adb shell
.
@PENGUINLIONG In the MACE code, if you changed the opencl code, MACE will recompile the opencl code.
I want to know what is your command and Maybe you changed the MACE code and produced a bug?
@lu229 It's in the issue description text..
MACE_OPENCL_PROFILING=1
LD_LIBRARY_PATH=/data/local/tmp/mace-dist/lib
/data/local/tmp/mace-dist/bin/mace_run
--device='GPU'
--input_data_format='NHWC'
--output_node='resnet_v2_50/predictions/Reshape_1'
--output_shape='1,1001'
--model_data_file='/data/local/tmp/mace_run/resnet_v2_50/resnet_v2_50.data'
--input_node='input'
--model_file='/data/local/tmp/mace_run/resnet_v2_50/resnet_v2_50.pb'
--model_name='resnet_v2_50'
--output_data_format='NHWC'
--input_shape='1,299,299,3'
--input_file=/data/local/tmp/mace_run/resnet_v2_50/validate_in/resnet_v2_50
--output_file=/data/local/tmp/mace_run/resnet_v2_50/validate_out/resnet_v2_50
--benchmark
--round=1
--gpu_perf_hint=0
--gpu_priority_hint=0
--cpu_affinity_policy=0
Okay, I see. We just realized the obfuscator mace/python/tools/encrypt_opencl_codegen.py
is not automatically run.
But the centric issues is that our reproduction's timing does not match MACE's. I wonder if there are any OpenCL setting I need to set up to obtain the same result?
@PENGUINLIONG Sorry, I don't know what is your mean.
"But the centric issues is that our reproduction's timing does not match MACE's. I wonder if there are any OpenCL setting I need to set up to obtain the same result?"
What is the mean of "our reproduction's timing does not match MACE's"?