XiaoMi/mace

Benchmark timing does not change when kernel is changed

Closed this issue · 6 comments

System information

  • OS Platform and Distribution: Linux Ubuntu 16.04.7 LTS
  • NDK version: 19c
  • MACE version: v0.13.0 (9a06864)
  • Python version: 3.6.3
  • CMake version: 3.19.2

Describe the problem

Benchmark timing does not change when the kernel is changed; it also does not match the timing when an excerpt of the kernel is run in another OpenCL test bench.

To Reproduce

The following experiments are done on a Pixel 3 XL phone, which integrates a Snapdragon 845 SoC with an Adreno 630 GPU and runs Android 9 (build number PD1A.180720.030). The reported OpenCL driver version is OpenCL 2.0 QUALCOMM build: Perforce changelist #3929146 Compiler E031.36.01.00 (Android). I'm using the official Docker environment for the build. The model used for testing is converted from ResNet-50 v2 from the MACE model zoo.
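
For reference, the model was converted with MACE's standard converter workflow; a minimal sketch is below, where the deployment file path is an assumption (the actual YAML follows the model zoo configuration for resnet_v2_50):

python tools/converter.py convert --config=path/to/resnet_v2_50.yml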

When I changed this line

for (int in_ch_blk = 0; in_ch_blk < in_ch_blks; ++in_ch_blk) {

to

for (short in_ch_blk = 0; in_ch_blk < in_ch_blks * 100; ++in_ch_blk) {

so that the loop runs 100 times as many iterations. This should noticeably increase kernel time, but the benchmarked result does not change. E.g., the layer resnet_v2_50/block2/unit_1/bottleneck_v2/shortcut/BiasAdd takes approximately 2.07ms to run before the change, and with the modified kernel I still get a similar number:

I /root/mace/mace/utils/statistics.cc:349] |          Conv2D |  47.890 | 2.024 |   2.024 | 2.107 |  20.632 | 93.512 |  [1,1] | SAME |   [512,256,1,1] |  [1,38,38,512] |    [1,1] | resnet_v2_50/block2/unit_1/bottleneck_v2/shortcut/BiasAdd |

I have force-captured the event timing, and the kernel execution time for the 1x1 Conv2D is correctly reported. Concretely, I changed these lines

void OpenCLRuntime::GetCallStats(const cl::Event &event, CallStats *stats) {
  if (stats != nullptr) {
    stats->start_micros =
        event.getProfilingInfo<CL_PROFILING_COMMAND_START>() / 1000;
    stats->end_micros =
        event.getProfilingInfo<CL_PROFILING_COMMAND_END>() / 1000;
  }
}

to

void OpenCLRuntime::GetCallStats(const cl::Event &event, CallStats *stats) {
  if (stats != nullptr) {
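    // Note: getProfilingInfo returns timestamps in nanoseconds; the raw
    // difference is printed here, and the /1000 below converts them to microseconds.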
    std::cout << "~~ caught call time " << event.getProfilingInfo<CL_PROFILING_COMMAND_END>()
      << "-" << event.getProfilingInfo<CL_PROFILING_COMMAND_START>()
      << "=" << event.getProfilingInfo<CL_PROFILING_COMMAND_END>() - event.getProfilingInfo<CL_PROFILING_COMMAND_START>() << std::endl;

    stats->start_micros =
        event.getProfilingInfo<CL_PROFILING_COMMAND_START>() / 1000;
    stats->end_micros =
        event.getProfilingInfo<CL_PROFILING_COMMAND_END>() / 1000;
  }
}

and the reported time matches the benchmark result.

~~ caught call time 1608608358349690112-1608608358347666944=2023168

The difference of 2023168 ns is about 2.02 ms, consistent with the 2.024 ms reported above. But when the same kernel is executed in another test bench, also using event timing, it takes around 2.60ms to execute.

Steps to reproduce the problem:

Copy the kernel and invoke it with the following arguments and work sizes, collected from the implementation and the schedule helper (a sketch of the enqueue-and-timing code used by the test bench follows the argument list):

cl::NDRange globalSize = cl::NDRange(128, 10, 38);
cl::NDRange localSize = cl::NDRange(8, 10, 12);
uint32_t idx = 0;
kernel.setArg(idx++, int32_t(128));
kernel.setArg(idx++, int32_t(10));
kernel.setArg(idx++, int32_t(38));
kernel.setArg(idx++, input);
kernel.setArg(idx++, filter);
kernel.setArg(idx++, bias);
kernel.setArg(idx++, output);
kernel.setArg(idx++, int32_t(0)); // relux_max_limit
kernel.setArg(idx++, int32_t(0)); // leakyrelu_coefficient
kernel.setArg(idx++, int32_t(38)); // Height
kernel.setArg(idx++, int32_t(38)); // Width
kernel.setArg(idx++, int32_t(64));
kernel.setArg(idx++, int32_t(38)); // Height
kernel.setArg(idx++, int32_t(38)); // Width
kernel.setArg(idx++, int32_t(1));
kernel.setArg(idx++, int32_t(1));
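
The test bench then enqueues the kernel once with an attached cl::Event and computes the execution time from the event's profiling timestamps; that is where the ~2.60ms figure comes from. Below is a minimal sketch of that step, assuming a cl::CommandQueue created with CL_QUEUE_PROFILING_ENABLE and the kernel/arguments set up as above; the helper name is illustrative and not part of MACE:

#include <CL/cl.hpp>  // or CL/cl2.hpp, depending on the OpenCL C++ headers in use

// Run the kernel once and return its execution time in milliseconds,
// computed from the event's profiling timestamps (reported in nanoseconds).
double TimeKernelMs(cl::CommandQueue &queue, cl::Kernel &kernel,
                    const cl::NDRange &global, const cl::NDRange &local) {
  cl::Event event;
  queue.enqueueNDRangeKernel(kernel, cl::NullRange, global, local, nullptr, &event);
  event.wait();  // ensure the kernel has finished before reading timestamps
  cl_ulong start = event.getProfilingInfo<CL_PROFILING_COMMAND_START>();
  cl_ulong end = event.getProfilingInfo<CL_PROFILING_COMMAND_END>();
  return static_cast<double>(end - start) / 1e6;  // ns -> ms
}

A call like TimeKernelMs(queue, kernel, globalSize, localSize) is what produces the ~2.60ms figure mentioned above.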

The following PowerShell scripts are used to run the benchmark.

SetUp-Mace.ps1

param(
    [switch] $Clear,
    [switch] $Rebuild
)

$BuildDir = "./build/cmake-build/arm64-v8a"
$Base = "/data/local/tmp/mace-dist"

if ($Rebuild) {
    if ($Clear) {
        docker exec -ti mace-dev sh -c "cd ~/mace/ && bash ./tools/clear_workspace.sh"
    }
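    # Build MACE for arm64-v8a with the GPU runtime inside the mace-dev container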
    docker exec -ti mace-dev sh -c "export PATH=~/cmake-3.19.2-Linux-x86_64/bin:`$PATH && cd ~/mace/ && RUNTIME=GPU QUANTIZE=OFF bash ./tools/cmake/cmake-build-arm64-v8a.sh"
} else {
    Write-Host "NOTE: MACE is not rebuilt"
}
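# Copy the install tree out of the container, then push the binaries and libc++_shared.so to the device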
docker cp mace-dev:/root/mace/build/cmake-build/arm64-v8a/install .
adb shell rm $Base -r
adb shell mkdir $Base
adb push ./install/bin $Base
adb push "$env:ANDROID_NDK_HOME/sources/cxx-stl/llvm-libc++/libs/arm64-v8a/libc++_shared.so" $Base/lib/libc++_shared.so
adb shell chmod -R 777 $Base

Run-ModelBenchmark.ps1

param(
    [switch] $Rebuild
)

$Base = "/data/local/tmp/mace-dist"

if ($Rebuild) {
    ./SetUp-Mace.ps1 -Rebuild
}

$RunCmd = "
    MACE_OPENCL_PROFILING=1
    LD_LIBRARY_PATH=$Base/lib
    $Base/bin/mace_run
        --device='GPU'
        --input_data_format='NHWC'
        --output_node='resnet_v2_50/predictions/Reshape_1'
        --output_shape='1,1001'
        --model_data_file='/data/local/tmp/mace_run/resnet_v2_50/resnet_v2_50.data'
        --input_node='input'
        --model_file='/data/local/tmp/mace_run/resnet_v2_50/resnet_v2_50.pb'
        --model_name='resnet_v2_50'
        --output_data_format='NHWC'
        --input_shape='1,299,299,3'
        --input_file=/data/local/tmp/mace_run/resnet_v2_50/validate_in/resnet_v2_50
        --output_file=/data/local/tmp/mace_run/resnet_v2_50/validate_out/resnet_v2_50
        --benchmark
        --round=1
        --gpu_perf_hint=0
        --gpu_priority_hint=0
        --cpu_affinity_policy=0
    "

$RunCmd = $RunCmd.Split() -join ' '

adb shell $RunCmd

Run: .\Run-ModelBenchmark.ps1 -Rebuild > set.log

Error information / logs

set.log

lu229 commented

@PENGUINLIONG What is your shell command for the benchmark?

PENGUINLIONG commented

@lu229 It's attached at the end of the issue text; it's passed in via adb shell.

lu229 commented

@PENGUINLIONG In the MACE code, if you change the OpenCL code, MACE will recompile it.
I want to know what your command is; maybe you changed the MACE code and introduced a bug?

PENGUINLIONG commented

@lu229 It's in the issue description text:

MACE_OPENCL_PROFILING=1
    LD_LIBRARY_PATH=/data/local/tmp/mace-dist/lib
    /data/local/tmp/mace-dist/bin/mace_run
        --device='GPU'
        --input_data_format='NHWC'
        --output_node='resnet_v2_50/predictions/Reshape_1'
        --output_shape='1,1001'
        --model_data_file='/data/local/tmp/mace_run/resnet_v2_50/resnet_v2_50.data'
        --input_node='input'
        --model_file='/data/local/tmp/mace_run/resnet_v2_50/resnet_v2_50.pb'
        --model_name='resnet_v2_50'
        --output_data_format='NHWC'
        --input_shape='1,299,299,3'
        --input_file=/data/local/tmp/mace_run/resnet_v2_50/validate_in/resnet_v2_50
        --output_file=/data/local/tmp/mace_run/resnet_v2_50/validate_out/resnet_v2_50
        --benchmark
        --round=1
        --gpu_perf_hint=0
        --gpu_priority_hint=0
        --cpu_affinity_policy=0

PENGUINLIONG commented

Okay, I see. We just realized that the obfuscator mace/python/tools/encrypt_opencl_codegen.py is not run automatically.

But the central issue is that our reproduction's timing does not match MACE's. I wonder if there are any OpenCL settings I need to configure to obtain the same result?

lu229 commented

@PENGUINLIONG Sorry, I don't understand what you mean by:
"But the central issue is that our reproduction's timing does not match MACE's. I wonder if there are any OpenCL settings I need to configure to obtain the same result?"

What do you mean by "our reproduction's timing does not match MACE's"?