Qualcomm Adreno TVM Evaluation Repo

Disclaimer: This is a development repository. Texture memory support is currently being upstreamed and has been cut from the TVM subtree contained herein. See the TVM discuss forum RFC for more information and links to relevant PRs.

Last TVM commit this repository was evaluated against and confirmed working (01/28/2021): 4abbe4902e451cc5a963b8b60a70e548d48ace62.

For testing texture memory support, please use the tvm repository included as a subtree in this repository: tvm.

Questions or issues using the scripts? Submit a ticket via the OctoML helpdesk.

Testing model performance with texture memory:

# float16 compute, float16 accumulate
python ./scripts/evaluate.py -m mobilenetv1 -t float16 -k android --target="opencl --device=adreno" -l ./logs/mobilenetv1.texture.float16.acc16.autotvm.log

# float16 compute, float32 accumulate
python ./scripts/evaluate.py -m mobilenetv1 -t float16 -k android --target="opencl --device=adreno" -l ./logs/mobilenetv1.texture.float16.acc32.autotvm.log

# float16 compute, float16 accumulate
python ./scripts/evaluate.py -m resnet50 -t float16 -k android --target="opencl --device=adreno" -l ./logs/resnet50.texture.float16.acc16.autotvm.log

# float16 compute, float32 accumulate
python ./scripts/evaluate.py -m resnet50 -t float16 -k android --target="opencl --device=adreno" -l ./logs/resnet50.texture.float16.acc32.autotvm.log
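
As written above, the two accumulation variants of each model differ only in the tuning log passed with -l. The following is a minimal sketch, not part of the repository, that builds the same four command lines programmatically. The helper name `evaluate_command` is hypothetical, and the log path pattern is inferred from the four file names listed above.

```python
import shlex


def evaluate_command(model, accumulate_bits, rpc_key="android"):
    """Build the argv list for one scripts/evaluate.py run (hypothetical helper)."""
    # Log naming pattern inferred from the commands in this README.
    log = f"./logs/{model}.texture.float16.acc{accumulate_bits}.autotvm.log"
    return [
        "python", "./scripts/evaluate.py",
        "-m", model,
        "-t", "float16",
        "-k", rpc_key,
        "--target=opencl --device=adreno",
        "-l", log,
    ]


# One command per (model, accumulation width) pair listed in this README.
commands = [
    evaluate_command(model, bits)
    for model in ("mobilenetv1", "resnet50")
    for bits in (16, 32)
]

for argv in commands:
    # shlex.join quotes the --target argument so the line can be pasted into a shell.
    print(shlex.join(argv))
```

Each argv list can be passed to `subprocess.run(argv, check=True)` once the RPC tracker and device described below are available.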

For more information, refer to the instructions below for running scripts/evaluate.py.

Running texture.py tests:

scripts/texture.py is a set of compute and schedule definitions for various workloads that use a texture memory cache stage when the -m "texture" argument is supplied. Each test checks its numerical results against NumPy reference results. Some of the tests can be tuned with the --tune flag. Log files with AutoTVM tuning records exist in the logs/ directory for many of these tunable tests. See below for a few example invocations that run a tuned schedule with texture memory.

usage: scripts/texture.py [-h] [-m MEMORY] [-s] [-l LOG] [-T] -t TEST
                  [-r RPC_TRACKER_HOST] [-p RPC_TRACKER_PORT] [-k RPC_KEY]

Set test arguments

optional arguments:
  -h, --help            show this help message and exit
  -m MEMORY, --memory MEMORY
                        Use global or texture
  -s, --shared          Use shared memory
  -l LOG, --log LOG     AutoTVM tuning record logfile
  -T, --tune            Whether to tune or not
  -t TEST, --test TEST  Selected test to run
  -r RPC_TRACKER_HOST, --rpc_tracker_host RPC_TRACKER_HOST
                        RPC tracker host IP address
  -p RPC_TRACKER_PORT, --rpc_tracker_port RPC_TRACKER_PORT
                        RPC tracker host port
  -k RPC_KEY, --rpc_key RPC_KEY
                        RPC key to use
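
The file passed with -l is an AutoTVM tuning record log. As a rough sketch for inspecting such a log without TVM installed, the snippet below assumes the common AutoTVM JSON-lines layout: one JSON object per line with "input", "config", and "result" fields, where "result" starts with a list of measured costs in seconds followed by an error number (0 meaning success). That layout is an assumption and may differ between TVM versions; `best_records` is a hypothetical helper, not a TVM API.

```python
import json


def best_records(lines):
    """Return {workload: (mean_cost_seconds, config)} for the fastest valid record.

    Assumes each non-empty line is a JSON object with "input", "config" and
    "result" keys, where result[0] is a list of costs and result[1] an error code.
    """
    best = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        costs, error_no = record["result"][0], record["result"][1]
        if error_no != 0:
            # Skip measurements that failed to build or run.
            continue
        mean_cost = sum(costs) / len(costs)
        # Serialize the workload description so it can be used as a dict key.
        workload = json.dumps(record["input"], sort_keys=True)
        if workload not in best or mean_cost < best[workload][0]:
            best[workload] = (mean_cost, record["config"])
    return best
```

To use it on one of the logs in this repository, open the file and pass the file object, for example `best_records(open("logs/conv2d_NCHWc_KCRSk_tx_tune_1024.log"))`.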

Example invocations:

# ------------------------
# Conv2d VGG16 layer [3x3]
# ------------------------

# Memory hierarchy: shared->local
$ python scripts/texture.py -r 0.0.0.0 -p 9191 -k android --test=conv2d_NCHWc_KCRSk_tx_tune2 -l logs/conv2d_NCHWc_KCRSk_tx_tune2.autotvm.shared.log
> 115.4 GFLOPS

# Memory hierarchy: texture->shared->local
$ python scripts/texture.py -r 0.0.0.0 -p 9191 -k android --test=conv2d_NCHWc_KCRSk_tx_tune2 -l logs/conv2d_NCHWc_KCRSk_tx_tune2.texture.shared.autotvm.best.log -m texture -s
> 116.9 GFLOPS

# Memory hierarchy: texture->local
$ python scripts/texture.py -r 0.0.0.0 -p 9191 -k android --test=conv2d_NCHWc_KCRSk_tx_tune2 -m texture -l logs/conv2d_NCHWc_KCRSk_tx_tune2.texture.noshared.autotvm.log
> 147.6 GFLOPS

# ------------------------------
# Conv2d MobilenetV1 layer [1x1]
# ------------------------------

# Memory hierarchy: shared->local
$ python scripts/texture.py -r 0.0.0.0 -p 9191 -k android --test=conv2d_NCHWc_KCRSk_tx_tune -l logs/conv2d_NCHWc_KCRSk_tx_tune_1024.log -s
> 100.2 GFLOPS

# Memory hierarchy: texture->shared->local
$ python scripts/texture.py -r 0.0.0.0 -p 9191 -k android --test=conv2d_NCHWc_KCRSk_tx_tune -l logs/conv2d_NCHWc_KCRSk_tx_tune_1024.log -s -m "texture"
> 89.2 GFLOPS

# Memory hierarchy: texture->local
$ python scripts/texture.py -r 0.0.0.0 -p 9191 -k android --test=conv2d_NCHWc_KCRSk_tx_tune -l logs/conv2d_NCHWc_KCRSk_tx_tune.texture.noshared.log -m "texture"
> 137.5 GFLOPS
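
To compare the memory hierarchies, the GFLOPS figures reported above can be expressed relative to the shared->local baseline. The sketch below only restates the numbers from the example invocations; the dictionary layout and the `speedups` helper are illustrative.

```python
# GFLOPS figures copied from the example invocations above.
results = {
    "VGG16 3x3": {
        "shared->local": 115.4,
        "texture->shared->local": 116.9,
        "texture->local": 147.6,
    },
    "MobilenetV1 1x1": {
        "shared->local": 100.2,
        "texture->shared->local": 89.2,
        "texture->local": 137.5,
    },
}


def speedups(per_hierarchy, baseline="shared->local"):
    """Return each hierarchy's throughput as a ratio of the baseline hierarchy."""
    base = per_hierarchy[baseline]
    return {name: gflops / base for name, gflops in per_hierarchy.items()}


for layer, per_hierarchy in results.items():
    for name, ratio in speedups(per_hierarchy).items():
        print(f"{layer:16s} {name:24s} {ratio:.2f}x")
```

On these figures, texture->local is roughly 1.28x (VGG16 3x3) and 1.37x (MobilenetV1 1x1) the shared->local throughput, while texture->shared->local is roughly 1.01x and 0.89x respectively.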

Setting up the host development machine

On the host machine (typically your development box) you'll need to build TVM.

git clone https://github.com/apache/incubator-tvm.git --recursive
cd incubator-tvm
mkdir build
cp cmake/config.cmake build/.
echo 'set(USE_LLVM llvm-config)' >> build/config.cmake
echo 'set(USE_GRAPH_RUNTIME_DEBUG ON)' >> build/config.cmake
cd build
cmake ..
make -j8

Cross compiling the C++ RPC server for Android

Refer to the documentation here to cross compile the C++ RPC binary and tvm_runtime libraries for Android.

To run, use adb to push the cross-compiled tvm_rpc binary and libtvm_runtime.so shared library to /data/local/tmp on the Android device. Then start the RPC server with:

adb shell
cd /data/local/tmp
LD_LIBRARY_PATH=. ./tvm_rpc server --tracker=<tracker IP>:<tracker port> --key=android
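
The push step described above can be scripted. This is a minimal sketch that assumes adb is on the PATH and a single device is attached. The local artifact paths are placeholders for wherever your cross-compiled build puts them, and both helper names are hypothetical.

```python
import subprocess

DEVICE_DIR = "/data/local/tmp"


def adb_push_commands(artifacts, device_dir=DEVICE_DIR):
    """Return one `adb push` argv list per local artifact path."""
    return [["adb", "push", path, device_dir] for path in artifacts]


def push_all(artifacts, device_dir=DEVICE_DIR):
    """Run the push commands, raising CalledProcessError if any of them fails."""
    for argv in adb_push_commands(artifacts, device_dir):
        subprocess.run(argv, check=True)
```

For example, `push_all(["tvm_rpc", "libtvm_runtime.so"])` copies both files from the current directory to the device.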

Setting up the RPC device tracker

Once TVM is built on the host, you'll need to launch the RPC tracker service with the following command:

python -m tvm.exec.rpc_tracker --host=<tracker IP> --port=<tracker port> --port-end=9192

Here, <tracker IP> is the IP address of the host and <tracker port> can be, for example, 9191.

When done, you can register the Android device on the tracker with the same key used to run the on-device RPC server:

python -m tvm.exec.rpc_server --tracker <tracker host>:<tracker port> --key android

Finally, make sure that the hardware is properly registered to the tracker. On the host, or any machine connected to the local network, check the devices registered on the tracker with the following command:

python -m tvm.exec.query_rpc_tracker --host <tracker IP> --port <tracker port>
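
The same check can be made from Python. The sketch below uses `tvm.rpc.connect_tracker` and the tracker session's `text_summary()`, which is what the query command prints. The `key_in_summary` helper is hypothetical and assumes that a registered key appears as the first column of a row in the summary's queue status table, which may vary between TVM versions.

```python
def fetch_tracker_summary(host, port):
    """Return the tracker's text summary; requires TVM to be importable."""
    # Imported lazily so key_in_summary can be used without TVM installed.
    from tvm import rpc

    tracker = rpc.connect_tracker(host, port)
    return tracker.text_summary()


def key_in_summary(summary, key):
    """Return True if some line of the summary has `key` as its first column."""
    return any(line.split()[:1] == [key] for line in summary.splitlines())
```

With the tracker from the previous step running, `key_in_summary(fetch_tracker_summary("<tracker IP>", 9191), "android")` should be true once the device has registered.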

Using the experiment script

Under scripts you'll find a Python script, evaluate.py, that can evaluate or tune a set of models.

Below is the usage for the script, which you can print with:

$ python3 scripts/evaluate.py -h

usage: evaluate.py [-h] -m
                   {resnet50,mobilenetv1,inceptionv3,vgg16,mobilenetv3_ssdlite,deeplabv3}
                   [-t {float32,float16}] [-l LOG] [-k RPC_KEY]
                   [-r RPC_TRACKER_HOST] [-p RPC_TRACKER_PORT] [-T TARGET]
                   [--tune TUNE] [--debug DEBUG]

Tune and/or evaluate a curated set of models

optional arguments:
  -h, --help            show this help message and exit
  -m {resnet50,mobilenetv1,inceptionv3,vgg16,mobilenetv3_ssdlite,deeplabv3}, --model {resnet50,mobilenetv1,inceptionv3,vgg16,mobilenetv3_ssdlite,deeplabv3}
                        Model to tune and/or evaluate
  -t {float32,float16}, --type {float32,float16}
                        Specify whether the model should be run with single or
                        half precision floating point values
  -l LOG, --log LOG     AutoTVM tuning logfile name
  -k RPC_KEY, --rpc_key RPC_KEY
                        RPC key to use
  -r RPC_TRACKER_HOST, --rpc_tracker_host RPC_TRACKER_HOST
                        RPC tracker host IP address
  -p RPC_TRACKER_PORT, --rpc_tracker_port RPC_TRACKER_PORT
                        RPC tracker host port
  -T TARGET, --target TARGET
                        Compilation target
  --tune TUNE           Whether or not to run autotuning
  --debug DEBUG         Use graph runtime debugger to output per layer perf.
                        data and other statistics

Known issues

Currently, running with -m deeplabv3 -t float16 produces an internal invariant violation in TVM. This is a known issue and is under investigation.
