VertexC/dl-infer-perf

tvm inference performance get worse when batchsize is larger

Opened this issue · 1 comments

Currently tvm schedules are mostly optimized for batch_size=1. To use batch_size is larger than 1, one needs to compile modle with AutoTvm.

https://discuss.tvm.apache.org/t/more-slower-use-tvm-than-mxnet-when-i-use-batch-forward/1810/2

Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -model=unknown -thread_warp_size=32, workload=('conv2d_nchw.cuda', ('TENSOR', (2, 3, 224, 224), 'float32'), ('TENSOR', (64, 3, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -model=unknown -thread_warp_size=32, workload=('conv2d_nchw.cuda', ('TENSOR', (2, 64, 224, 224), 'float32'), ('TENSOR', (64, 64, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -model=unknown -thread_warp_size=32, workload=('conv2d_nchw.cuda', ('TENSOR', (2, 64, 112, 112), 'float32'), ('TENSOR', (128, 64, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -model=unknown -thread_warp_size=32, workload=('conv2d_nchw.cuda', ('TENSOR', (2, 128, 112, 112), 'float32'), ('TENSOR', (128, 128, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.