ntuliuteam/TAB

The inference speed is not faster than PyTorch nn.Conv2d() on GPU

Opened this issue · 3 comments

Hi! Thanks for the excellent work! I wrote a script to compare the speed of the BNN conv with a full-precision (FP) conv implemented in PyTorch on an Nvidia GPU, but I got results like this:
TAB Average GPU Time : 1.20968
FPConv Average GPU Time : 0.89210
The TAB kernel is slower than the full-precision conv kernel.

How can I reach the speed-up performance reported in the paper? Also, can you provide the code for model-level evaluation, such as ResNet-18?

Looking forward to your reply!!

my evaluation script:
import time
import torch
import torch.nn as nn
import torch.nn.functional as F

from tqdm import tqdm
import TAB_CUDA as TAB

def TABConv(QW, BTN_W):
    KN = 512
    KH = 3
    KW = 3
    KC = 256
    ## prepare activation
    N = 16
    H = 112
    W = 112
    C = 256

    pad1 = 1
    pad2 = 1
    str1 = 1
    str2 = 1

    x = torch.rand([N, H, W, C])
    x_ths = 0.5 * torch.ones([N])

    conv_type = 0
    y = TAB.Conv2d(x.cuda(), QW.cuda(), x_ths.cuda(), BTN_W.cuda(), conv_type, pad1, pad2, str1, str2, N, H, W, C, KN, KH, KW)

    return y

def TABConv_test():
    time_cost = 0
    test_times = 20
    ## prepare weight
    KN = 512
    KH = 3
    KW = 3
    KC = 256
    w = torch.rand([KN, KH, KW, KC])
    w_ths = 0.5 * torch.ones([KN])
    bitwidth = 1
    QW, BTN_W = TAB.Quantize(w.cuda(), w_ths.cuda(), bitwidth, KN, KH, KW, KC)
    for i in tqdm(range(test_times)):
        time_2 = time.time()
        TABConv(QW, BTN_W)
        time_3 = time.time()

        time_cost += time_3 - time_2
    time_cost = time_cost / test_times
    print(f'TAB Average GPU Time : {time_cost:.5f} ')

def FPConv(w):

    ## prepare activation
    N = 16
    H = 112
    W = 112
    C = 256

    pad1 = 1
    pad2 = 1
    str1 = 1
    str2 = 1

    x = torch.rand([N, C, H, W]).cuda()

    y = F.conv2d(x, w, padding=pad1)

    return y

def FPConv_test():
    time_cost = 0
    test_times = 20
    ## prepare weight
    KCOUT = 512
    KH = 3
    KW = 3
    KCIN = 256
    w = torch.rand([KCOUT, KCIN, KW, KH]).cuda()
    for i in tqdm(range(test_times)):
        time_2 = time.time()
        FPConv(w)
        time_3 = time.time()

        time_cost += time_3 - time_2
    time_cost = time_cost / test_times
    print(f'FPConv Average GPU Time : {time_cost:.5f} ')

if __name__ == '__main__':
    TABConv_test()
    FPConv_test()
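
One caveat about the script above: CUDA kernel launches in PyTorch are asynchronous, so wrapping an un-synchronized call in time.time() mixes launch overhead and host-side tensor creation/copies into the measurement instead of isolating the kernel itself, and there is no warm-up pass to absorb one-off initialization costs. A minimal sketch of a more controlled measurement, using only the standard torch.cuda.Event API (fn is a placeholder for either conv call, with all tensors staged on the GPU beforehand):

import torch

def time_gpu(fn, warmup=5, iters=20):
    # warm-up runs absorb one-off costs (CUDA context init, cuDNN autotuning)
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()                          # wait for every queued kernel to finish
    return start.elapsed_time(end) / iters / 1000.0   # average seconds per call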


Hi Bill He,

The TAB speedup in the paper is obtained by comparing convolution layers (FP32, INT8, ternary, and binary) that are all based on basic GEMM with OpenMP and SIMD pragmas in C++, for a fair comparison. However, deep learning frameworks like PyTorch, TensorFlow, and JAX are built on highly optimized industrial libraries such as Intel oneAPI/MKL and Nvidia cuDNN/cuBLAS. Our simple implementation has far fewer code-level optimizations and architecture-specific tunings than these frameworks. As a result, the current version of TAB only achieves speed comparable to PyTorch's full-precision implementation.

To achieve a high speedup over PyTorch full-precision layers, we would need to optimize the TAB bitwise GEMM on GPU with efficient tiling, blocking, data partitioning, prefetching, etc. You can refer to TVM, cuBLAS, LLVM Polly, and other GEMM auto-tuning tools for these optimizations.

Another option is to wait for Intel, Nvidia, or other companies and researchers to add binary/ternary bitwise GEMM support to the BLAS/GEMM libraries and deep learning frameworks.


The GEMM-level and layer-level optimizations require heavy engineering work. Since we focus on proposing an efficient framework covering encoding, data storage, and the dot product, we only provide the reference basic GEMM implementation to validate the theoretical performance. Also, TAB is not efficient at the NHWC/NCHW data format conversion and GPU memory allocation involved in integrating the layer with PyTorch.
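
For reference, the cost of that layout conversion alone can be measured with standard PyTorch; a minimal sketch (the tensor shape matches the benchmark script above, and nothing here is TAB-specific):

import torch

x = torch.rand(16, 256, 112, 112, device='cuda')   # NCHW activation, same shape as the script above
torch.cuda.synchronize()
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
x_nhwc = x.permute(0, 2, 3, 1).contiguous()        # NCHW -> NHWC materializes a full copy
x_back = x_nhwc.permute(0, 3, 1, 2).contiguous()   # NHWC -> NCHW copies again on the way out
end.record()
torch.cuda.synchronize()
print(f'layout round-trip: {start.elapsed_time(end):.3f} ms')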

If you plan to implement BNN code yourself, I suggest collaborating with companies or researchers with GEMM code-optimization experience. If you want to reuse existing code, I recommend daBNN (https://github.com/JDAI-CV/dabnn) and TVM (https://tvm.apache.org/2018/12/18/lowprecision-conv).

Finally, you can refer to this repo (https://github.com/apple/ml-quant) for the model-level evaluation. You can simply replace nn.Conv2d with our TAB Conv2d to evaluate ResNet-18; a sketch of such a replacement follows.
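
Below is a minimal sketch of what that replacement could look like. It is only an illustration: the TABConv2d wrapper class, the replace_3x3_convs helper, the NCHW-to-NHWC permute, the fixed 0.5 thresholds, and the random placeholder weights are all assumptions, and the TAB.Quantize / TAB.Conv2d argument order simply mirrors the benchmark script above (the output layout returned by TAB.Conv2d is not handled here).

import torch
import torch.nn as nn
import torchvision
import TAB_CUDA as TAB

class TABConv2d(nn.Module):
    # Hypothetical drop-in replacement for a 3x3 nn.Conv2d built on TAB.Conv2d.
    # Weights are random placeholders; a real evaluation would load a binarized checkpoint.
    def __init__(self, in_ch, out_ch, kernel_size, stride=1, padding=0, bitwidth=1):
        super().__init__()
        self.out_ch, self.k = out_ch, kernel_size
        self.stride, self.padding = stride, padding
        w = torch.rand(out_ch, kernel_size, kernel_size, in_ch)      # KN, KH, KW, KC layout
        w_ths = 0.5 * torch.ones(out_ch)
        self.QW, self.BTN_W = TAB.Quantize(w.cuda(), w_ths.cuda(), bitwidth,
                                           out_ch, kernel_size, kernel_size, in_ch)

    def forward(self, x):                                 # x: NCHW, as torchvision produces it
        n, c, h, w = x.shape
        x_nhwc = x.permute(0, 2, 3, 1).contiguous()       # TAB expects NHWC (extra copy per layer)
        x_ths = 0.5 * torch.ones(n, device=x.device)      # per-image activation thresholds
        y = TAB.Conv2d(x_nhwc, self.QW, x_ths, self.BTN_W, 0,
                       self.padding, self.padding, self.stride, self.stride,
                       n, h, w, c, self.out_ch, self.k, self.k)
        return y                                          # permute back to NCHW here if TAB returns NHWC

def replace_3x3_convs(module):
    # Recursively swap every 3x3 nn.Conv2d in the model for the TAB-based wrapper.
    for name, child in module.named_children():
        if isinstance(child, nn.Conv2d) and child.kernel_size == (3, 3):
            setattr(module, name, TABConv2d(child.in_channels, child.out_channels, 3,
                                            child.stride[0], child.padding[0]))
        else:
            replace_3x3_convs(child)

model = torchvision.models.resnet18()
replace_3x3_convs(model)
model = model.cuda()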