PyTorch Baseline perhaps too weak?
xinli-git opened this issue · 0 comments
xinli-git commented
I am wondering whether the PyTorch baseline is actually optimized enough. Specifically, could you:
- Remove autocast, since the model is already in FP16? autocast would actually run some non-GEMM kernels in FP32 (or TF32 on Ampere GPUs) rather than FP16.
- Run some warm-up iterations before measuring inference latency (averaged over several runs), like you did with TensorRT?
- Set flags such as
torch.backends.cudnn.benchmark = True
before running GPU kernels?
On my local machine, just these changes ("optimizations" for lack of a better word, since they are not really optimizations) make the PyTorch baseline at least 2x faster.
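For concreteness, the suggestions above can be sketched as a minimal timing harness. This is an illustrative example, not the repo's actual benchmark code; `benchmark`, `model`, and the inputs are placeholders, and the GPU path is identical apart from moving the model and tensor to CUDA:

```python
import time
import torch

def benchmark(model, x, warmup=10, iters=50):
    """Hypothetical helper: warm up, then report mean latency per iteration."""
    # Let cuDNN autotune and cache the fastest algorithms for these shapes.
    torch.backends.cudnn.benchmark = True
    model.eval()
    with torch.no_grad():
        # Warm-up iterations absorb one-time costs (CUDA context init,
        # cuDNN autotuning, memory-allocator growth) before timing starts.
        for _ in range(warmup):
            model(x)
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # drain pending kernels before timing
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # ensure all timed kernels finished
        return (time.perf_counter() - start) / iters  # mean seconds/iter

# Tiny CPU example just to show usage; replace with the real model/input.
net = torch.nn.Linear(8, 8)
lat = benchmark(net, torch.randn(1, 8))
print(f"mean latency: {lat * 1e3:.3f} ms")
```

Note there is deliberately no `torch.autocast` context here: with an FP16 model, running it as-is keeps the non-GEMM kernels in FP16 too.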