stochasticai/x-stable-diffusion

PyTorch Baseline perhaps too weak?

xinli-git opened this issue · 0 comments

I am wondering whether the PyTorch baseline is actually optimized enough. Specifically, could you (see the sketch after this list):

  • Remove autocast, since the model is already in FP16? autocast would actually run some non-GEMM FP16 ops in FP32 (or TF32 in the case of Ampere GPUs), adding overhead.
  • Run some warm-up iterations before measuring the inference latency (averaged across a few runs), like you did with TensorRT?
  • Use flags such as torch.backends.cudnn.benchmark = True before running GPU kernels.
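
For reference, a minimal sketch of what such a baseline could look like. It assumes the diffusers StableDiffusionPipeline and a hypothetical model id and prompt; the repo's actual benchmark script, model, and step count may differ.

```python
import time
import torch
from diffusers import StableDiffusionPipeline

# Hypothetical model id / prompt for illustration only.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# Let cuDNN autotune convolution algorithms for the fixed input shapes.
torch.backends.cudnn.benchmark = True

prompt = "a photo of an astronaut riding a horse"

# Warm-up iterations: trigger CUDA context init, cuDNN autotuning, and any
# lazy kernel compilation before timing starts.
for _ in range(2):
    # No torch.autocast here -- the weights are already FP16, so autocast
    # would only add overhead and push some ops back to FP32/TF32.
    pipe(prompt, num_inference_steps=50)

# Timed runs, averaged across a few iterations.
n_iters = 5
torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(n_iters):
    pipe(prompt, num_inference_steps=50)
torch.cuda.synchronize()
print(f"avg latency: {(time.perf_counter() - start) / n_iters:.3f} s")
```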

On my local machine, just these changes (calling them "optimizations" is a stretch, since they are really just benchmarking hygiene) make the PyTorch baseline at least 2X faster.