daadaada/turingas

fp16 winograd

clarencewxl opened this issue · 1 comments

In the paper, you mentioned that the implementation can be ported to an fp16 version.
So, have you succeeded in implementing fp16 Winograd with Tensor Cores and beating cuDNN's performance?

I found that cuDNN doesn't have an fp16 Winograd 3x3 convolution, only an fp16 GEMM-based 3x3 convolution. I have no idea why NVIDIA hasn't implemented one.

Hi.

I have not implemented a fused fp16 Winograd kernel on Tensor Cores yet.

I believe cuDNN's non-fused Winograd algorithm uses Tensor Cores.
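
To illustrate what "non-fused" means here, below is a minimal NumPy sketch of Winograd F(2x2, 3x3) split into separate stages, assuming the standard transform matrices from the Winograd convolution literature. It is not cuDNN's code or the kernel from this repo, and the function names `winograd_nonfused` and `direct_conv` are made up for the example. The point is that stage 3 is a batch of 16 ordinary GEMMs, which is the kind of work a GPU library could hand to Tensor Cores, whereas a fused kernel would do all stages in one pass.

```python
import numpy as np

# Standard F(2x2, 3x3) transform matrices (assumed, not taken from this repo).
BT = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], dtype=np.float32)
G = np.array([[1, 0, 0], [.5, .5, .5], [.5, -.5, .5], [0, 0, 1]], dtype=np.float32)
AT = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=np.float32)


def winograd_nonfused(x, w):
    """x: (C, H, W) input, w: (K, C, 3, 3) filters; H-2 and W-2 must be even."""
    C, H, W = x.shape
    K = w.shape[0]
    th, tw = (H - 2) // 2, (W - 2) // 2

    # Stage 1: filter transform U = G g G^T, laid out as (4, 4, K, C).
    U = np.einsum('ij,kcjl,ml->imkc', G, w, G)

    # Stage 2: input transform V = B^T d B on overlapping 4x4 tiles, (4, 4, C, T).
    tiles = np.stack([x[:, 2 * i:2 * i + 4, 2 * j:2 * j + 4]
                      for i in range(th) for j in range(tw)], axis=1)
    V = np.einsum('ij,ctjl,ml->imct', BT, tiles, BT)

    # Stage 3: 16 independent GEMMs, (K x C) @ (C x T) for each of the 4x4 positions.
    M = np.matmul(U, V)  # (4, 4, K, T)

    # Stage 4: output transform Y = A^T m A, then scatter the 2x2 tiles back.
    Y = np.einsum('ai,imkt,bm->ktab', AT, M, AT)  # (K, T, 2, 2)
    return Y.reshape(K, th, tw, 2, 2).transpose(0, 1, 3, 2, 4).reshape(K, 2 * th, 2 * tw)


def direct_conv(x, w):
    """Reference 3x3 cross-correlation without padding, for checking the result."""
    C, H, W = x.shape
    out = np.zeros((w.shape[0], H - 2, W - 2), dtype=np.float32)
    for r in range(3):
        for s in range(3):
            out += np.einsum('kc,chw->khw', w[:, :, r, s], x[:, r:r + H - 2, s:s + W - 2])
    return out
```

Running both functions on the same random input should give matching outputs up to float32 rounding. The sketch runs everything in float32; what precision each stage uses in an fp16 GPU implementation is a separate design question that it does not try to answer.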