fp16 winograd
clarencewxl opened this issue · 1 comment
clarencewxl commented
In the paper, you mentioned that the implementation can be ported to an fp16 version.
Have you succeeded in implementing fp16 Winograd with Tensor Cores, and does it beat cuDNN's performance?
I found that cuDNN has no fp16 Winograd algorithm for 3x3 convolution, only an fp16 GEMM-based one. I have no idea why NVIDIA doesn't implement one.
daadaada commented
Hi.
I have not implemented fused Tensor Core fp16 Winograd yet.
I believe cuDNN's non-fused Winograd leverages Tensor Cores.
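
For readers new to the Winograd convolution discussed here, below is a minimal sketch of the 1D building block F(2,3), which produces two outputs of a 3-tap filter with four multiplications instead of six. It is plain Python for illustration only: the function names `winograd_f23` and `direct_conv3` are hypothetical, and this is not the repository's kernel or cuDNN's implementation. The 3x3 case nests the same transforms along rows and columns.

```python
def winograd_f23(d, g):
    """Winograd F(2,3): two outputs of a 3-tap filter over four inputs."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g

    # Filter transform (can be precomputed once per filter).
    u0 = g0
    u1 = (g0 + g1 + g2) / 2
    u2 = (g0 - g1 + g2) / 2
    u3 = g2

    # Input transform.
    v0 = d0 - d2
    v1 = d1 + d2
    v2 = d2 - d1
    v3 = d1 - d3

    # Four element-wise multiplications (the only multiplies on data).
    m0, m1, m2, m3 = u0 * v0, u1 * v1, u2 * v2, u3 * v3

    # Output transform.
    return (m0 + m1 + m2, m1 - m2 - m3)


def direct_conv3(d, g):
    """Reference: direct 3-tap sliding-window filter, six multiplications."""
    return tuple(sum(d[i + k] * g[k] for k in range(3)) for i in range(2))


if __name__ == "__main__":
    d = [1.0, 2.0, 3.0, 4.0]
    g = [1.0, 0.0, -1.0]
    print(winograd_f23(d, g), direct_conv3(d, g))
```

The two functions should agree up to rounding. In a batched, non-fused implementation the element-wise products become matrix multiplications over channels, which is the step that could map onto Tensor Cores.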