ROCm/rocWMMA

RDNA3 WMMA's peak performance


miloee commented

I apologize for not knowing where to ask the question.
I found architecture details at this link: tomshardware.com.
The slides show that the WMMA instruction achieves only 2x the fp16/bf16 performance of fp32 vector multiply-add.
However, the same instruction on the MI100 achieves 8x.

I would like to know the peak matrix multiplication performance of RDNA3. Thank you.

Hi, thanks for reaching out to us.

You can find some info for CDNA here: https://gpuopen.com/learn/amd-lab-notes/amd-lab-notes-matrix-cores-readme/
and RDNA here: https://gpuopen.com/learn/wmma_on_rdna3/

Be sure to read both and compare your notes, keeping in mind the significant differences in clocks, data type support and architectural details (e.g. wave size) that can influence performance.
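
As a rough illustration of why a per-clock multiplier alone does not settle the comparison: theoretical peak throughput is commonly estimated as compute units × FLOPs per clock per compute unit × clock frequency. The sketch below uses made-up placeholder values, not the specifications of any AMD GPU. The function name `peak_tflops` and every number in it are assumptions for illustration only.

```python
def peak_tflops(compute_units: int, flops_per_clock_per_cu: float, clock_ghz: float) -> float:
    """Estimate theoretical peak in TFLOP/s.

    compute_units * FLOPs/clock/CU * clock (GHz) gives GFLOP/s; divide by 1000 for TFLOP/s.
    """
    return compute_units * flops_per_clock_per_cu * clock_ghz / 1000.0


# Hypothetical devices with placeholder numbers (NOT real hardware specs).
# Device A: lower per-clock matrix rate, higher clock.
device_a = peak_tflops(compute_units=100, flops_per_clock_per_cu=512, clock_ghz=2.0)
# Device B: twice the per-clock matrix rate, half the clock.
device_b = peak_tflops(compute_units=100, flops_per_clock_per_cu=1024, clock_ghz=1.0)

print(f"device A: {device_a:.1f} TFLOP/s, device B: {device_b:.1f} TFLOP/s")
```

With these placeholder inputs both devices land on the same estimate, so a 2x versus 8x per-clock ratio has to be read together with the clock speed and compute unit count of each part.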

There is a forum to ask more questions on the above topics: https://github.com/amd/amd-lab-notes/discussions

For all other things rocWMMA, we are happy to answer right here :)

Cheers!