Estimate TFLOPS of PyTorch Matrix Multiplication Operators from Kineto Trace
🚀 Motivation and context
Performance metrics like TFLOPS (10^12 floating-point operations per second) and memory bandwidth utilization (GB per second) are crucial for optimizing the performance of matrix multiplication operators and understanding how those operators utilize the GPU hardware. These metrics are not immediately available in a Kineto trace, but they can be derived from it using the operator input dimensions, kernel execution time, etc. Thus, we request that these TFLOPS metrics be added to HTA.
Description
FLOPS calculation
Assuming a matrix multiplication C = A × B, where A has shape (M, K) and B has shape (K, N), the number of floating-point operations is

FLOPs = 2 × M × N × K

Here, the factor of 2 accounts for one multiply and one add per each of the M × N × K multiply-accumulate operations. The achieved TFLOPS is then FLOPs divided by the kernel execution time (in seconds), divided by 10^12.
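As a rough sketch of the requested computation (the function name and argument names below are illustrative, not part of HTA's API; in practice M, K, N would come from the operator's input dimensions in the trace and the duration from the matching GPU kernel event):

```python
def matmul_tflops(m: int, k: int, n: int, kernel_time_us: float) -> float:
    """Estimate achieved TFLOPS for an (M, K) x (K, N) matrix multiplication.

    kernel_time_us: GPU kernel duration in microseconds, as recorded
    in the Kineto trace.
    """
    flops = 2 * m * n * k            # one multiply + one add per MAC
    seconds = kernel_time_us * 1e-6  # trace durations are in microseconds
    return flops / seconds / 1e12


# Example: a 4096 x 4096 x 4096 matmul that runs in 1000 us
# achieves 2 * 4096^3 / 1e-3 / 1e12 ≈ 137.4 TFLOPS.
print(matmul_tflops(4096, 4096, 4096, 1000.0))
```

A similar calculation with bytes moved (reading A and B, writing C) in place of FLOPs would yield the memory bandwidth figure in GB/s.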
Alternatives
No response
Additional context
No response