nnstreamer/nntrainer

[ HGEMM ] Half-Precision GEMM Roadmap

skykongkong8 opened this issue · 4 comments

1. Objective

The aim of this project is to implement an optimized half-precision GEMM for Armv8.2 using NEON.

2. Roadmap

Consider a GEMM of the form:

$$A( M , K ) * B( K , N ) = C( M , N )$$

Step1. Vanilla HGEMM

  • vanilla implementation of half-precision GEMM with NEON

Step2. Kernel-based HGEMM

  • GEMM with no transpose : A * B = C
  • GEMM with transpose A : A.T * B = C
  • GEMM with transpose B : A * B.T = C
  • GEMM with transpose AB : A.T * B.T = C
  • GEMM with scale (alpha, beta) : C = C * beta + A * ( alpha * B )
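As a reading aid for the variants listed above, here is a minimal scalar reference sketch, not the NEON kernel itself. It assumes row-major storage and uses `float` as a stand-in for half precision so it runs on any host; the name `gemm_ref` and its parameter order are hypothetical and not taken from nntrainer. A reference like this is mainly useful for checking an optimized kernel's output.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical reference GEMM: C = beta * C + alpha * op(A) * op(B).
// op(X) is X or X.T depending on the trans_* flags. Row-major storage.
// A is stored as M x K (or K x M if trans_a), B as K x N (or N x K if trans_b).
void gemm_ref(bool trans_a, bool trans_b,
              std::size_t M, std::size_t N, std::size_t K,
              float alpha, const std::vector<float>& A,
              const std::vector<float>& B,
              float beta, std::vector<float>& C) {
  for (std::size_t i = 0; i < M; ++i) {
    for (std::size_t j = 0; j < N; ++j) {
      float acc = 0.0f;
      for (std::size_t k = 0; k < K; ++k) {
        // Pick the element of op(A) at (i, k) and of op(B) at (k, j).
        const float a = trans_a ? A[k * M + i] : A[i * K + k];
        const float b = trans_b ? B[j * K + k] : B[k * N + j];
        acc += a * b;
      }
      C[i * N + j] = beta * C[i * N + j] + alpha * acc;
    }
  }
}
```

Calling it with `alpha = 1` and `beta = 0` gives the plain products from the first four bullets; the last bullet corresponds to non-trivial `alpha` and `beta`.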

Step3. Advanced optimization

Not strictly necessary, but we might need these later (?)

  • fused HGEMM with activation
  • asm-based kernel
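To illustrate what "fused HGEMM with activation" could mean, the sketch below applies ReLU to each accumulator before it is stored, so C is written once instead of being written by the GEMM and then re-read by a separate activation pass. This is only an assumption about the intended fusion: ReLU is chosen as an example, `float` stands in for half precision, and `gemm_relu_fused` is a hypothetical name.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical fused kernel: C = relu(A * B), row-major, no transpose.
// The activation runs in the epilogue, while the accumulator is still
// in a register, instead of in a second pass over C.
std::vector<float> gemm_relu_fused(std::size_t M, std::size_t N, std::size_t K,
                                   const std::vector<float>& A,
                                   const std::vector<float>& B) {
  std::vector<float> C(M * N);
  for (std::size_t i = 0; i < M; ++i) {
    for (std::size_t j = 0; j < N; ++j) {
      float acc = 0.0f;
      for (std::size_t k = 0; k < K; ++k) {
        acc += A[i * K + k] * B[k * N + j];
      }
      C[i * N + j] = std::max(acc, 0.0f);  // fused ReLU epilogue
    }
  }
  return C;
}
```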

3. Keep in mind that...

1. Concerns about precision

  • nvidia fp16 paper

    • Tensor Cores, evenly distributed across 80 multiprocessors.
      Each Tensor Core possesses a mixed-precision 4×4×4 matrix
      processing array which performs the operation D = A×B+C,
      where A, B, C and D are 4 × 4 matrices. The inputs A and
      B must be represented in FP16 format, while C and D can
      be represented in FP16 or in FP32 formats.
      It is also possible that C and D point to the same matrix.
  • hyperclova

  • gemmlowp

    • at uint_16-32 GEMM, they use up to 16 * ACC24 (don't know why)
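The precision concern can be made concrete with a small host-side experiment comparing an FP16 accumulator against an FP32 accumulator, the latter being what the Tensor Core description above allows for C and D. The sketch below is an approximation under stated assumptions: `round_to_half` is a hypothetical helper that rounds a `float` to the 11-bit significand of IEEE binary16 and ignores overflow and subnormals, and the two dot-product functions are illustrative only.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: round x to the nearest value with an 11-bit
// significand (10 stored bits + 1 implicit), as in IEEE binary16.
// Assumption: overflow (> 65504) and subnormals are not modelled.
float round_to_half(float x) {
  if (x == 0.0f || !std::isfinite(x)) return x;
  int e = 0;
  const float m = std::frexp(x, &e);            // x = m * 2^e, 0.5 <= |m| < 1
  const float r = std::nearbyint(std::ldexp(m, 11));  // ties-to-even by default
  return std::ldexp(r, e - 11);
}

// Dot product where the running sum is rounded to half after every step.
float dot_fp16_acc(const std::vector<float>& a, const std::vector<float>& b) {
  float acc = 0.0f;
  for (std::size_t k = 0; k < a.size(); ++k) {
    const float p = round_to_half(round_to_half(a[k]) * round_to_half(b[k]));
    acc = round_to_half(acc + p);
  }
  return acc;
}

// Dot product with half-precision inputs but a float accumulator.
float dot_fp32_acc(const std::vector<float>& a, const std::vector<float>& b) {
  float acc = 0.0f;
  for (std::size_t k = 0; k < a.size(); ++k) {
    acc += round_to_half(a[k]) * round_to_half(b[k]);
  }
  return acc;
}
```

With two vectors of 4096 ones (that is, K = 4096), the FP16 accumulator stalls at 2048, because 2049 is not representable in half precision and rounds back to 2048, while the FP32 accumulator reaches 4096. This is one reason a kernel might accumulate in higher precision for large K, at some cost in throughput.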

2. Justification of optimal GEMM implementation

:octocat: cibot: Thank you for posting issue #2583. The person in charge will reply soon.

It might be better to reference the PR number for each finished item.
I agree about Step 3. We can defer it until we have enough time.

Right, but I am tracking detailed progress updates in >Projects/Half-Precision GEMM.
I will also mention this issue in every related PR.

Anyone who wants to discuss this further can reopen the issue.
Closing temporarily, but it will be updated from time to time.