flame/blis

Discussion for new level-1v/-1m-like operations

fgvanzee opened this issue · 27 comments

Let's discuss new operations that we might like to add to BLIS, specifically those that would fall into level-1v or level-1m families (and perhaps level-2):

  • element-wise vector/matrix multiplication
    • fused element-wise vector ops
  • element-wise vector/matrix division
  • vector reduction
  • partial matrix reductions
    • reduce rows to vector
    • reduce columns to vector
  • "matrix-times-diagonal-matrix", e.g. C_ij = A_ij * b_j
  • c = diag(A*trans?(B)) (c_i = A_ij*B_ji or A_ij*B_ij)
  • mixed-precision/mixed-domain level1/2
  • parallelization for level1/2

We will add to this list as the discussion unfolds!
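For concreteness, here is a NumPy sketch of the reference semantics of a few of the proposed operations (variable names are illustrative only, not a proposed API):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((4, 3))
b = rng.standard_normal(3)

# element-wise matrix multiplication and division
C_mul = A * B
C_div = A / B

# partial matrix reductions: reduce rows / columns to a vector
row_sums = A.sum(axis=1)   # one value per row
col_sums = A.sum(axis=0)   # one value per column

# "matrix-times-diagonal-matrix": C_ij = A_ij * b_j
C_diag = A * b[np.newaxis, :]

# c = diag(A @ B^T), i.e. c_i = sum_j A_ij * B_ij
c = np.einsum('ij,ij->i', A, B)
```

Note that the last operation computes only the diagonal, so a direct implementation costs O(mn) rather than the O(m^2 n) of forming A @ B.T first.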

Not sure if these are in scope, but a function like einsum would be useful; einsum is a dependency for Gaussian mixture modeling (and adjacent algorithms).

Generally, any functionality in BLIS that could reflect, or be utilized to implement, functionality found in numpy would be especially valuable: https://numpy.org/doc/stable/reference/routines.linalg.html

@ct-clmsn some of the things I have in mind can be used as kernels for various combinations of einsum parameters. Providing einsum itself is almost certainly out-of-scope, but yes we'd definitely like to be able to provide all of the basic operations that einsum would use internally. Is there some taxonomy/list of these?

@devinamatthews I'll have to dig through the source to find them, shouldn't take too much time. Will get back with a list soon.

Source implementation is here.

Looks like the dot product is the only thing happening under the hood (get_sum_of_products_function); the indexing mechanism, which works a little like a simple domain-specific language for describing looping mechanics/loads/stores, is where the 'magic' happens. So it's out of scope.

There's an optimization effort for the functionality that's worth noting.

I found the performance of the level-1f and level-2 operators (axpyf, gemv, etc.) is lower than expected on SKX. It seems the current kernels were not designed with AVX512 in mind. Shall we take AVX512 into account when planning the level-1v/-1m work?
And how about multithreading in the planned work? I'm not sure if it is a good fit for level-1f/-1m, but I'm very interested in this issue.

I found the performance of the level-1f and level-2 operators (axpyf, gemv, etc.) is lower than expected on SKX. It seems the current kernels were not designed with AVX512 in mind. Shall we take AVX512 into account when planning the level-1v/-1m work?

Fair question. I think for now we're only gathering candidate operations to add to BLIS. The microarchitectures and instruction sets for which to optimize those (or existing) operations are mostly orthogonal, IMO.

And how about multithreading in the planned work? I'm not sure if it is a good fit for level-1f/-1m, but I'm very interested in this issue.

Multithreading for non-level-3 operations is one of those action items we've wanted for a long time, but not badly enough relative to everything else we want. 😂 It will happen someday!

Good initiative, I would add to the original list:

  • a, b, c, d, e are vectors (real or complex)
  • element-wise dual-multiply dual-add/sub:
    • e = (conj?(a)*b) + (conj?(c)*d)
    • e = (conj?(a)*b) - (conj?(c)*d)
    • e = -(conj?(a)*b) - (conj?(c)*d)
  • element-wise dual-multiply dual-add/sub accumulate:
    • e += (conj?(a)*b) + (conj?(c)*d)
    • e += (conj?(a)*b) - (conj?(c)*d)
    • e += -(conj?(a)*b) - (conj?(c)*d)
  • element-wise vector type conversion (narrow, widen). Does the copyv kernel already support type conversion?
    • a = type(conj?(b)) where type() can be S, D, C, Z, as well as half (H) and complex half (CH)
  • Ideally, I would like to see fp16 half (H) and complex half (CH) in the BLAS/BLIS API as well.

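To make the dual-multiply dual-add/sub semantics above concrete, here is a NumPy sketch of the accumulate-free variant, e = (conj?(a)*b) ± (conj?(c)*d) (the function name and keyword arguments are my own, purely illustrative):

```python
import numpy as np

def dmda(a, b, c, d, conj_a=False, conj_c=False, sign=+1):
    # e = conj?(a)*b + sign * conj?(c)*d  (element-wise)
    ac = np.conj(a) if conj_a else a
    cc = np.conj(c) if conj_c else c
    return ac * b + sign * (cc * d)

a = np.array([1 + 2j, 3 - 1j])
b = np.array([2 + 0j, 1 + 1j])
c = np.array([0 + 1j, 2 + 2j])
d = np.array([1 - 1j, 0 + 3j])

e = dmda(a, b, c, d, conj_a=True)       # e = conj(a)*b + c*d
```

The accumulate forms (e += ...) would simply add the same right-hand side into an existing e.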
@hominhquan for 2./3., are these building blocks for something like e = \sum_i conj?(a_i)*b_i?

@hominhquan 4./5. are on the radar!

@realab-ai thanks for letting us know about this! Would you mind opening an issue for this with some basic performance comparison data? I think that actually using the AVX2 level1v/f kernels from Zen could help a lot. I've heard conflicting things about the benefits of AVX512 for such operations due to the more severe throttling.

@ct-clmsn I created a taxonomy of "feasible" unary and binary tensor operations in TBLIS, and these reduce naturally to matrix ops. That actually is sort of what prompted this issue. Hopefully those same operations would be those that slot into opt_einsum as basic kernels.

@hominhquan for 2./3., are these building blocks for something like e = \sum_i conj?(a_i)*b_i?

Kind of. I've seen some DSP algorithms on which they can be useful. They are like the next-gen of FMA operations with more flops.

@realab-ai thanks for letting us know about this! Would you mind opening an issue for this with some basic performance comparison data? I think that actually using the AVX2 level1v/f kernels from Zen could help a lot. I've heard conflicting things about the benefits of AVX512 for such operations due to the more severe throttling.

Ok. I'll do this soon.

@hominhquan Cool. I ask because if the goal is N simultaneous element-wise multiplications where N > 1 but not necessarily just 2, then this is more like a level1f kernel than level1. I'll add it to the list.

Native support for einsum would be awesome. Don't limit yourselves just because the creators of BLAS stopped at two indices in the 1980s. Why shouldn't BLIS try to solve directly one of the most widely used linear algebra APIs in all of computing?

Another feature I'd like to see: for element-wise operations, since there is no reduction nor (backward) spatio-temporal dependency in memory access as in level-3, support user-defined custom ukernels that are applied to each (vectorized) chunk of inputs (like what @devinamatthews has done in level-3):

for (int ichunk = 0; ichunk < VEC_LENGTH / SIMD_LENGTH; ichunk++) {
    simd_t achunk = *((simd_t *) &a[ichunk * SIMD_LENGTH]);
    simd_t bchunk = *((simd_t *) &b[ichunk * SIMD_LENGTH]);
    simd_t cchunk = *((simd_t *) &c[ichunk * SIMD_LENGTH]);

    simd_t rchunk = user_ukr_3inputs_simd(achunk, bchunk, cchunk);

    *((simd_t *) &res[ichunk * SIMD_LENGTH]) = rchunk;
}

// trailing elements
for (int i = (VEC_LENGTH / SIMD_LENGTH) * SIMD_LENGTH; i < VEC_LENGTH; i++) {
    ctype r = user_ukr_3inputs(a[i], b[i], c[i]);
    ...
}
  • Users are then able to implement whatever they want in user_ukr_3inputs_simd(), whenever the fusion gain is worthwhile compared to the function-call cost.
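As a scalar model of the loop structure above (full chunks first, then the trailing remainder), here is a Python sketch; the function name and chunk parameter are hypothetical, and the inner per-element calls stand in for the SIMD kernel:

```python
def fused_map3(user_ukr, a, b, c, chunk=4):
    """Apply a user-supplied ternary kernel element-wise, processing
    full 'SIMD-width' chunks first and then the trailing elements,
    mirroring the proposed fused-loop structure."""
    n = len(a)
    res = [0.0] * n
    nfull = (n // chunk) * chunk
    for i0 in range(0, nfull, chunk):        # full "SIMD" chunks
        for i in range(i0, i0 + chunk):
            res[i] = user_ukr(a[i], b[i], c[i])
    for i in range(nfull, n):                # trailing remainder
        res[i] = user_ukr(a[i], b[i], c[i])
    return res

# example: fused multiply-add r_i = a_i*b_i + c_i
out = fused_map3(lambda x, y, z: x * y + z,
                 [1, 2, 3, 4, 5], [2, 2, 2, 2, 2], [1, 1, 1, 1, 1])
```

In a real BLIS-style implementation the inner chunk loop would be a single call to the user's SIMD kernel rather than a scalar loop.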
rvdg commented

I introduced invscal in libflame, and that operation is now in BLIS as well.

It performs x := (1/alpha) x. Much cleaner than passing 1/alpha into scal.

About 10 years ago, while reading LAPACK code, I ran across some specialized code for an edge case. It is possible that 1/alpha underflows (meaning (1/alpha) x yields the zero vector) even though dividing all elements of x by alpha would give you the expected result. What I saw them do was:

  • Call a routine, SLAMCH, that returns the largest value that underflows when inverted.
  • Check alpha against this value.
  • If it is OK, scal is called with 1/alpha.
  • Otherwise, a loop is executed to do individual divides.

This suggests:

  • If not already in BLIS, we need "level 0" routines that return things like the value where underflow happens.

  • an invscalx routine that does what is described above.

Now if only I could remember in what LAPACK routine this happens...
UPDATE: an example is in sgetrf: https://netlib.org/lapack/explore-html/d8/ddc/group__real_g_ecomputational_gab9b698b35b884b4d17b9713e76fabe37.html
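A minimal Python sketch of this safe inverse-scaling logic (the function name is my own, and sys.float_info.min stands in for LAPACK's SLAMCH safe minimum; a 1/alpha that would be subnormal triggers the division fallback):

```python
import sys

def safe_invscal(alpha, x):
    """Compute x / alpha safely: if 1/alpha does not underflow,
    scale once by the reciprocal (one division, n multiplies);
    otherwise fall back to n element-wise divisions."""
    # 1/alpha stays a normal float as long as |alpha| <= 1/DBL_MIN
    if abs(alpha) <= 1.0 / sys.float_info.min:
        inv = 1.0 / alpha
        return [xi * inv for xi in x]
    return [xi / alpha for xi in x]

y = safe_invscal(4.0, [8.0, 2.0])   # -> [2.0, 0.5]
```

For huge alpha (e.g. 1e308), the reciprocal path would multiply by a subnormal (or zero) and lose the result, while the division branch still returns the expected values.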

Would sparse operations be on the table? NIST has an existing implementation available here that could be used as a template.

I found that all the band-related operations (GB, SB, HB, TB) belonging to level-1/2 are only available as f2c versions, not as BLIS implementations. What's the reason behind this?

The banded operations, like the other level2 operations, are memory-bandwidth limited. Also, because of the band structure, they are somewhat more difficult to optimize (you get some flavor of triangular operations, for example). So they basically never became a target of more in-depth optimization and inclusion in the BLIS interface. Are you interested in BLIS-style banded operations (with separate row and column strides, conjugation without transposition, etc.)?

  • Users are then able to implement whatever they want in user_ukr_3inputs_simd(), whenever the fusion gain is worthwhile compared to the function-call cost.

Take a look at the attached file, in particular the bit at the end.
vec_util.h.txt

I use this to vectorize chunks of "level1-like" code in C++ using lambdas, even quite complicated code with function calls etc.

The banded operations, like the other level2 operations, are memory bandwidth limited.

Successive band reduction algorithms based on two-sided orthogonal transformations can be reorganized to expose higher, "level3-like" arithmetic intensity.

For Householder-based approaches, the resulting kernels can be efficiently expressed as a sequence of TRMM and SYMM/GEMM calls (compressed band layouts handled by stride manipulation) --- see Fig. 1. The last reduction step (to bidiagonal, tridiagonal, or Hessenberg form) will of course be level2-like. As a complementary optimization, you can chase a train of bulges to improve locality --- see Fig. 2. All of this can be handled at the libflame/LAPACK level. I don't really think there is much to be done within BLIS for this. (EDIT: Actually, IIRC, BLIS is missing "native TRMM", so maybe this is a gap?)

For Givens-based approaches, it's a different story, and custom kernels would be beneficial. See, e.g.,
https://www.cs.utexas.edu/users/flame/pubs/RestructuredQRTOMS.pdf
(This is in the context of QR iteration on bidiagonal/tridiagonal matrices, but it's a similar flavor to the successive band reduction case. I can provide more references if desired.)

This comment is, of course, completely off-topic for the present discussion on level1 stuff.

For Givens-based approaches, it's a different story, and custom kernels would be beneficial.

Definitely. Givens kernels (including fused Givens kernels) were always something that I anticipated we would revisit once BLIS "settled down." Of course, here we are, 10+ years later, and it's still settling. 😂

Thanks for reminding me of this, @nick-knight!

For Givens-based approaches, it's a different story, and custom kernels would be beneficial. See, e.g., https://www.cs.utexas.edu/users/flame/pubs/RestructuredQRTOMS.pdf (This is in the context of QR iteration on bidiagonal/tridiagonal matrices, but it's a similar flavor to the successive band reduction case. I can provide more references if desired.)

Now I'm looking for a custom band-like kernel that will hopefully bring benefits to the Givens-based successive band reduction case. I would appreciate more references about this, @nick-knight.

Sorry that the request is off-topic. Would you please share further information to: liheng19@mails.tsinghua.edu.cn