intel/neural-speed

Bestla Kernels understanding and benchmarking

Alavandar08 opened this issue · 8 comments

In oneDNN with low-precision data types, we have support for the u8s8s8 data type. In the Bestla benchmark infra we can find a couple of classes for low-precision types, including u8s8s32, s8s8s32, and some classes with different clip dtypes - Ref: https://github.com/intel/neural-speed/blob/main/bestla/bestla/ut/bestla_benchmark.cpp

Question: Within Bestla, do we have support only for s32 output (i.e. u8s8s32/s8s8s32), or do we also have support for s8 output (i.e. u8s8s8/s8s8s8)?


For the Bestla benchmark, we have instructions here to build and benchmark with the Bestla kernels (Ref: https://github.com/intel/neural-speed/tree/main/bestla#benchmark)

Question: Do we have any specific environment variables that need to be set to get the best performance out of the Bestla kernels?

Question: Within Bestla, do we have support only for s32 output (i.e. u8s8s32/s8s8s32), or do we also have support for s8 output (i.e. u8s8s8/s8s8s8)?

It depends on the epilogue classes: AccumulatorWriteBackInt32 outputs an int32 result, while AlphaBetaProcessS32U8 outputs a u8 result.
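To make that concrete, here is a minimal standalone sketch of the idea (hypothetical names and a simplified scale/zero-point scheme, not the actual BesTLA epilogue implementations): the GEMM accumulates into int32 tiles, and the epilogue type parameter decides how they are written back.

```cpp
// Conceptual illustration only: the int8 GEMM accumulates into int32, and the
// epilogue decides the output dtype. Names are hypothetical, not BesTLA's.
#include <algorithm>
#include <cstdint>
#include <vector>

// Writes the int32 accumulator back unchanged (analogue of an int32 writeback).
struct WriteBackInt32 {
  using DstT = int32_t;
  DstT operator()(int32_t acc) const { return acc; }
};

// Scales and offsets the int32 accumulator, then saturates to u8 (analogue of
// an "alpha/beta s32 -> u8" epilogue). alpha/zero_point are assumptions.
struct AlphaBetaToU8 {
  using DstT = uint8_t;
  float alpha = 1.0f;
  int zero_point = 0;
  DstT operator()(int32_t acc) const {
    int v = static_cast<int>(acc * alpha) + zero_point;
    return static_cast<DstT>(std::clamp(v, 0, 255));
  }
};

// The epilogue type is what selects the GEMM output dtype.
template <class Epilogue>
std::vector<typename Epilogue::DstT> apply_epilogue(const std::vector<int32_t>& acc,
                                                    Epilogue epi = {}) {
  std::vector<typename Epilogue::DstT> out(acc.size());
  std::transform(acc.begin(), acc.end(), out.begin(), epi);
  return out;
}

int main() {
  std::vector<int32_t> acc = {-300, 0, 4096};
  auto s32_out = apply_epilogue<WriteBackInt32>(acc);             // int32 output path
  auto u8_out  = apply_epilogue(acc, AlphaBetaToU8{0.05f, 128});  // u8 output path
  return 0;
}
```

Selecting a different epilogue class is therefore what switches the benchmark between the s32 and u8 output paths.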

Question: Do we have any specific environment variables that need to be set to get the best performance out of the Bestla kernels?

No env variable. It is better to run the benchmark on a socket with a single NUMA node; one CPU socket with multiple NUMA nodes has a performance issue. If you are running the benchmark on hybrid CPUs, please add this to the CMake command: -DBTLA_UT_OPENMP=OFF

Thanks @luoyu-intel for the clarification. I have a few follow-up questions on the same topic.

I have been using this benchmarking infra provided in the repo
https://github.com/intel/neural-speed/tree/main/bestla --> bestla/bestla/ut/bestla_benchmark.cpp
```sh
mkdir build && cd build
cmake .. -DBTLA_UT_BENCHMARK=ON -DBTLA_UT_ALL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build . -j
./bin/bestla_benchmark
```

With this infra I have benchmarked the Bestla kernels for u8s8s32 (AccumulatorWriteBackInt32) and u8s8u8 (AlphaBetaProcessS32U8), and I have also benchmarked oneDNN kernels using benchdnn, since it also supports low-precision kernels - https://github.com/oneapi-src/oneDNN/blob/main/tests/benchdnn/README.md.

The results are as follows:
[benchmark results image: Bestla vs oneDNN timings]

The Bestla kernels are run with u8s8s32 and the oneDNN kernels are run with u8s8s8. With Bestla I have also verified the 8-bit output type (i.e. AlphaBetaProcessS32U8), and we observe up to a 5% improvement over the 32-bit output type.

Question 1: On the Bestla side, the benchmark infra (./bin/bestla_benchmark) is what is being used to get OP-level timings for different ISAs. I would like to confirm whether I can proceed further with the above script/infra for more OP-level analysis.


Question 2: From the above image, we observe that the Bestla micro kernels are not on par with or faster than the oneDNN kernels. What might be the reason for the Bestla time not being faster than the oneDNN time?

Parallelism

Neural Speed provides functionality called tensor parallelism, and Bestla also provides parallelism functionality through its parallel template classes.

Question: Is parallelization taken care of by Bestla, by Neural Speed, or by Neural Speed followed by the Bestla micro kernels?

@Alavandar08

I would like to confirm whether I can proceed further with the above script/infra for more OP-level analysis.

What do you mean by "more OP-level analysis"?

What might be the reason for the Bestla time not being faster than the oneDNN time?

BesTLA was developed by a tiny group at Intel (~3 people) but has covered all ISAs since AVX2. So we are not able to make it as fast as oneDNN on arbitrary devices with arbitrary cores and arbitrary problem sizes. Our highlight is supporting other low-bit types via C++ templates, like int3, int4, int5.
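For a sense of what supporting low-bit types via templates involves, each sub-byte type needs its own packing and dequantization. Below is a minimal int4 sketch as an illustration only; BesTLA's actual S4_CLIP layout and group-wise scales may differ.

```cpp
// Two signed 4-bit values packed per byte, dequantized with a single scale.
// Illustration only; not BesTLA's actual S4_CLIP storage format.
#include <cstdint>
#include <vector>

// Pack: store the low nibble of each weight (assumes values already in [-8, 7]).
std::vector<uint8_t> pack_s4(const std::vector<int8_t>& w) {
  std::vector<uint8_t> out((w.size() + 1) / 2, 0);
  for (size_t i = 0; i < w.size(); ++i) {
    uint8_t nib = static_cast<uint8_t>(w[i] & 0x0F);
    out[i / 2] |= (i % 2 == 0) ? nib : static_cast<uint8_t>(nib << 4);
  }
  return out;
}

// Unpack element i, sign-extend the nibble, and dequantize with a scale.
float unpack_dequant_s4(const std::vector<uint8_t>& packed, size_t i, float scale) {
  uint8_t nib = (i % 2 == 0) ? (packed[i / 2] & 0x0F) : (packed[i / 2] >> 4);
  int v = static_cast<int>(nib ^ 0x8) - 8;  // map [0,15] back to [-8,7]
  return v * scale;
}
```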

Question: Is parallelization taken care of by Bestla, by Neural Speed, or by Neural Speed followed by the Bestla micro kernels?

TP is done by Neural Speed. To better support Intel's new Xeon CPU, we will support it inside BesTLA.

Thanks @luoyu-intel for the quick response.

What do you mean by "more OP-level analysis"?

I was referring to running more arbitrary problem sizes and observing their behavior.

So can we continue with the infra below to run arbitrary problem sizes (specifically with the low-bit C++ templates, to observe their impact)? - https://github.com/intel/neural-speed/tree/main/bestla

So we are not able to make it as fast as oneDNN on arbitrary devices with arbitrary cores and arbitrary problem sizes.

Question: Do you have any suggestions on devices, core counts, and problem sizes where we can observe BesTLA performing better than oneDNN?

Yes, you can add the problem sizes to the benchmark's source code and then compile and run it. We are not planning to provide benchdnn-like CLI parameters.
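The pattern is simply to hard-code the (m, n, k) shapes you care about in the source and time them. A generic standalone sketch of that pattern (not the actual bestla_benchmark.cpp code; the shapes and the kernel callback are placeholders) might look like:

```cpp
// Generic sketch of "add your own problem sizes and time them"; the real
// benchmark in bestla_benchmark.cpp wires such shapes into its GEMM classes.
#include <chrono>
#include <cstdio>
#include <functional>
#include <vector>

struct Shape { int m, n, k; };

// gemm_fn stands in for whatever kernel launch you want to measure.
void run_shapes(const std::vector<Shape>& shapes,
                const std::function<void(int, int, int)>& gemm_fn, int iters = 50) {
  for (const auto& s : shapes) {
    gemm_fn(s.m, s.n, s.k);  // warm-up so caches/threads are initialized
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iters; ++i) gemm_fn(s.m, s.n, s.k);
    auto t1 = std::chrono::high_resolution_clock::now();
    double us = std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
    double gflops = 2.0 * s.m * s.n * s.k / (us * 1e3);  // 2*m*n*k flops per GEMM
    std::printf("m=%d n=%d k=%d: %.1f us, %.1f GFLOPS\n", s.m, s.n, s.k, us, gflops);
  }
}

int main() {
  // Add arbitrary problem sizes here, e.g. typical LLM GEMM shapes.
  std::vector<Shape> shapes = {{1, 4096, 4096}, {32, 4096, 4096}, {1, 11008, 4096}};
  run_shapes(shapes, [](int m, int n, int k) {
    (void)m; (void)n; (void)k;  // call the kernel under test here
  });
  return 0;
}
```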

Question: Do you have any suggestions on devices, core counts, and problem sizes where we can observe BesTLA performing better than oneDNN?

I'd suggest working on this Scheduler class. It schedules the problem size across cores and does the cache-blocking work. Optimizing the schedule for one problem size may have a ~10% performance impact.
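To make the cache-blocking idea concrete, here is a rough sketch of the kind of decision such a scheduler makes (an illustration with made-up heuristics, not the actual BesTLA Scheduler class): split the output across cores, then pick tile sizes so the per-tile working set fits in cache.

```cpp
// Conceptual GEMM work scheduling: partition the output across threads, then
// choose tile sizes so the A and B tiles stay within a cache budget.
// Illustration only; not the actual BesTLA Scheduler.
#include <algorithm>
#include <cstdio>

struct Blocking { int n_per_thread, tile_m, tile_n, tile_k; };

Blocking schedule(int m, int n, int k, int threads, int l2_bytes, int elt_size) {
  Blocking b{};
  // 1) Split the N dimension evenly across threads (one simple strategy).
  b.n_per_thread = (n + threads - 1) / threads;
  // 2) Pick tile sizes so an A tile (tile_m x tile_k) plus a B tile
  //    (tile_k x tile_n) stays within roughly half of the L2 cache.
  int budget = l2_bytes / 2 / elt_size;  // elements that fit in half of L2
  b.tile_m = std::min(m, 64);
  b.tile_n = std::min(b.n_per_thread, 64);
  b.tile_k = std::max(1, std::min(k, budget / (b.tile_m + b.tile_n)));
  return b;
}

int main() {
  Blocking b = schedule(1024, 4096, 4096, /*threads=*/16,
                        /*l2_bytes=*/2 << 20, /*elt_size=*/1);
  std::printf("n per thread=%d, tiles: m=%d n=%d k=%d\n",
              b.n_per_thread, b.tile_m, b.tile_n, b.tile_k);
  return 0;
}
```

Tuning these splits for a specific (m, n, k) is the kind of 10%-level optimization referred to above.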

Sure @luoyu-intel, Thanks

Here is my use case: I am trying to run a Llama model from Hugging Face with low-precision data types (int8, int4) through ipex llm and other libraries. Based on the above discussion:

Our highlight is supporting other low-bit types via C++ templates, like int3, int4, int5.

Question 1: In order to achieve the best performance with INT8, would you suggest using oneDNN over Bestla (since Bestla's focus is on other low-precision data types) and comparing against ipex llm?

Question 2: With the INT4 dtype, would you suggest using Bestla kernels to get the best performance over ipex llm?

Question 1: In order to achieve the best performance with INT8, would you suggest using oneDNN over Bestla (since Bestla's focus is on other low-precision data types) and comparing against ipex llm?

oneDNN requires an activation reorder in many cases on both CPU and GPU, but benchdnn does not include the reorder process (as I remember). So I'm not sure about this.

Question 2: With the INT4 dtype, would you suggest using Bestla kernels to get the best performance over ipex llm?

I'm not familiar with ipex llm's int4 performance.

Thanks @luoyu-intel.

As the Bestla kernels' performance focus is mostly on other low-precision types, like int3, int4, int5,

I am trying to utilize the int4 kernels from the benchmark's source code and then compile it (https://github.com/intel/neural-speed/tree/main/bestla --> bestla/bestla/ut/bestla_benchmark.cpp).

For INT4, to extract the time taken for arbitrary sizes, I have used the UTWOQ_CompInt8 class for computation with the data type BTLA_DTYPE::S4_CLIP and scale types BF16 and F32.
I have noticed that the data format is as follows - Input: F32, Weights: INT4, Output: F32

auto memsize = gemm_memsize(m, n, k, BTLA_DTYPE::F32, qtype, BTLA_DTYPE::F32);

Epilogue:
I was looking at the epilogue classes for post-processing from F32 to a low-precision type (8-bit and 4-bit). Here we can find different writebacks, e.g. from F32 to BF16 and to INT32.

using AccumulatorWriteBackFp32 = AccumulatorWriteBack<float, float>;

Question 1: I was looking for an API that does the writeback from F32 to 8-bit or 4-bit. Do we have any API which supports this case?

Prologue:
I am trying to find something similar in the prologue classes for data-type conversion from F32 to INT8 to handle the computation.
Question 2: Can you help me by pointing to the API which takes care of this specific data-type conversion?
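For context on Questions 1 and 2, what I have in mind boils down to a scale/zero-point step between F32 and INT8. A minimal sketch of that arithmetic (hypothetical helper names, per-tensor scale assumed, not BesTLA prologue/epilogue APIs):

```cpp
// Illustration of the F32 <-> INT8 conversions being asked about; names and
// the per-tensor scale/zero-point scheme are assumptions, not BesTLA APIs.
#include <algorithm>
#include <cmath>
#include <cstdint>

// Prologue direction: quantize an F32 activation to INT8 for computation.
inline int8_t quantize_f32_to_s8(float x, float scale, int zero_point) {
  int q = static_cast<int>(std::lround(x / scale)) + zero_point;
  return static_cast<int8_t>(std::clamp(q, -128, 127));
}

// Epilogue direction: write an F32 accumulator back as an INT8 output.
inline int8_t writeback_f32_to_s8(float acc, float out_scale, int out_zero_point) {
  return quantize_f32_to_s8(acc, out_scale, out_zero_point);
}

// And the inverse, dequantizing INT8 back to F32.
inline float dequantize_s8_to_f32(int8_t q, float scale, int zero_point) {
  return (static_cast<int>(q) - zero_point) * scale;
}
```

A 4-bit writeback would additionally need a packing step like the int4 sketch earlier in this thread.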

Question 3: As we have direct classes for u8s8s32 and s8s8s32, do we have any similar class for INT4?