
Help Wanted: Accelerating Scikit-Learn Algorithms with Arm Compute Library (ACL) and OpenBLAS


I am interested in optimizing the performance of scikit-learn algorithms on Arm-based processors by leveraging the capabilities of both Arm Compute Library (ACL) and OpenBLAS. Scikit-learn is a widely used machine learning library in Python, and I believe that integrating ACL and OpenBLAS could lead to significant performance gains.

  1. How can I enable and utilize the hardware acceleration capabilities of ACL and the optimized CPU operations of OpenBLAS in my scikit-learn code?
  2. Are there any specific optimizations or custom implementations I should consider to effectively combine ACL and OpenBLAS with scikit-learn for maximum performance?
  3. Can I directly integrate both ACL and OpenBLAS into scikit-learn, or would I need to use alternative libraries or custom extensions?
  4. Are there any known use cases or best practices for harnessing the benefits of both ACL and OpenBLAS within the context of scikit-learn?

Any guidance or insights on how to efficiently utilize both Arm Compute Library and OpenBLAS to enhance the performance of scikit-learn on Arm-based hardware would be highly appreciated. Thank you for your assistance!

Hi @shashank-fujitsu

Are you using scikit-learn on aarch64? Could you please tell us more about your use case?

There is little C/C++ code in scikit-learn, and it looks like most of the computation is done by numpy, which provides good performance on aarch64. See https://scikit-learn.org/dev/developers/performance.html#python-cython-or-c-c
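For reference, you can check which BLAS backend your numpy build dispatches to directly from Python:

```python
# Minimal sketch: inspect numpy's build configuration to see which BLAS
# backend (e.g. OpenBLAS) the heavy computation is ultimately dispatched to.
import numpy as np

np.show_config()  # prints the BLAS/LAPACK libraries numpy was built against
```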

ACL implements operators that can be used to accelerate other frameworks such as TensorFlow Lite. For more information about this see https://github.com/ARM-software/armnn#software-overview

I don't think ACL can do much for scikit-learn: the algorithms implemented by the two libraries are different, and, as mentioned before, most of the computation in scikit-learn is done by numpy and by other libraries such as liblinear and libsvm.

Quote from https://scikit-learn.org/dev/faq.html#why-is-there-no-support-for-deep-or-reinforcement-learning-will-there-be-support-for-deep-or-reinforcement-learning-in-scikit-learn

Note that scikit-learn currently implements a simple multilayer perceptron in sklearn.neural_network. We will only accept bug fixes for this module. If you want to implement more complex deep learning models, please turn to popular deep learning frameworks such as tensorflow, keras and pytorch.

If you have any additional questions about scikit-learn, you will get the best answers at https://github.com/scikit-learn/scikit-learn/issues

Hope this helps.

Hi @morgolock,

Thanks for the reply.

Are you using scikit-learn on aarch64? Could you please tell us more about your use case?

Yes, I am using scikit-learn on aarch64.
My use case is to speed up scikit-learn algorithms on the aarch64 architecture.

There is little C/C++ code in scikit-learn, and it looks like most of the computation is done by numpy, which provides good performance on aarch64. See https://scikit-learn.org/dev/developers/performance.html#python-cython-or-c-c

From my understanding, scikit-learn depends mostly on NumPy and SciPy. I also see that numpy uses OpenBLAS as its math library. I wanted to know if ArmCL can also be used there.

ACL implements operators that can be used to accelerate other frameworks such as TensorFlow Lite. For more information about this see https://github.com/ARM-software/armnn#software-overview

Can we use ArmNN with NumPy or scikit-learn?

I don't think ACL can do much for scikit-learn: the algorithms implemented by the two libraries are different, and, as mentioned before, most of the computation in scikit-learn is done by numpy and by other libraries such as liblinear and libsvm.

I see that numpy uses OpenBLAS's dgemm function for the KMeans algorithm. Is there a possibility of using ArmCL/ArmNN there?
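For reference, a minimal sketch of the kind of workload I mean (the sizes here are made up):

```python
# Sketch: KMeans on float64 data, where the distance computations end up
# in BLAS dgemm via numpy/scikit-learn.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).standard_normal((100_000, 64))  # float64 by default
KMeans(n_clusters=16, n_init=1, random_state=0).fit(X)
```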

Thanks!

Hi @shashank-fujitsu

My use case is to speed up scikit-learn algorithms on the aarch64 architecture.

Could you please tell us more details about the actual use case? What algorithm would you like to improve? What kind of speedup would solve your problem: 1.5x, 2x or faster?

From my understanding, scikit-learn depends mostly on NumPy and SciPy. I also see that numpy uses OpenBLAS as its math library. I wanted to know if ArmCL can also be used there.

OpenBLAS and ACL are libraries for different things; there is, however, some overlap, such as their corresponding implementations of GEMM. You could use ACL's GEMM in OpenBLAS, which would in turn speed up numpy. Is GEMM the bottleneck in your use case? Have you done any profiling?
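As a rough first check, you could time a plain float64 matrix product from Python; a minimal sketch, assuming numpy is linked against OpenBLAS:

```python
# Rough sketch: time a float64 matrix product, which numpy dispatches to
# the BLAS dgemm routine, to gauge GEMM throughput on this machine.
import time
import numpy as np

n = 4096
a = np.random.default_rng(0).standard_normal((n, n))
b = np.random.default_rng(1).standard_normal((n, n))

start = time.perf_counter()
c = a @ b
elapsed = time.perf_counter() - start
print(f"dgemm {n}x{n}: {elapsed:.3f} s (~{2 * n**3 / elapsed / 1e9:.1f} GFLOP/s)")
```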

Can we use ArmNN with NumPy or scikit-learn?

No. ArmNN is a higher-level ML library used for inference, and it relies on ACL for the actual computation. You cannot use ArmNN in scikit-learn.

I see that numpy uses OpenBLAS's dgemm function for the KMeans algorithm. Is there a possibility of using ArmCL/ArmNN there?

ArmNN does not implement a GEMM operator, so using this library is not an option.

ACL has a GEMM operator, but there is no support for double precision, so it won't be possible to use it to replace the existing dgemm implementation in OpenBLAS.

ACL has a single-precision implementation of GEMM, so it should be possible to use ACL's GEMM to replace the existing sgemm implementation in OpenBLAS.
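To illustrate, it is the dtype of the arrays that decides which BLAS routine numpy calls, so an ACL-backed sgemm would only be exercised by float32 data; a small sketch:

```python
# Sketch: numpy dispatches float64 products to dgemm and float32 products
# to sgemm, so an ACL-backed sgemm would only help float32 workloads.
import numpy as np

rng = np.random.default_rng(0)
a64 = rng.standard_normal((1024, 1024))  # float64
a32 = a64.astype(np.float32)             # float32

_ = a64 @ a64  # goes through dgemm (no ACL path, as discussed above)
_ = a32 @ a32  # goes through sgemm (where ACL's F32 GEMM could plug in)
```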

Hope this helps,

Hi @morgolock,

Thanks for your reply.

Could you please tell us more details about the actual use case? What algorithm would you like to improve? What kind of speedup would solve your problem: 1.5x, 2x or faster?

For the moment, I am referring to some of the best-performing scikit-learn accelerators, such as Intel's scikit-learn-intelex.
For the KMeans algorithm, I see at least a 10x boost compared to the stock version of scikit-learn.

My understanding is as follows:
intelex -> uses Intel MKL
stock scikit-learn -> uses OpenBLAS

I have also profiled the code, and I see that KMeans uses the dgemm function from OpenBLAS.
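For context, this is how intelex gets enabled in the comparison above (its documented patching API):

```python
# Sketch: enable scikit-learn-intelex by patching scikit-learn before
# importing the estimators.
from sklearnex import patch_sklearn
patch_sklearn()

# Imports after patching pick up the accelerated implementations.
from sklearn.cluster import KMeans
```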

OpenBLAS and ACL are libraries for different things; there is, however, some overlap, such as their corresponding implementations of GEMM. You could use ACL's GEMM in OpenBLAS, which would in turn speed up numpy. Is GEMM the bottleneck in your use case? Have you done any profiling?

For my use case, dgemm is the bottleneck.
Would you be able to share any references for using ACL's GEMM?

ACL has a GEMM operator, but there is no support for double precision, so it won't be possible to use it to replace the existing dgemm implementation in OpenBLAS.

Oh, okay. Are there any plans to implement dgemm in ACL?

ACL has a single-precision implementation of GEMM, so it should be possible to use ACL's GEMM to replace the existing sgemm implementation in OpenBLAS.

Yes! I was thinking of the same.

This information helps.

Thanks

Hi @shashank-fujitsu

Would you be able to share any references for using ACL's GEMM?

We have an example that demonstrates how to use the NEGEMM function; please see:
https://github.com/ARM-software/ComputeLibrary/blob/main/examples/neon_sgemm.cpp

Oh, okay. Are there any plans to implement dgemm in ACL?

No, support for the double data type is not on the library's roadmap.

Hope this helps.

Hi @shashank-fujitsu,
If you're looking for an AArch64-optimised dgemm implementation and the BLAS interface suits your needs, then you might also want to look at Arm Performance Libraries. It's a free download for Linux, and now macOS: https://developer.arm.com/downloads/-/arm-performance-libraries

Hi @morgolock,

Thanks for the answer, this helps!

Hi @nSircombe,

I have tried using ArmPL, but for scikit-learn algorithms the performance was better with OpenBLAS than with ArmPL.

I tested on AWS Graviton3, which supports 256-bit SVE.