Help Wanted: Accelerating Scikit-Learn Algorithms with Arm Compute Library (ACL) and OpenBLAS
Closed this issue · 8 comments
I am interested in optimizing the performance of scikit-learn algorithms on Arm-based processors by leveraging the capabilities of both Arm Compute Library (ACL) and OpenBLAS. Scikit-learn is a widely used machine learning library in Python, and I believe that integrating ACL and OpenBLAS could lead to significant performance gains.
- How can I enable and utilize the hardware acceleration capabilities of ACL and the optimized CPU operations of OpenBLAS in my scikit-learn code?
- Are there any specific optimizations or custom implementations I should consider to effectively combine ACL and OpenBLAS with scikit-learn for maximum performance?
- Can I directly integrate both ACL and OpenBLAS into scikit-learn, or would I need to use alternative libraries or custom extensions?
- Are there any known use cases or best practices for harnessing the benefits of both ACL and OpenBLAS within the context of scikit-learn?
Any guidance or insights on how to efficiently utilize both Arm Compute Library and OpenBLAS to enhance the performance of scikit-learn on Arm-based hardware would be highly appreciated. Thank you for your assistance!
Are you using scikit-learn on aarch64? Could you please tell us more about your use case?
There is little C/C++ code in scikit-learn and it looks like most of the computation is done by numpy, which provides good performance on aarch64. See https://scikit-learn.org/dev/developers/performance.html#python-cython-or-c-c
ACL implements operators that can be used to accelerate other frameworks like TensorFlow Lite. For more information about this see https://github.com/ARM-software/armnn#software-overview
I don't think ACL can do much for scikit-learn, as the algorithms implemented by the two libraries are different and, as mentioned before, most of the computation in scikit-learn is done by numpy and other libraries like liblinear and libsvm.
Note that scikit-learn currently implements a simple multilayer perceptron in sklearn.neural_network. We will only accept bug fixes for this module. If you want to implement more complex deep learning models, please turn to popular deep learning frameworks such as TensorFlow, Keras and PyTorch.
If you have any additional questions about scikit-learn you will get the best answers in: https://github.com/scikit-learn/scikit-learn/issues
Hope this helps.
Hi @morgolock,
Thanks for the reply.
Are you using scikit-learn on aarch64? Could you please tell us more about your use case?
Yes, I am using Scikit-learn on aarch64.
My use case is to speed up the scikit-learn algorithms for the aarch64 architecture.
There is little C/C++ code in scikit-learn and it looks like most of the computation is done by numpy, which provides good performance on aarch64. See https://scikit-learn.org/dev/developers/performance.html#python-cython-or-c-c
From my understanding, scikit-learn depends mostly on NumPy & SciPy. I also see that numpy uses OpenBLAS as the math library. I wanted to know if ArmCL can also be used there?
ACL implements operators that can be used to accelerate other frameworks like TensorFlow Lite. For more information about this see https://github.com/ARM-software/armnn#software-overview
Can we use ArmNN with NumPy / scikit-learn?
I don't think ACL can do much for scikit-learn, as the algorithms implemented by the two libraries are different and, as mentioned before, most of the computation in scikit-learn is done by numpy and other libraries like liblinear and libsvm.
I see that numpy uses OpenBLAS's dgemm function for the KMeans algorithm. Is there a possibility of using ArmCL/ArmNN there?
Thanks!
My use case is to speed up the scikit-learn algorithms for the aarch64 architecture.
Could you please tell us more details about the actual use case? What algorithm would you like to improve? What kind of speedup would solve your problem: 1.5x, 2x or faster?
From my understanding, scikit-learn depends mostly on NumPy & SciPy. I also see that numpy uses OpenBLAS as the math library. I wanted to know if ArmCL can also be used there?
OpenBLAS and ACL are libraries for different things; there is, however, some overlap, such as their corresponding implementations of GEMM. You could use ACL's GEMM in OpenBLAS, which would in turn speed up numpy. Is GEMM the bottleneck in your use case? Have you done some profiling?
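As a minimal sketch of that kind of profiling (matrix sizes here are illustrative; pick shapes that match your workload), one can time a double-precision matrix multiply in numpy, which dispatches to whatever BLAS dgemm numpy was built with:

```python
import time

import numpy as np

n = 1024
a = np.random.rand(n, n)  # float64 by default
b = np.random.rand(n, n)

t0 = time.perf_counter()
c = a @ b  # dispatched to the BLAS dgemm numpy links against
elapsed = time.perf_counter() - t0

# A square GEMM performs ~2*n^3 floating-point operations.
gflops = 2 * n**3 / elapsed / 1e9
print(f"dgemm {n}x{n}: {elapsed:.4f} s, {gflops:.1f} GFLOP/s")
```

Comparing the GFLOP/s figure against the time spent in the full algorithm gives a rough idea of whether GEMM is the bottleneck.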
Can we use ArmNN with NumPy / scikit-learn?
No, ArmNN is a higher level ML library used for inference and relies on ACL for the actual computation. You cannot use ArmNN in scikit-learn.
I see that numpy uses openblas's dgemm function for kmeans algorithm. Is there a possibility of using ArmCL/ArmNN there?
ArmNN does not implement the operator GEMM so using this library is not an option.
ACL has the operator GEMM, but there is no support for double, so it won't be possible to use it to replace the existing dgemm implementation in OpenBLAS.
ACL has a single-precision implementation of GEMM, so it should be possible to use ACL's GEMM to replace the existing implementation of sgemm in OpenBLAS.
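If single precision is acceptable for your workload, a quick way to gauge what an sgemm-only backend would mean numerically is to run the same multiply in float32 (numpy dispatches float32 matmul to the BLAS sgemm routine) and compare against float64. A hedged sketch, with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
a64 = rng.random((512, 512))      # float64 -> routed to dgemm
a32 = a64.astype(np.float32)      # float32 -> routed to sgemm

c64 = a64 @ a64
c32 = a32 @ a32

# Single precision carries more rounding error; check it is acceptable
# for your algorithm before swapping in an sgemm-only path.
max_rel_err = np.max(np.abs(c32 - c64) / np.abs(c64))
print(f"max relative error float32 vs float64: {max_rel_err:.2e}")
```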
Hope this helps,
Hi @morgolock,
Thanks for your reply.
Could you please tell us more details about the actual use case? What algorithm would you like to improve? What kind of speedup would solve your problem: 1.5x, 2x or faster?
For the moment, I am referring to some of the best-performing scikit-learn packages like intelex. For the KMeans algorithm, I see at least a 10x boost compared to the stock version of scikit-learn.
My understanding is as follows:
intelex -> uses Intel MKL
stock scikit-learn -> uses OpenBLAS
I have also profiled the code, and I see that KMeans uses the DGEMM function from OpenBLAS.
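One way to confirm which BLAS backend numpy resolved at build time (and hence which dgemm KMeans ends up calling through numpy) is numpy's build-info dump:

```python
import numpy as np

# Prints the BLAS/LAPACK libraries numpy was built against
# (e.g. OpenBLAS on most aarch64 wheels).
np.show_config()
```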
OpenBLAS and ACL are libraries for different things; there is, however, some overlap, such as their corresponding implementations of GEMM. You could use ACL's GEMM in OpenBLAS, which would in turn speed up numpy. Is GEMM the bottleneck in your use case? Have you done some profiling?
For my use case, DGEMM is the bottleneck.
Would you be able to share any references (if any) for using ACL's GEMM?
ACL has the operator GEMM, but there is no support for double, so it won't be possible to use it to replace the existing dgemm implementation in OpenBLAS.
Oh! Okay. Are there any plans of implementing dgemm in ACL?
ACL has a single-precision implementation of GEMM, so it should be possible to use ACL's GEMM to replace the existing implementation of sgemm in OpenBLAS.
Yes! I was thinking of the same.
This information helps.
Thanks
Would you be able to share any references (if any) for using ACL's GEMM?
We have an example that demonstrates how to use the function NEGEMM, please see:
https://github.com/ARM-software/ComputeLibrary/blob/main/examples/neon_sgemm.cpp
Oh! okay. Are there any plans of implementing dgemm in ACL?
No, support for the data type double is not in the library roadmap.
Hope this helps.
Hi @shashank-fujitsu,
If you're looking for an AArch64-optimised DGEMM implementation, and the BLAS interface suits your needs, then you might also want to look at Arm Performance Libraries. It's a free download for Linux, and now macOS - https://developer.arm.com/downloads/-/arm-performance-libraries
Hi @morgolock,
Thanks for the answer, this helps!
Hi @nSircombe,
I have tried using ArmPL, but the performance of the scikit-learn algorithms was better with OpenBLAS than with ArmPL.
I have tested on Graviton3 from AWS, which supports 256-bit SVE.