Improve build speed for stdlib
Closed this issue · 2 comments
Ever since adding BLAS/LAPACK to stdlib the build times have exploded.
I have observed in my local machine (Windows + gfortran + CMake + Ninja with 16 threads) that the build takes 199 seconds. Using this tool https://ui.perfetto.dev/ as suggested by @ivan-pi it is possible to load the .ninja_log file and observe where are the bottle necks:
In the figure we observe the build chocking up at the lapack per kind files.
I started a major refactoring of stdlib's BLAS/LAPACK here with help from @ivan-pi and @perazz : https://github.com/jalvesz/stdlib/tree/split_lapack using a python script to rewrite the blas/lapack-per-kind files into submodule files per groups.
Doing that I managed to get down the build time to 38 seconds (x5 faster):
There are still some spots in which the build stalls but it is quite evident that by splitting into several submodule files enables better parallel build.
Before opening a PR, I would like to gather some broad opinions on this refactoring.
Some observations/open questions:
-
In order to achieve this splitting, the
refactor_blaslapack_subm.py
script contains two dictionariesblas_groups
andlapack_groups
. They define the procedures to be included in each submodule file. This splitting started by following the lapack reference manual, then I had to arbitrarily split further. I'm sure that with a proper dependency graph it could be possible to achieve a better grouping. If anyone feels like proposing a mechanism to find such grouping that would be very nice. Here a couple of references I've found interesting about such topic: https://paulcavallaro.com/blog/optimizing-function-placement-with-call-chain-clustering/ & https://medium.com/@avidaneran/using-a-graph-representation-to-analyze-python-dependencies-a57cd681fa09 -
In the draft branch, the per-kind files were moved to a
legacy
folder. When opening the PR I think these should not be there as there is no point in having so many duplicates of the sources. Nonetheless, the splitting script can be improved and it is based on those files. Should these files be saved in a separate repo for reference? -
The submodules approach is proving very neat to organize and increase build performance. Now, this implies a very large header module file in order to have all interfaces for lapack functions. A better work on (1) could probably lead to finding clearer macro dependencies which could enable splitting also this header module.
-
While performing this refactoring I noticed that there are some procedures which I don't understand why should they be kept such as
lamc3
, which simply doesc = a + b
... is there a reason to actually keep such a function?
These are very illuminating findings. When it comes to stdlib, I think we need to keep in mind that:
- (public) modules should remain stable if possible, and only expose as much as needed
- sub-modules are essentially just implementation details
Currently, the modules provided are stdlib_blas
and stdlib_lapack
. Since BLAS is small, I think it is okay to provide all procedures. The Reference LAPACK on the other hand has about ~2000 procedures. If possible, I would only make public the driver and computational routines. Auxiliary routines routines would remain an implementation detail, or can be made public upon request.
I think it is okay to have stdlib_lapack
export all drivers and computational routine interfaces. To address problem 3, we can always split into smaller modules later, in a backward compatible way.
! current way
module stdlib_lapack
interface
! ... all the drivers and computational procedures ...
! ... but not auxiliary routines ...
end interface
end module
In the future this can be split by topic, e.g.
module stdlib_lapack_lu
interface
! ... LU-related procedures ...
end interface
end module
module stdlib_lapack
use stdlib_lapack_lu
! ...
interface
! ... procedures which fit no where else
end interface
end module
I think the auxiliary routines should remain private, because they are not described in the LAPACK user's guide. We do not want to expose more public API surface than necessary.
Concerning remark 4 in your list, I think sometimes such procedure are introduced to counteract unwanted compiler optimizations, say by hiding an expression behind an external procedure, to make sure it doesn't get optimized away.