Improve build speed for stdlib

Question

Improve build speed for stdlib

Closed this issue 2 months ago · 2 comments

Ever since adding BLAS/LAPACK to stdlib the build times have exploded.

I have observed in my local machine (Windows + gfortran + CMake + Ninja with 16 threads) that the build takes 199 seconds. Using this tool https://ui.perfetto.dev/ as suggested by @ivan-pi it is possible to load the .ninja_log file and observe where are the bottle necks:

In the figure we observe the build chocking up at the lapack per kind files.

I started a major refactoring of stdlib's BLAS/LAPACK here with help from @ivan-pi and @perazz : https://github.com/jalvesz/stdlib/tree/split_lapack using a python script to rewrite the blas/lapack-per-kind files into submodule files per groups.

Doing that I managed to get down the build time to 38 seconds (x5 faster):

There are still some spots in which the build stalls but it is quite evident that by splitting into several submodule files enables better parallel build.

Before opening a PR, I would like to gather some broad opinions on this refactoring.

Some observations/open questions:

In order to achieve this splitting, the refactor_blaslapack_subm.py script contains two dictionaries blas_groups and lapack_groups. They define the procedures to be included in each submodule file. This splitting started by following the lapack reference manual, then I had to arbitrarily split further. I'm sure that with a proper dependency graph it could be possible to achieve a better grouping. If anyone feels like proposing a mechanism to find such grouping that would be very nice. Here a couple of references I've found interesting about such topic: https://paulcavallaro.com/blog/optimizing-function-placement-with-call-chain-clustering/ & https://medium.com/@avidaneran/using-a-graph-representation-to-analyze-python-dependencies-a57cd681fa09
In the draft branch, the per-kind files were moved to a legacy folder. When opening the PR I think these should not be there as there is no point in having so many duplicates of the sources. Nonetheless, the splitting script can be improved and it is based on those files. Should these files be saved in a separate repo for reference?
The submodules approach is proving very neat to organize and increase build performance. Now, this implies a very large header module file in order to have all interfaces for lapack functions. A better work on (1) could probably lead to finding clearer macro dependencies which could enable splitting also this header module.
While performing this refactoring I noticed that there are some procedures which I don't understand why should they be kept such as lamc3, which simply does c = a + b ... is there a reason to actually keep such a function?

cc: @perazz @jvdp1 @fortran-lang/stdlib

Answer 1 · 2024-12-14T08:16:44.000Z

Thank you @jalvesz @ivan-pi, this would be a nice qol upgrade for developers that build the library often (such as myself lol). I support this feature. I believe the 64-bit version should be also taken into account, as on-going at #888.

Answer 2 · 2024-12-26T11:18:11.000Z

These are very illuminating findings. When it comes to stdlib, I think we need to keep in mind that:

(public) modules should remain stable if possible, and only expose as much as needed
sub-modules are essentially just implementation details

Currently, the modules provided are stdlib_blas and stdlib_lapack. Since BLAS is small, I think it is okay to provide all procedures. The Reference LAPACK on the other hand has about ~2000 procedures. If possible, I would only make public the driver and computational routines. Auxiliary routines routines would remain an implementation detail, or can be made public upon request.

I think it is okay to have stdlib_lapack export all drivers and computational routine interfaces. To address problem 3, we can always split into smaller modules later, in a backward compatible way.

! current way
module stdlib_lapack
   interface
      ! ... all the drivers and computational procedures ...
      ! ... but not auxiliary routines ...
   end interface
end module

In the future this can be split by topic, e.g.

module stdlib_lapack_lu
   interface
      ! ... LU-related procedures ...
   end interface
end module

module stdlib_lapack
   use stdlib_lapack_lu
   ! ...

   interface
     ! ... procedures which fit no where else
   end interface
end module

I think the auxiliary routines should remain private, because they are not described in the LAPACK user's guide. We do not want to expose more public API surface than necessary.

Concerning remark 4 in your list, I think sometimes such procedure are introduced to counteract unwanted compiler optimizations, say by hiding an expression behind an external procedure, to make sure it doesn't get optimized away.