CVODE LAPACK support not functional but CI passes
kilicomu opened this issue · 17 comments
Hi guys.
I've been down a bit of a rabbit hole with the CVODE installation over the past couple of days, and I think there is something awry with the way in which CVODE is being installed via install_cvode.sh.
The CI action for installing CVODE on both Ubuntu and MacOS is getting past looking for LAPACK libraries:
-- Looking for LAPACK libraries... OK
(https://github.com/AtChem/AtChem2/runs/3231872288#step:5:321)
but the LAPACK functionality test is failing:
WARNING: LAPACK not functional.
-- Checking if Lapack works... FAILED
Blas/Lapack support will not be provided.
(https://github.com/AtChem/AtChem2/runs/3231872288#step:5:328)
I think it's failing firstly because of the : character in the LAPACK_LIBS variable:
AtChem2/tools/install/install_cvode.sh, lines 24 to 28 in dd25104
which you can confirm by going into the sundials-2.7.0/build/LapackTest directory and running make, e.g.:
...
CMakeFiles/ltest.dir/build.make:98: *** target pattern contains no `%'. Stop.
make[2]: Leaving directory `/mnt/lustre/users/klcm500/AtChem2/BUILD_ISSUES/dependencies_lapack/sundials-2.7.0/build/LapackTest'
make[1]: *** [CMakeFiles/ltest.dir/all] Error 2
make[1]: Leaving directory `/mnt/lustre/users/klcm500/AtChem2/BUILD_ISSUES/dependencies_lapack/sundials-2.7.0/build/LapackTest'
make: *** [all] Error 2
What happens next, I think, depends entirely on which LAPACK / BLAS libraries are installed on your system. When I get the LAPACK test to pass for CVODE by pointing it directly at an existing libopenblas.so, I end up with undefined references to some BLAS symbols in the CVODE libraries, which propagate through to the build of AtChem2.
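If anyone wants to check their own build, a quick way to see whether the installed CVODE libraries reference BLAS/LAPACK routines that nothing provides is to list their undefined symbols. The library path and the symbol names below are just examples - adjust them to wherever your CVODE was installed:

```bash
# Example only: list undefined symbols in the installed CVODE shared library
# and look for common BLAS/LAPACK routines (dgemm, dgetrf, dgbtrf, ...).
# Replace the path with your own CVODE install location.
nm -D --undefined-only "$HOME/cvode/lib/libsundials_cvode.so" \
  | grep -iE 'dgemm|dgetrf|dgbtrf'
```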
Long story short:
- I don't think LAPACK support is working as described in the manual, or as the CI attempts to enable it
- The model seems to run just fine without it, so do we want to just disable LAPACK support in the CVODE installation with -DENABLE_LAPACK:BOOL=OFF? (a rough sketch of the configure change is at the end of this comment)
From what I can tell on our system, it works just fine without LAPACK support!
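For reference, the explicit opt-out would be a one-flag change to the cmake call in install_cvode.sh, along the lines of the sketch below. It's only a sketch: if I remember right, SUNDIALS 2.7.0 spells the option LAPACK_ENABLE while newer releases use ENABLE_LAPACK, so the exact name should be checked against the version being configured.

```bash
# Sketch only: explicitly switch off the LAPACK interface when configuring
# SUNDIALS/CVODE. The option name varies by release (LAPACK_ENABLE in 2.x,
# ENABLE_LAPACK in newer versions).
cmake .. -DLAPACK_ENABLE:BOOL=OFF
```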
Hi @kilicomu, if it's not working (and I don't think it's ever been working, tbh), then I agree that we should explicitly disable it for clarity.
It would be good to use LAPACK/BLAS if possible for performance (I'd be interested to see what it does in that regard). What is the issue with the : in the LAPACK_LIBS variable: is CMake expecting something else to define a list of paths? My CMake knowledge is extremely tenuous. I'm just wondering how easy it would be to correct the linkage rather than just disabling it.
The issue with the unescaped : character is that the LAPACK_LIBS string is ending up as a make target:
ltest: /usr/lib/libblas.so:/usr/lib/liblapack.so
in CMakeFiles/ltest.dir/build.make of the LapackTest directory in the SUNDIALS build. If that is escaped and the libs are correct (I've only tried it by changing it to the path of an OpenBLAS installation on my machine, which provides both BLAS and LAPACK), then you can at least link against the BLAS and LAPACK libs. I still end up with some undefined symbols in my library, but that could well be a library issue on my machine.
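To make that concrete: rather than literally escaping the colon, one way to do it is to hand CMake the semicolon-separated list it expects. The sketch below assumes (and this is my assumption about the script, not something I've confirmed) that install_cvode.sh forwards LAPACK_LIBS into SUNDIALS's LAPACK_LIBRARIES cache variable; the value needs quoting so the shell leaves the ; alone, and the paths are placeholders for wherever BLAS/LAPACK live on your system.

```bash
# Sketch only: pass the LAPACK/BLAS libraries to SUNDIALS as a proper CMake
# list (semicolon-separated, quoted). Paths are placeholders.
LAPACK_LIBS="/usr/lib/x86_64-linux-gnu/liblapack.so;/usr/lib/x86_64-linux-gnu/libblas.so"

cmake .. \
  -DLAPACK_ENABLE:BOOL=ON \
  -DLAPACK_LIBRARIES:STRING="$LAPACK_LIBS"
```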
Thanks for looking into this @kilicomu. Regarding the question of whether it is better to disable or to fix it, I'd say that it depends on whether using LAPACK influences the accuracy of the results, the numerical stability, and/or the performance of the model.
Is BLAS/LAPACK functionality somehow related to openlibm? I.e., we implemented openlibm to improve reproducibility and accuracy; was that perhaps necessary because we didn't have LAPACK?
@kilicomu OK - would you say you'd be able to fix that easily, or not? I'm not one for CMake fixes without a lot of effort!
@rs028 openlibm was introduced to try to rectify the numerical differences we saw on Linux vs Mac. That was pretty successful in that regard, and probably isn't related to LAPACK per se.
It could be related to #436, because I would hope that for the larger system sizes LAPACK would begin to come into its own in terms of solution speed vs the builtin solvers (that's the hope at least).
@rs028 I don't think it has worked for some time, so I would be very surprised if explicitly disabling the LAPACK support in CVODE (rather than letting it silently disable itself) would affect the numerical quality of the results. It certainly won't affect the performance, as it is running without LAPACK support in this configuration either way!
If you have numerical / performance benchmarks that you already run to evaluate changes to the project, I'm happy to run those.
@spco In the case of the CI - I think that just escaping the colon in the LAPACK_LIBS string will resolve the LapackTest build failure. Assuming the BLAS and LAPACK libs are actually where that string says they are, the LAPACK test should then complete successfully. I'm interested to see what happens after that in CI! I can trial this change.
I'll also have a look into whether or not LAPACK support will help with #436, but, without knowing exactly what the cause of slowdown with large mechanisms is, it's difficult to predict what will happen. Let me play around a bit and get back to you on all of the above.
I actually meant the opposite :) i.e., whether implementing LAPACK could improve on the accuracy/performance.
We don't really have performance benchmarks (but we should, and we have discussed profiling to identify the bottlenecks). For the closest we have to numerical benchmarks, see the discussion on #265 and #340.
Regarding #384 and #436, they are poorly characterized, meaning that we don't know exactly what the causes of the problems are (or, in the case of #384, what the problem really is).
Another, perhaps related, thing to note is that the upgrade of CVODE is still pending (see #286).
@kilicomu I'll be very interested to see the outcome! As @rs028 says, we did plan to upgrade sundials version at some point, but didn't ever get beyond an abandoned branch - largely because I didn't have time to investigate whether the numerical changes we saw were negligible or needed further understanding. It all comes back to some of the wider issues we've had verifying the numerical accuracy beyond the simple steady-state/known solution testcases.
Okay, I've been playing around with this a bit further. Can I ask what is probably a stupid question? Are we actually using LAPACK features of the CVODE library? I've ended up having a look through the CVODE manual and the points in the AtChem2 code (particularly atchem2.f90) where we could choose to use LAPACK implementations - it seems, at a glance, that we are specifically choosing not to use LAPACK features of CVODE.
I'm probably just misinterpreting the code! An example of what I'm describing is at atchem2.f90, line 395 in dd25104: we could use FCVLAPACKDENSE there, but we go with the CVODE internal implementation.
I am going out on a limb here (@spco can confirm if I am wrong) but this is probably something we inherited from the original code. Maybe the original developers ran into the same kind of issue you have been battling.
I think the question now is do we want to use LAPACK/BLAS or not? I would say yes if it improves the model performance somehow (i.e. speed or precision or something else). If not, might as well drop it entirely (one less dependency to worry about).
I don't know enough about this to make the call, so I am happy to follow your advice.
@rs028 A quick look at the blame for atchem2.f90 seems to confirm your suspicion! The solver code was added in the initial commit and further commits look to have reformatted it a few times.
I think that LAPACK features of the solver library are definitely worth investigating, but there is other work to be done before it's worthwhile. My advice is:
- explicitly disable LAPACK in the CVODE build script
- remove the LAPACK setup references from the documentation
- establish some baseline numerical tests that help to inform whether or not a change to the model is detrimental to its correctness (I guess you already have these, I just haven't properly looked through the test suite!)
- migrate from SUNDIALS 2.7.0 to latest SUNDIALS (5.7.0 - this looks like a big piece of work)
- look into LAPACK features as part of the migration to SUNDIALS 5.7.0 (a rough configure sketch is at the end of this comment)
In the average case, I think the model is running just fine with the internal CVODE solvers. There may be some edge cases, e.g. #436, where LAPACK features could help, but I'm really just speculating there.
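On point 5, for when we get there: configuring a newer SUNDIALS with LAPACK enabled would look roughly like the sketch below. I haven't verified this against 5.7.0, so treat everything in it as an assumption to check against the SUNDIALS installation guide - the option names changed at some point (ENABLE_LAPACK rather than LAPACK_ENABLE), the LAPACK interface needs 32-bit indices, and the install prefix and library paths are placeholders.

```bash
# Sketch only, not verified against 5.7.0: check the option names and
# requirements in the SUNDIALS installation guide for the version used.
cmake ../sundials-5.7.0 \
  -DCMAKE_INSTALL_PREFIX="$HOME/sundials" \
  -DENABLE_LAPACK:BOOL=ON \
  -DSUNDIALS_INDEX_SIZE=32 \
  -DLAPACK_LIBRARIES="/usr/lib/x86_64-linux-gnu/liblapack.so;/usr/lib/x86_64-linux-gnu/libblas.so"
```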
@kilicomu makes sense to me. If you can take care of point 1, I can do point 2 (I have to do other changes to the manual anyway).
For point 3, yes, we have two or three basic numerical tests, but they will certainly need to be expanded/improved - we have been thinking about a major overhaul of the whole testing system for a while now, and there are several issues open along those lines. Eventually we'll get to that!
For point 4 (and 5), as Sam said there is some initial work in PR #286, but it is probably better to extend the test suite (i.e., work on point 3) first. @spco would you agree?
One more thing: am I wrong in concluding that BLAS would also not be needed if we disable LAPACK in the CVODE build?
Happy to sort out point 1 in the near future and to be involved in 3 - 5, it's just a matter of finding time!
I don't think you are wrong; BLAS shouldn't be needed.
Cool. FYI, the open issues are also organized, with additional notes, in the Roadmap (https://github.com/AtChem/AtChem2/projects/1).
Hi folks, and thanks @kilicomu for picking this up. You're probably bang on about the inherited code, and I agree with your plan of action - 1 and 2 should happen.
On 3, we have a variety of tests we run. test/unit_tests do what they say on the tin (limited code coverage), while test/tests are end-to-end tests. From memory, static, firstorder and secondorder are known solutions (steady-state reaction, linear reaction, etc. - see the top of secondorder.fac for the exact solution). The rest are attempting to increase the code coverage with things like choice of environment variables, photolysis rates, etc. So I'm pretty confident in the known-solution cases, but the rest are probably based on regression testing (has the value changed since we last ran?). Combined with the unit tests, we have some confidence, but it's by no means comprehensive!
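For anyone new to those testcases, "known solution" just means the mechanism is simple enough to integrate by hand, so the model output can be compared against a closed-form answer rather than a stored reference run. Purely as an illustration (the actual mechanisms and rate constants are defined in the .fac files, so see secondorder.fac for the exact form we use), standard first- and second-order decay give:

```math
\frac{dc}{dt} = -k\,c \;\Rightarrow\; c(t) = c(0)\,e^{-kt},
\qquad
\frac{dc}{dt} = -k\,c^{2} \;\Rightarrow\; c(t) = \frac{c(0)}{1 + k\,c(0)\,t}
```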
Then 4 and 5 at whatever point we think we have enough testing. But I would suggest we could try upgrading, and then work backwards if we see any major differences, at least as an initial shot.
I guess this can be closed now? I made notes of the discussion in the project page.