xgcm/aerobulk-python

Better error handling

jbusecke opened this issue ยท 13 comments

During development on #3 I noticed that errors in the fortran code seem to lead in an immediate crash of the kernel. We should investigate how to catch and display these (I was not able to see the relevant message in a notebook, but did in an ipython session).

To reproduce I entered this (wildly wrong units) into ipython

import mod_aerobulk_wrap as aero
import numpy as np

ni=3
nj=6
calgo='ncar'
zt=2
zu=5
sst = np.full([ni, nj], 30.0)
t_zt = np.full([ni, nj], 20.0)
hum_zt = np.full([ni, nj], 20.0) # watch units here...
u_zu=np.full([ni, nj], 2.0)
v_zu=np.full([ni, nj], 0.0)
slp =np.full([ni, nj], 1000.0)
niter=1

ql,qh,tau_x,tau_y,evap = aero.mod_aerobulk_wrapper.aerobulk_model_noskin(calgo,zt,zu,sst,t_zt,hum_zt,u_zu,v_zu,slp,niter)

Which gives:

===================================================================
                    ----- AeroBulk_init -----

     *** Bulk parameterization to be used => "ncar"
     *** Cool-skin & Warm-layer schemes will NOT be used!
     *** Computational domain shape: Ni x Nj = 00003 x 00006
     *** Number of time records that will be treated:           1
     *** Number of iterations in bulk algos: nb_iter  =    1
     *** Filling the `mask` array...
 *** E R R O R :
 the whole domain is masked!
 check unit consistency of input fields

Might be relevant to this : https://numpy.org/doc/stable/f2py/signature-file.html#f2py-statements

Also a strong argument to have a pint unit check in this ๐Ÿ˜น @TomNicholas

I just read this from https://numpy.org/doc/stable/f2py/signature-file.html#f2py-statements

callstatement <C-expr|multi-line block>
Replaces the F2PY generated call statement to Fortran/C function with <C-expr|multi-line block>. The wrapped Fortran/C function is available as (*f2py_func).

To raise an exception, set f2py_success = 0 in <C-expr|multi-line block>.

It's pretty cryptic, but basically I think we need to add this to the signature file.

The way aerobulk handles errors is to call the ctl_stop function, defined here: https://github.com/brodeau/aerobulk/blob/03b40b2d48cc5ad4dc41b256b00c54a4d40d124e/src/mod_const.f90#L238-L278

This code calls the fortran STOP command. So the general problem we need to solve is how to pass a fortran STOP command up to python as an exception.

This SO issue is relevant: https://stackoverflow.com/questions/18021081/stop-python-code-in-fortran-module-error-using-f2py

I would start by writing a test for the expected exception.

This page is also very helpful https://github.com/pearu/f2py/wiki/FAQ2ed. They suggest we will need to modify the fortran source code ๐Ÿ˜’

This page is also very helpful https://github.com/pearu/f2py/wiki/FAQ2ed. They suggest we will need to modify the fortran source code

But what is solution 2, haha.

I am going to try my best to summarize the short hack session with @rabernat and @TomNicholas today:

The ultimate goal was to somehow enable us to call a python function from fortran (callback function) which then raises an error without a crashing python kernel.

We mostly followed these instructions.

We added this codeblock to mod_aerobulk_wrap.f90 (see also #11):

subroutine foo2(a)
    integer a
    if (a.gt.10) then
      print*, "Fortran: a>10 is bad, stopping here."
      call f2pystop(10) ! f2pystop must raise an exception that will trigger a long jump to the corresponding wrapper function
      stop (10)
    end if
    print*,"Fortran foo: a<=10 is good, continue."
  end subroutine foo2

and the following code to mod_aerobulk_wrap.pyf:

            subroutine foo2(a) ! in :m2:foo.f90
                integer :: a
                intent (callback) f2pystop
                external f2pystop
                integer status
                call f2pystop(status)
            end subroutine foo2

recompiled the code with pip install -e . and then ran:

from aerobulk.aerobulk.mod_aerobulk_wrap  import mod_aerobulk_wrapper as w

class F2PYSTOP(Exception):
    def __call__(self, status):
        raise self.__class__(status)

w.foo2(11, F2PYSTOP())

Contrary to the example this resulted in a rather gnarly error:

 Fortran: a>10 is bad, stopping here.
capi_return is NULL
Call-back cb_f2pystop_in_foo2__user__routines failed.
Fatal Python error: F2PySwapThreadLocalCallbackPtr: F2PySwapThreadLocalCallbackPtr: PyLong_AsVoidPtr failed
Python runtime state: initialized
Traceback (most recent call last):
  File "<ipython-input-1-8ef22c6715ac>", line 3, in __call__
__main__.F2PYSTOP: 10

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, aerobulk.aerobulk.mod_aerobulk_wrap (total: 14)
[1]    15315 abort      ipython

We experimented a bit further and changing

stop (10)

to

return (10)

in the first code block enabled us to run this without crashing the python kernel. @rabernat was hopeful that we might be able to do something with the return values, but we need to continue working on this.

I found this, which I believe is the prequel to our source used above.

  1. I don't think that you need to deal with longjmp in Fortran. You can replace
    Fortran STOP statement with a call to C function that will deal with the
    longjmp stuff.

It seems like some more brute force approaches (e.g. replacing STOP statements in all fortran files) as part of the build could be considered if we find a way to use return?

Notes from our hack on 2022/5/4:

  • We managed to both make the call to the main function threadsafe and raise an error without python crashing by injecting c code like this to the signature file:
!    -*- f90 -*-
! Note: the context of this file is case sensitive.

python module RANGE_CHECK ! in 
    interface  ! in :RANGE_CHECK
        module range_mod ! in :RANGE_CHECK:range_check.f90
            subroutine range_check(a) ! in :RANGE_CHECK:range_check.f90:range_mod
                integer :: a
                !intent (callback) f2pystop
                !external f2pystop
                !call f2pystop(a)
                !threadsafe
                callstatement {{ &
                    Py_BEGIN_ALLOW_THREADS &
                        (*f2py_func)(&a); &
                    Py_END_ALLOW_THREADS &
                    PyErr_SetString(PyExc_ValueError, "Oooops"); &
                }}
            end subroutine range_check
        end module range_mod
    end interface 
end python module RANGE_CHECK

! This file was auto-generated with f2py (version:1.22.3).
! See:
! https://web.archive.org/web/20140822061353/http://cens.ioc.ee/projects/f2py2e

the corresponding fortran file is :

MODULE RANGE_MOD
  IMPLICIT NONE
  PUBLIC :: RANGE_CHECK

CONTAINS
  subroutine RANGE_CHECK(a)
    INTEGER :: a
    if (a.gt.10) then
      print*, "Fortran: a>10 is bad, stopping here NEW2."
      !call f2pystop(a) ! f2pystop must raise an exception that will trigger a long jump to the corresponding wrapper function
      error_raised = 1
      RETURN
    end if
    print*,"Fortran foo: a<=10 is good, continue."
  end subroutine RANGE_CHECK

  subroutine ctl_stop( cd1 )
    CHARACTER(len=*), INTENT(in) :: cd1


END MODULE RANGE_MOD

This will raise an error like such:

pytest -s test.py
==================================================================== test session starts =====================================================================
platform darwin -- Python 3.10.4, pytest-7.1.2, pluggy-1.0.0
rootdir: /Users/juliusbusecke/Code/fortran_error_test
collected 2 items

test.py  Fortran: a>10 is bad, stopping here NEW2.
F Fortran foo: a<=10 is good, continue.
F

========================================================================== FAILURES ==========================================================================
______________________________________________________________________ test_range_error ______________________________________________________________________

    def test_range_error():
        # range_mod(12, raise_error)
>       range_mod(12)
E       ValueError: Oooops

test.py:15: ValueError
______________________________________________________________________ test_range_good _______________________________________________________________________

    def test_range_good():
        # range_mod(12, raise_error)
>       range_mod(2)
E       ValueError: Oooops

test.py:21: ValueError
================================================================== short test summary info ===================================================================
FAILED test.py::test_range_error - ValueError: Oooops
FAILED test.py::test_range_good - ValueError: Oooops
===================================================================== 2 failed in 0.36s ======================================================================
(fortran_error_test)

for a test script:

import pytest
from RANGE_CHECK import range_mod

errormsg = 0


def raise_error(msg):
    global errormsg
    print(f"raising error message {msg}")
    errormsg = msg


def test_range_error():
    # range_mod(12, raise_error)
    range_mod(12)
    assert errormsg == 12


def test_range_good():
    # range_mod(12, raise_error)
    range_mod(2)
    assert 1 == 1

What we need to do next:

  • Figure out a way to raise this error only conditionally (if the fortran code has actually failed!). This probably requires us to inspect the outputs of the fortran function in some way from c within the callstatement block. The easiest thing I can think of would be to just have an integer return value that we check. I think the outputs are somehow inside &a.
  • Optional: figure out how to pass the fortran error messages to python. Right now they are printed to stderr. Options.
  • Once we figure all this out, we need to make it work with aerobulk. The easiest way would be to somehow intercept the ctl_stop subroutine and replace it with our own custom subroutine that, instead of calling STOP, sets this error flag and calls RETURN. (Does RETURN just terminate the innermost subroutine or the whole fortran program?)

Steps to reproduce results with above files:

  1. Build initial signature file (only do once, because we modify this): python -m numpy.f2py range_check.f90 -m RANGE_CHECK -h range_check.pyf --overwrite-signature
  2. Build extensions using f2py: python -m numpy.f2py --verbose -c --f90flags="-fdefault-real-8 -ffree-line-length-200 --std=gnu" range_check.f90 range_check.pyf
  3. Optional: Build c wrappers for inspection: python -m numpy.f2py --verbose -m range_check.f90 range_check.pyf

Epic session today.

๐Ÿ˜Ž๐Ÿค“

We just discovered that @brian-rose is an expert on f2py! (See https://github.com/brian-rose/minimalf2py).

Brian, do you have any experience on error handling from Fortran. The above thread is very long and probably confusing, but basically we are trying to redirect Fortran STOP commands to raise Python exceptions (rather than crashing the entire session). Have you ever dealt with that situation before?

We just discovered that @brian-rose is an expert on f2py! (See https://github.com/brian-rose/minimalf2py).

... yikes. I am now about to be unmasked as a non-expert.

Unfortunately no, I've never dealt with this. To my knowledge, none of the fortran code I've ever tried to wrapped used STOP commands. I'll follow your progress here though!

If there's a good general solution (perhaps involving modifying the signature file, as you discussed above), then that might make a nice addition to minimalf2py

So I just realized that omitting all the error checking in AEROBULK_INIT will lead to a segfault (see 2536b44). This is not helpful for the user, but it at least shows up as failed in the CI!

I think we should just bite the bullet and to the range_check in the numpy code (where we can raise useful errors). I have touched upon this in a bit more detail in #36 (comment)

I now mark this as abandoned and close it due to the changes introduced in #43