gap-system/gap

volatile's to be used in libgap-calling functions?

dimpase opened this issue · 8 comments

Observed behaviour

Crash upon calling a GAP function with wrong number of arguments via GAP_CallFunc3Args, (e.g. Sum(1,2,3))
as described in sagemath/sage#37026
and sagemath/sage#37951 - and also
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114872

Expected behaviour

normal return with a GAP error

GAP etc. versions

GAP 4.13.0, GAP 4.12.2, gcc 13.2.1 (important - not reproducible with other compilers), Python 3.12

From assembly code GCC people (see GCC link above) established that a parameter is held in a register (something that can be worked around by compiling with -O0), and, naturally, after GAP's longjmp, it gets trashed, etc.

Do you see this happening with other uses of libgap, or SageMath is a pioneer here?

Never seen anything like this before. Reading the gcc bugzilla, the issue seems to be incorrect longjmp/setjmp in sage itself, needing more volatile? Or did I read it wrong?

Never seen anything like this before. Reading the gcc bugzilla, the issue seems to be incorrect longjmp/setjmp in sage itself, needing more volatile? Or did I read it wrong?

My understanding is that there some arguments of GAP_CallFunc3Args are put into a non-volatile list, and
in particular this list holds ref-counted Python pointers to GAP's Obj data.
Then GAP_CallFunc3Args is used to call something that throws a GAP error, e.g. Sum(1,2,3).
As this involves longjmp by GAP, one of these pointers, held in a register, is invaludated by longjmp, and an attempt to decrease the corresonding refcount leads to a crash.

IMHO it might be a defect of Sage's interface (hello, side-effects :-)), which always creates a refcounted Python object to hold a GAP Obj. But this is not necessary for parameters passed, by value, to GAP functions.

@embray - have you encountered this?

see also https://trofi.github.io/posts/312-the-sagemath-saga.html - it does say that Sage needs more volatile, but not that anything else is amiss.

Yes, this is just normal GAP rules, you need lots more volatile. Might be tricky to do with pyx, not sure if it is making variables behind the scenes you can't attach volatile to?

The best option is probably to make a small GAP wrapper which does the GAP call, and setjmp / longjmp, and then returns some method of identifying an error occurred. That's not a hard function to write, but it would need an API designing.

Yes, this is just normal GAP rules, you need lots more volatile. Might be tricky to do with pyx, not sure if it is making variables behind the scenes you can't attach volatile to?

For each GAP object, Sage's interface creates/updates a refcount, with a pointer pointing to the object - to prevent GAP's GC from collecting it prematurely.

But often one can get away without it, as far as I understand. E.g. we can hold a GAP object in a local variable,
and as it goes out of scope, it becomes collectable.
E.g. as far as I understand, the following is a correct libgap-calling C code.

Obj foo(int bar) {
  Obj gbar;
  gbar = GAP_NewObjIntFromInt(bar);
  return GAP_FooBar(gbar);
}

Correct - in the sense that it does not create any uncollectable garbage, even if one of GAP_... calls
throws an exception, and it does not crash due to GC kicking in in the
wrong moment (eating up gbar before it's used in GAP_FooBar(gbar)).

Right? My preliminary tests with a modified Sage interface, which optionally skips doing a refcount for particular GAP objects, indicates that no crashes happen this way.

I'd like to add more examples along these lines in GAP's tst/testlibgap/ - also ones where a longjmp from GAP is getting caught. (In my tests, in Cython, GAP's exceptions are getting caught - not in the example above, though).

The best option is probably to make a small GAP wrapper which does the GAP call, and setjmp / longjmp, and then returns some method of identifying an error occurred. That's not a hard function to write, but it would need an API designing.

As I understand it, the issue isn't anything to do with libgap, or GAP at all. It's all about longjmp, and the following text from longjmp's docs. We could write some tests, nothing wrong with that, but I would expect they'll be fine, as long as they follow the longjmp rules (from man longjmp):

the values of automatic variables are unspecified after a call to longjmp() if they meet all the following criteria:

• they are local to the function that made the corresponding setjmp() call;
• their values are changed between the calls to setjmp() and longjmp(); and
• they are not declared as volatile.

As I understand it, the issue isn't anything to do with libgap, or GAP at all

It has to do with GAP's GC, as longjmp was breaking Sage's manual management of GC-able GAP objects - specifically, decrementing refcount/destroying the pointer if refcount reached 0.

• their values are changed between the calls to setjmp() and longjmp();

this bullet actually is not 100% correct, IMHO, as it includes the case of an automatic variable being held in a register (as an optimising compiler might arrange for - and actually does in this case). Then "changed" means "reused", as often happens to registers. One of these values, to be used by refcount processing in longjmp recovery, is held in a register, and gets invalidated by longjmp.

I don't see anything we could do here. As far as I can tell, the issue is purely on the caller side: code using GAP_Enter must be careful about any additional local variables -- those most be either volatile, or there must be no access to them in the code path taken when a longjmp occurred (i.e. when handling an error).

It is technically impossible to add those volatile qualifiers on our side, client code must take care of that.

I have no idea if this is applicable in your code, but: It may help to keep uses of GAP_Enter isolated to their own little helper functions. I.e. write your code as

    ok = GAP_Enter();
    if (ok) {
        callSubDoingActualWork(args);
    }
    GAP_Leave();

which minimizes the need of inserting volatile (as long as callSubDoingActualWork is not inlined, at least). But of course in your case you don't control the code fully, Cython generates it, so no idea if that can be turned into a solution for SageMath.

That said, I see nothing we can do. But if you find a fix that requires changes in libgap after all, of course don't hesitate to submit a PR.