Allow Ffi calls to be marked as potentially blocking / exiting the isolate.

Question


mkustermann opened this issue a year ago · 5 comments

Some users are running into an issue where many isolates call out to C code that then blocks. This can cause the Dart app to stop working because of our limit on the maximum number of threads that can be active in an isolate group at any given time.

The limit exists to avoid having too many threads execute Dart code at the same time. Without it, X threads may each hold a TLAB that still contains unallocated memory while thread X+1 fails to obtain a TLAB and therefore triggers a GC (even though the other threads' TLABs still have unallocated memory).
=> Allowing an unbounded number of threads to enter an isolate group can lead to excessive GCs (despite free memory in other threads' TLABs).

See runtime/vm/heap/scavenger.h for the current calculation of the limit:

  // The maximum number of Dart mutator threads we allow to execute at the same
  // time.
  static intptr_t MaxMutatorThreadCount() {
    // With a max new-space of 16 MB and 512kb TLABs we would allow up to 8
    // mutator threads to run at the same time.
    const intptr_t max_parallel_tlab_usage =
        (FLAG_new_gen_semi_max_size * MB) / Scavenger::kTLABSize;
    const intptr_t max_pool_size = max_parallel_tlab_usage / 4;
    return max_pool_size > 0 ? max_pool_size : 1;
  }
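To make the arithmetic concrete, here is a minimal standalone sketch of the same formula with the flag turned into a parameter. `kAssumedTLABSize` and `MaxMutatorThreadCountFor` are hypothetical names for illustration; the 512 KB TLAB size is taken from the comment in the snippet above, not from the actual headers.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the VM constants. The TLAB size follows the
// comment in the snippet above (512 KB) and is an assumption of this sketch.
constexpr std::intptr_t kMB = 1024 * 1024;
constexpr std::intptr_t kAssumedTLABSize = 512 * 1024;

// Same arithmetic as MaxMutatorThreadCount(), with the flag value (in MB)
// passed in as a parameter instead of being read from FLAG_new_gen_semi_max_size.
constexpr std::intptr_t MaxMutatorThreadCountFor(
    std::intptr_t new_gen_semi_max_size_mb) {
  const std::intptr_t max_parallel_tlab_usage =
      (new_gen_semi_max_size_mb * kMB) / kAssumedTLABSize;
  const std::intptr_t max_pool_size = max_parallel_tlab_usage / 4;
  return max_pool_size > 0 ? max_pool_size : 1;
}

// 16 MB / 512 KB = 32 TLABs, divided by 4 gives 8 mutator threads.
static_assert(MaxMutatorThreadCountFor(16) == 8, "matches the source comment");
// Very small new-space sizes are clamped to a single mutator thread.
static_assert(MaxMutatorThreadCountFor(1) == 1, "clamped to at least one");
```

Under these assumptions, the (N+1)-th thread that wants to run Dart code has to wait, which is what the blocked FFI calls run into.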

We may consider adding a boolean flag to specify that an FFI call may block and should therefore exit the isolate.

// Static binding
@Native("sleep", exitIsolate: true)
external void sleep(int seconds);

// Dynamic binding
dylib.lookup().asFunction<...>(exitIsolate: true);

to automatically exit and re-enter the isolate to avoid custom C code like this:

auto isolate = Dart_CurrentIsolate();
Dart_ExitIsolate();
<... run blocking C Code, e.g. sleep() ...>
Dart_EnterIsolate(isolate);
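Until such a flag exists, the manual pattern above could be wrapped in a small RAII guard so the re-enter cannot be forgotten on an early return. The sketch below is self-contained: the three `Dart_*` functions are fake stand-ins that only record calls (real code would include `dart_api.h` instead), and `ScopedIsolateExit` and `BlockingWork` are hypothetical names.

```cpp
#include <string>
#include <vector>

// Stand-ins for the embedder API so the sketch compiles on its own. In real
// code these declarations come from dart_api.h; these bodies only log calls.
struct FakeIsolate {};
using Dart_Isolate = FakeIsolate*;

static FakeIsolate g_isolate;
static Dart_Isolate g_current = &g_isolate;
static std::vector<std::string> g_log;

Dart_Isolate Dart_CurrentIsolate() { return g_current; }
void Dart_ExitIsolate() {
  g_log.push_back("exit");
  g_current = nullptr;
}
void Dart_EnterIsolate(Dart_Isolate isolate) {
  g_log.push_back("enter");
  g_current = isolate;
}

// Hypothetical helper: exits the current isolate on construction and
// re-enters the same isolate on destruction.
class ScopedIsolateExit {
 public:
  ScopedIsolateExit() : isolate_(Dart_CurrentIsolate()) { Dart_ExitIsolate(); }
  ~ScopedIsolateExit() { Dart_EnterIsolate(isolate_); }
  ScopedIsolateExit(const ScopedIsolateExit&) = delete;
  ScopedIsolateExit& operator=(const ScopedIsolateExit&) = delete;

 private:
  Dart_Isolate isolate_;
};

// Example of a native function that blocks; the isolate is exited for the
// duration of the body and re-entered when `exited` goes out of scope.
int BlockingWork(int input) {
  ScopedIsolateExit exited;
  g_log.push_back("blocking");  // placeholder for e.g. sleep() or blocking I/O
  return input * 2;
}
```

While the guard is alive the native code must not call Dart API functions that require a current isolate.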

See motivating use case: #51254

Answer 1 · 2023-02-06T12:06:51.000Z

The underlying issue is that new space (which we require for bump allocation) doesn't scale with the number of threads. Surfacing this limitation in the FFI API does seem a bit iffy.

We could devise a scheme where FFI calls give up their TLAB on the transition to C and re-acquire it on the way back, limiting the number of outstanding TLABs instead of the number of active isolates. Though that would make transitions more heavyweight, would make returning from C (as well as Dart C API calls) possibly blocking for an arbitrary amount of time. Seems less than ideal.

/cc @rmacnak-google

Answer 2 · 2023-02-07T21:00:51.000Z

I think this can be free in the uncontended case. When a new mutator wants to enter the isolate and the limit has been reached, we can check whether any existing mutator is in an ffi-exited safepoint state, CAS its safepoint state to one meaning it has been kicked out (causing the safepoint transition on the ffi-return to hit the slow path), and take its TLAB away. The safepoint transition slow path then gets a new check for whether it needs to wait on the mutator count before re-entering as a mutator.
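A rough sketch of the CAS idea, assuming a three-value state per mutator. The state names, the `Mutator` struct and the function names are all hypothetical; the VM's actual safepoint state encoding is different and is not reproduced here.

```cpp
#include <atomic>

// Hypothetical states for illustration only.
enum class MutatorState : int { kRunningDart, kInFfiCall, kKickedOut };

struct Mutator {
  std::atomic<MutatorState> state{MutatorState::kRunningDart};
  std::atomic<bool> has_tlab{true};
};

// Transition from generated code to native code.
void EnterFfiCall(Mutator& m) {
  m.state.store(MutatorState::kInFfiCall, std::memory_order_release);
}

// Called by a thread that wants to enter when the limit has been reached.
// Succeeds only if the victim is currently inside an FFI call.
bool TryKickOut(Mutator& victim) {
  MutatorState expected = MutatorState::kInFfiCall;
  if (!victim.state.compare_exchange_strong(expected,
                                            MutatorState::kKickedOut)) {
    return false;  // victim is running Dart code or was already kicked out
  }
  victim.has_tlab.store(false);  // take its TLAB away
  return true;
}

// Transition from native code back to generated code. Returns true on the
// fast path; false means the caller was kicked out and must take the slow
// path (wait on the mutator count, then re-acquire a TLAB).
bool TryReturnFromFfiFast(Mutator& m) {
  MutatorState expected = MutatorState::kInFfiCall;
  return m.state.compare_exchange_strong(expected,
                                         MutatorState::kRunningDart);
}
```

In the uncontended case the returning thread pays for one successful compare-and-swap, which is the sense in which the scheme is free when nobody is waiting.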

Answer 3 · 2023-02-20T06:58:05.000Z

Though that would make returning from C (as well as Dart C API calls) possibly blocking for an arbitrary amount of time.

It would still be compatible with Dart's semantics of synchronous code on an isolate running to completion before any other code is run on that isolate.

However, it would change which isolate gets scheduled once we have exhausted the maximum number of mutators in an isolate group, which might be surprising. Do we have some kind of scheduling logic for that? @mkustermann

Answer 4 · 2023-04-03T08:55:32.000Z

I think this can be free in the uncontended case. When a new mutator wants to enter the isolate and the limit has been reached, we can check whether any existing mutator is in an ffi-exited safepoint state, CAS its safepoint state to one meaning it has been kicked out (causing the safepoint transition on the ffi-return to hit the slow path), and take its TLAB away. The safepoint transition slow path then gets a new check for whether it needs to wait on the mutator count before re-entering as a mutator.

That's an interesting idea.

I'm a little worried that doing this blindly can lead to situations where, e.g., the Flutter UI isolate does an FFI call and another thread then kicks the UI isolate out. When the FFI call on the UI isolate returns, it will take the slow path and block (which could freeze the Flutter UI).

This can also happen to some extent today, but only at an event loop boundary (e.g. the Flutter UI isolate is idle, N threads enter the isolate group, and then the Flutter UI isolate cannot enter anymore and has to wait).

If a mutator has been kicked out and returns from its ffi call, then in the slow path it should be allowed to kick out another thread that is in an ffi call. That would mean the system works irrespective of the number of threads, as long as no more than N threads execute Dart code concurrently (which may be an OK restriction, since all Dart code being executed will eventually either go back to the event loop or do an ffi call, and both are yield points). It will require some synchronization on both sides, though:

Calling native via ffi: If there's another thread waiting to execute Dart we need to notify it (could use a mechanism similar to our existing "gc-safepoint-requested" bit, which forces GeneratedToNative onto the slow path)
Return from ffi call: Safepoint may have been stolen from the side, so we have to take the slow path and wait for the mutator count (or kick another thread out if there's any in ffi-exited state)
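The slot hand-off described above can be illustrated with a small single-threaded model. It is only a sketch under simplifying assumptions: there is no real concurrency or blocking, "would have to wait" is reported as a `false` return value, and `MutatorSlotModel` and its method names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical states for this model; the real VM tracks them differently.
enum class ThreadState { kIdle, kRunningDart, kInFfiCall, kKickedOut };

class MutatorSlotModel {
 public:
  MutatorSlotModel(std::size_t thread_count, std::size_t max_mutators)
      : states_(thread_count, ThreadState::kIdle),
        max_mutators_(max_mutators) {}

  // An idle or kicked-out thread asks to run Dart code. Returns false when it
  // would have to wait because every slot is held by a thread running Dart.
  bool TryEnter(std::size_t id) {
    if (slots_in_use_ < max_mutators_) {
      ++slots_in_use_;
    } else if (!KickOutOneInFfi()) {
      return false;
    }
    states_[id] = ThreadState::kRunningDart;
    return true;
  }

  // Calling native via ffi: the thread keeps its slot but becomes kickable.
  void CallNative(std::size_t id) { states_[id] = ThreadState::kInFfiCall; }

  // Return from ffi call: fast path if the slot is still ours, otherwise the
  // slow path, which may in turn kick out another thread that is in ffi.
  bool ReturnFromNative(std::size_t id) {
    if (states_[id] == ThreadState::kInFfiCall) {
      states_[id] = ThreadState::kRunningDart;
      return true;
    }
    return TryEnter(id);
  }

  // Going back to the event loop releases the slot.
  void ReturnToEventLoop(std::size_t id) {
    states_[id] = ThreadState::kIdle;
    --slots_in_use_;
  }

  ThreadState state(std::size_t id) const { return states_[id]; }

 private:
  // Transfers the slot of one thread that is in an ffi call, if any.
  bool KickOutOneInFfi() {
    for (ThreadState& s : states_) {
      if (s == ThreadState::kInFfiCall) {
        s = ThreadState::kKickedOut;
        return true;
      }
    }
    return false;
  }

  std::vector<ThreadState> states_;
  std::size_t max_mutators_;
  std::size_t slots_in_use_ = 0;
};
```

With a limit of two, a third thread can enter as soon as one of the first two is inside an ffi call, and a kicked-out thread only has to wait while all slot holders are actually running Dart code.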

Answer 5 · 2023-04-03T15:48:45.000Z

Hi. I have a complaint about this. If we're going to expose these kinds of isolate details to FFI developers, why can't we have some way for native code to at least return (synchronously) Dart objects that can be created through the Dart_CObject struct?