We need to be more explicit about the Cats Effect memory model
armanbilge opened this issue · 35 comments
Currently the Cats Effect memory model is implicitly defined via implementation details and the Java Memory Model (JMM). In some cases this works okay but in other cases this means that our memory model is effectively undefined.
As a concrete example, earlier today on Discord there was a report that Semaphore offers no memory ordering guarantees when acquiring/releasing 0 permits. This boils down to a specific implementation detail that doesn't matter if you are using CE's own concurrency primitives and data structures, but matters a lot if you are reading/writing unsynchronized memory unsafely with delay. We can't say whether the current Semaphore implementation is buggy without explicitly defining the expected semantics.
With Scala Native multi-threading on the horizon in #3662, nailing down the specifics of the CE memory model becomes even more important since we have a lot more flexibility in implementation details. Longer-term, Scala Native could potentially even support weaker memory models than the JMM.1
Footnotes
1. Tangential, but I have a hypothesis that code written using immutable data structures and CE concurrency primitives actually does not need many guarantees of the JMM (e.g. publication of final fields on construction) and that a weaker CE memory model would be sufficient and have better performance, esp. on ARM. ↩
I think it's hard to do this right, but probably worth it. Even a somewhat informal description is probably better than nothing (which is the current situation).
One of the things which makes this hard is that (as far as I know), the Scala memory model is also "implicitly defined via implementation details and the Java Memory Model".
Another thing we should think about is whether to define these rules for the typeclasses in kernel, or only for IO. Defining them for the typeclasses is more useful (I think), but that would also mean that we create new rules that every implementer (ZIO, Monix(?), ...) must follow.
I know it's just the example which triggered this issue, but I'd be interested in what guarantee an acquire(0) or release(0) is expected to provide. As far as I know, a semaphore (intuitively) provides happens-before for corresponding release/acquire pairs. If we're rel/acq-ing zero permits, these are not paired with each other...
(None of these concerns means we shouldn't do this. I think we should; I've just started to think about how to do it...)
I know it's just the example which triggered this issue, but I'd be interested in what guarantee an acquire(0) or release(0) is expected to provide. As far as I know, a semaphore (intuitively) provides happens-before for corresponding release/acquire pairs. If we're rel/acq-ing zero permits, these are not paired with each other...
For comparison, here's what the JDK Semaphore promises:
Memory consistency effects: Actions in a thread prior to calling a "release" method such as release() happen-before actions following a successful "acquire" method such as acquire() in another thread.
The OP mentioned that it works even with 0 permits.
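As a JVM-only sanity check of that documented guarantee, here is a minimal sketch (hypothetical class name, not CE code): a plain field written before release() is guaranteed visible after a successful acquire() in another thread.

```java
import java.util.concurrent.Semaphore;

// Sketch: the JDK Semaphore's release/acquire pair establishes
// happens-before, publishing the plain write to `data`.
public class SemaphoreHb {
    static int data = 0; // plain, unsynchronized field

    public static void main(String[] args) throws InterruptedException {
        Semaphore sem = new Semaphore(0);
        Thread writer = new Thread(() -> {
            data = 42;     // plain write...
            sem.release(); // ...published by the release
        });
        writer.start();
        sem.acquire();     // successful acquire pairs with the release
        System.out.println(data); // guaranteed to print 42
        writer.join();
    }
}
```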
Memory consistency effects: Actions in a thread prior to calling a "release" method such as release() happen-before actions following a successful "acquire" method such as acquire() in another thread.
I don't think this part of the JDK apidoc (which is usually pretty good) is worded as carefully as it should be. By that, I mean that it is not true as written. Let's say we have s = new Semaphore(2); one thread does s.acquire(1); foo(); s.release(1); and another thread does s.acquire(1); bar(); s.release(1). By a literal reading of the sentence above, foo() happens-before bar(), which is not true (in reality there is no specific ordering between them, as they could use "different" permits); also, according to the same sentence, bar() also happens-before foo(), which is a contradiction.
If we look at the case of monitors in JLS 17.4.5, this is what it says:
An unlock on a monitor happens-before every subsequent lock on that monitor.
The word "subsequent" seems important here (and I suspect that's what's missing from the semaphore apidoc). It is defined formally in 17.4.4. As far as I understand, a consequence of that definition is essentially the same as intuitively thinking in rel-acq pairs (or rel-acqs corresponding to each other). So there is happens-before, if the release and the acquire correspond to each other. (At least that's how I understand it.)
And I think this is where there is a problem with release(0) and acquire(0): they never correspond to each other (due to zero permits). ("Correspond" or "paired with" are obviously not precise enough to have in the spec, but I think it helps thinking about the issue.)
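For what it's worth, the JDK's Semaphore never even blocks on a zero-permit acquire, which is what makes the pairing question concrete (a small sketch, hypothetical class name):

```java
import java.util.concurrent.Semaphore;

// Sketch: acquire(0) needs no permits, so it returns immediately even
// when the semaphore holds zero permits; release(0) adds none.
public class ZeroPermits {
    public static void main(String[] args) throws InterruptedException {
        Semaphore sem = new Semaphore(0);
        sem.acquire(0); // does not block
        sem.release(0); // no-op permit-wise
        System.out.println("permits = " + sem.availablePermits());
    }
}
```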
Another thing we should think about is whether to define these rules for the typeclasses in kernel, or only for IO. Defining them for the typeclasses is more useful (I think), but that would also mean that we create new rules that every implementer (ZIO, Monix(?), ...) must follow.
That's a really great point. Particularly Ref is the important building block: until we specify its memory effects, we are helpless to say anything about the semantics of a Concurrent-based Semaphore.
But that's good though, I think it gives a concrete place to start from.
The other key semantic that jumps to mind is about memory visibility when you have sequential delay(...) *> delay(...) but each of those delay blocks runs on a different thread.
I'd also like to figure out where and how we are currently relying on these implicit semantics ourselves.
Particularly Ref is the important building block: until we specify its memory effects, we are helpless to say anything about the semantics of a Concurrent-based Semaphore.
Yeah. The rule could be the CE equivalent of this from the JLS: "A write to a volatile field happens-before every subsequent read of that field." (And of course this holds for the default Ref, due to its use of an AtomicReference.)
The other key semantic that jumps to mind is about memory visibility when you have sequential delay(...) *> delay(...) but each of those delay blocks runs on a different thread.
And this could be the CE equivalent of this. "If x and y are actions of the same thread and x comes before y in program order, then hb(x, y)." Obviously with fiber instead of thread, and "something" instead of program order. (As far as I can tell, this also holds currently due to Executor semantics. Of course we should double check that the WSTP follows correct Executor semantics, but it should.)
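On the JVM this rule indeed falls out of Executor semantics: actions before submitting a task happen-before the task runs, and a task's actions happen-before its Future.get() returning. A sketch (hypothetical names):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: two sequential "delay blocks" running on (possibly) different
// pool threads, ordered by submission/completion happens-before edges.
public class ExecutorHb {
    static int shared = 0; // plain field, never accessed concurrently

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.submit(() -> { shared = 1; }).get();            // task actions hb get()
        pool.submit(() -> System.out.println(shared)).get(); // submit hb task: sees 1
        pool.shutdown();
    }
}
```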
Similarly: we also have an equivalent of "A call to start() on a thread happens-before any actions in the started thread", when forking a fiber. And of "All actions in a thread happen-before any other thread successfully returns from a join() on that thread", when joining a fiber.
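The Thread versions of those two rules are directly observable (sketch, hypothetical names):

```java
// Sketch: the start() and join() happens-before edges publish plain
// writes across threads without any synchronization on the fields.
public class StartJoinHb {
    static int before = 0;
    static int inside = 0;

    public static void main(String[] args) throws InterruptedException {
        before = 1; // start() happens-before actions in the started thread
        Thread t = new Thread(() -> inside = before + 1);
        t.start();
        t.join();   // actions in t happen-before join() returning
        System.out.println(inside);
    }
}
```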
We should have something for async (or rather cont). Like: calling the callback happens-before the suspended fiber continuing. (I think this is also true currently.)
A little while back Daniel S and I chatted about this on Discord (sorry public internet) and had a follow-up conversation.
In those discussions, I proposed that most Cats Effect applications and libraries actually do not rely on the JMM's final field semantics.1 In idiomatic Cats Effect code, data is explicitly shared via Ref and Deferred (or higher-level concurrent structures like Queue) and thus can be properly synchronized. This is important because publication of final fields is expensive on the JVM and on Scala Native, especially on ARM.
Continuing from there, I also want to pin down exactly what memory is published when synchronizing via Ref and Deferred. For example, they could create a full memory fence, but I argue that this is not necessary, especially for idiomatic Cats Effect code which never uses vars etc.2 Instead, we only need to publish memory that is "structurally reachable" from the pointer we just published via the Ref/Deferred.
It turns out this is an existing concept in C++, known as memory_order_consume.
The consume memory order is a weaker form of acquire. Whereas acquire prevents all future memory accesses from being ordered ahead of the load, the consume order only prevents dependent future memory accesses from being reordered ahead of the load.
https://devblogs.microsoft.com/oldnewthing/20230427-00/?p=108107
The JMM also uses some notion of reachability to define its final field semantics.
... when the object is seen by another thread, that thread will always see the correctly constructed version of that object's final fields. It will also see versions of any object or array referenced by those final fields that are at least as up-to-date as the final fields are.
https://docs.oracle.com/javase/specs/jls/se21/html/jls-17.html#jls-17.5
Note, I'm not claiming that we can implement strictly these semantics. For example, final field semantics are not going away on the JVM, and LLVM doesn't currently expose memory_order_consume. It's just about what we are explicitly promising in our API. Meanwhile, our implementation details will end up offering stronger guarantees in practice than what is actually specified by our memory model. For most developers this won't matter.
Anyway, the reason for bringing this up again is because the possibility of relaxing the memory model on Scala Native has finally cropped up :)
Footnotes
1. The JMM's final field publication semantics are important when you are not using explicit synchronization, leaving memory visibility subject to data races. For example, we use this technique to support stealing of timers and packing a linked list. In these data race scenarios, the final field semantics are useful to guarantee that you see a consistent view of an object. ↩
2. Code that is working with vars (or other mutable memory) must already have a Sync constraint or greater and is arguably "low-level". In which case, that code is fully capable of using AtomicReference (instead of Ref) or manually fencing to achieve the desired memory effects. ↩
Just a note: memory_order_consume in C++ seems to be a mess (well... what isn't? ;-).
It is widely accepted that the current definition of memory_order_consume in the standard is not useful. All current compilers essentially map it to memory_order_acquire.
From P0371R1: Temporarily discourage memory_order_consume
That's from 2016, but newer notes are not much more encouraging:
The current definition of memory_order_consume is not sufficient to allow implementations to map a consume load to a "plain" load instruction (i.e. without using fences), despite this being the intention of the original proposal. Instead, memory_order_consume is typically treated as though the program had specified memory_order_acquire...
From P0735R1: Interaction of memory_order_consume with release sequences
Also interesting reading: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0750r1.html
Why would final field semantics matter when we specify and/or implement Ref and Deferred? They seem to be orthogonal to me. (As you say, "final field publication semantics are important when you are not using explicit synchronization", which Ref/Deferred is.)
(On the other hand, final field semantics absolutely matter to scala-native.)
Why would final field semantics matter when we specify and/or implement Ref and Deferred? They seem to be orthogonal to me.
Another way we could define the CE memory model is to say that Ref and Deferred publish their own value (a pointer to an object) but have no other memory effect (I believe this is memory_order_relaxed). This model would still be useful so long as users could rely on final field semantics to safely publish pointers to immutable objects via Ref and Deferred.
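A JVM sketch of that weakest model (hypothetical names): publish an immutable object through an opaque (roughly relaxed) store, with only final field semantics making the payload safe to read.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch: the reference is published with opaque (~relaxed) ordering; the
// reader relies solely on JMM final field semantics to see `v` correctly.
public class RelaxedPublish {
    static final class Box { final int v; Box(int v) { this.v = v; } }
    static final AtomicReference<Box> ref = new AtomicReference<>();

    public static void main(String[] args) throws InterruptedException {
        Thread t = new Thread(() -> ref.setOpaque(new Box(9)));
        t.start();
        Box b;
        while ((b = ref.getOpaque()) == null) { } // no acquire here
        System.out.println(b.v); // final field guarantee: sees 9
        t.join();
    }
}
```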
Huh, yeah, that makes sense. Basically, even if we'd make Ref useless, it wouldn't be completely useless 😄
Cross-linking my comment on the Scala Native thread.
So I've skimmed the "All Fields are Final" article again, and the cost doesn't seem so dramatic to me. The conclusion seems to be that on x86 it is practically free, and on ARM it is also not too bad. It probably helps, that hotspot can somehow avoid the acq fence on reads. (Do we know why scala-native can't do that?)
(Btw, I think what hotspot does with eliding the acquire is essentially what memory_order_consume tries to be.)
But of course, if we can make it faster on ARM, that's always good. (I'm just afraid of immutable objects changing... that's not fun...)
It probably helps, that hotspot can somehow avoid the acq fence on reads. (Do we know why scala-native can't do that?)
Hmm, what are you referring to exactly? This bit?
C tc = c; [LoadLoad] r1 = tc.c;
...
Turns out, most hardware also respects the order of so-called 'dependent' reads, and hence does not require emitting the barrier there.
If I understand correctly, I wouldn't describe this as avoiding the acquire fence. Rather, there is semantically an acquire there, that just happens to require emitting no instructions on most hardware.
But you did exactly nail the question I've been pondering today. Which is whether the implementation of final Field Semantics that I suggested in scala-native/scala-native#3610 (comment) is actually correct 😅
More broadly, my understanding was that for a release to be useful it must always be paired with an acquire. This is straightforward enough when you are loading/storing at the same address. But in this case, we want to pair a release fence (at the end of the constructor) with a load acquire (of the final field). The C++ reference describes a fence-atomic synchronization mechanism corresponding to our situation.
A release fence F in thread A synchronizes-with atomic acquire operation Y in thread B, if
- there exists an atomic store X (with any memory order),
- Y reads the value written by X (or the value would be written by release sequence headed by X if X were a release operation),
- F is sequenced-before X in thread A.
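The same fence-atomic pattern can be spelled with JDK 9 VarHandle fences (a sketch with hypothetical names; the releaseFence plays F, the opaque store plays X, the acquire load plays Y):

```java
import java.lang.invoke.VarHandle;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of fence-atomic synchronization: the release fence (F) is
// sequenced before the opaque store (X); the acquire load (Y) that reads
// X's value synchronizes with F, publishing the plain write to `payload`.
public class FenceSync {
    static int payload; // plain field
    static final AtomicReference<Object> slot = new AtomicReference<>();

    public static void main(String[] args) throws InterruptedException {
        Thread a = new Thread(() -> {
            payload = 7;
            VarHandle.releaseFence();     // F
            slot.setOpaque(new Object()); // X: store with no ordering of its own
        });
        a.start();
        while (slot.getAcquire() == null) { } // Y: acquire load
        System.out.println(payload);
        a.join();
    }
}
```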
I found a nice illustration of this in these slides.
In the context of final Field Semantics, I suppose b would be the final field and a would be the reference to the constructed object. Notice how a is stored with relaxed ordering, which is stronger than an ordinary, non-volatile write on the JVM (LLVM calls these monotonic and unordered, respectively). This StackOverflow answer also discusses some of the differences; IIUC relaxed is similar to Java volatile (i.e. cannot be cached) but with no other memory effect.
So this is where I'm getting a bit puzzled. It's quite possible that the JVM, C++, and LLVM just don't correspond to each other in such a direct or obvious way. It's also possible that on most platforms there is no practical distinction between a relaxed store and an ordinary store.
But the question remains what is the correct way to implement final Field Semantics on Scala Native 🤔
Regarding performance on JVM vs Scala Native ... the "All Fields are Final" article discusses barrier coalescing on HotSpot. It's possible that Native is unable to do this kind of optimization; if it could, it would have to be something in LLVM's optimizer (I didn't see any hits in a quick search). So this could explain why the performance impact is much worse on Native.
If I understand correctly, I wouldn't describe this as avoiding the acquire fence. Rather, there is semantically an acquire there, that just happens to require emitting no instructions on most hardware.
You're right. Or rather, semantically there is a consume there; because I think they can avoid a CPU-fence instruction exactly because it's a consume-like situation (i.e., they don't need a general-purpose acquire).
Regarding how finals are implemented in scala-native: is your concern that (1) the release is not paired correctly with the acquire, because they're for different memory addresses; or that (2) the object reference may be written with a plain store instead of a relaxed one? (Or both?)
(Now we're talking about something which I only have a vague understanding of, so don't take me too seriously.)
About (1): I think this is a valid concern. It seems that the acquire fence will be in the wrong place. If I understand it correctly, this is what happens now:
// constructor:
this.finalField = something
releaseFence()
// using the object:
val obj = ...
val x = obj.finalField
acquireFence()
While the constructor seems correct, at the use site what should happen is more like this:
// using the object:
val obj = ...
acquireFence()
val x = obj.finalField
About (2): I think C++ relaxed approximately corresponds to opaque, which is stronger than plain. But, there is this (here, emphasis mine):
Instead of X.setRelease(this, v), you can use Opaque mode (or Plain mode if x is primitively bitwise atomic), preceded with a releaseFence: VarHandle.releaseFence(); x = v;. And similarly, instead of int v = X.getAcquire(this), you can follow a Plain or Opaque mode access with an acquireFence: int v = x; VarHandle.acquireFence();.
Are object references primitively bitwise atomic in scala-native? They are on the JVM. (But when Valhalla value types arrive, those might not be!) If they are also in scala-native, (2) seems fine to me. (But honestly, I don't really understand the difference between LLVM monotonic and unordered...)
Regarding what's implementable and what is not (when we're implementing, e.g., Ref): I assume on scala-native we could do whatever LLVM allows(?). On the JVM: if we let go of JVM 8, we have a lot of toys to play with. In fact, we can have most of the things C++ has 😄 (with the notable exception of consume).
Or rather, semantically there is a consume there; because I think they can avoid a CPU-fence instruction exactly because it's a consume-like situation (i.e., they don't need a general-purpose acquire).
Actually it is even simpler than a consume-like situation, because of the finality of the field. The sample_consume_disallowed() example in this blog post illustrates that consume disallows speculation of finalField prior to loading obj, because of the dependency. In our case, speculation is ok: because finalField is final, any values obtained by speculation prior to the load must be consistent with the value after the load.
Attempting to illustrate more concretely, consider:
val obj1 = getObjUnordered()
val obj2 = getObjUnordered()
val x = obj2.finalField
It would be okay for that code to be re-written to:
val obj1 = getObjUnordered()
val speculation = obj1.finalField
val obj2 = getObjUnordered()
val x = if (obj2 eq obj1) speculation else obj2.finalField
Contrast with:
val obj1 = getObjUnordered()
val obj2 = getObjConsume()
val x = obj2.finalField
In this case, that rewrite is not allowed.
It seems like final Field Semantics are working without barriers because "most" hardware works that way. But I'm not sure which hardware 😅
Breaking this order is not directly possible on runtime side, because it will violate the data dependencies. ... Turns out, most hardware also respects the order of so-called 'dependent' reads, and hence does not require emitting the barrier there.
About (1): I think this is a valid concern. It seems that the acquire fence will be in the wrong place.
Ugh, you are right, it is in the wrong place. And actually I had completely missed this.
Are object references primitively bitwise atomic in scala-native?
Not sure. I hope so 😅 object references are modeled as LLVM pointers and I can't find any LLVM-related hits for those keywords. They probably call it something else ...
But honestly, I don't really understand the difference between LLVM monotonic and unordered...
I agree, it seems to be (approximately) equivalent to the JVM's opaque and plain.
In any case, thanks, that section you quoted is helpful. I still don't feel like I fully grok anything 🤔 It definitely does seem like the acquire for loading final fields is in the wrong place and that we probably don't need it in practice. The combination of the data dependency and finality seems sufficient to achieve the semantics we want ... on "most hardware" 😅 I'm going to stare harder at this for a while.
I assume on scala-native we could do whatever LLVM allows(?).
On Scala Native we can do whatever NIR (Scala Native bytecode) allows. In practice, this corresponds closely to LLVM, including for atomics.
In our case, speculation is ok: because finalField is final, any values obtained by speculation prior to the load must be consistent with the value after the load.
I don't think so. This whole game with fences exists exactly because it is possible to read 2 different values from a final field: namely, the null before initialization, and its initialized value.
It seems like final Field Semantics are working without barriers because "most" hardware works that way. But I'm not sure which hardware 😅
Apparently Alpha is not like "most" hardware:
Without rcu_dereference(), DEC Alpha can load a pointer, dereference that pointer, and return data preceding initialization that preceded the store of the pointer.
From https://www.kernel.org/doc/Documentation/RCU/rcu_dereference.txt (which is about Linux RCU, which was apparently the motivation for consume).
But also, the CPU is not the only one who can reorder things: the compiler(s) can do that too.
Without memory order consume, both the compiler and (again, in the case of DEC Alpha) the CPU would be within their rights to carry out aggressive data-speculation optimizations that would permit readers to see pre-initialization values in the newly added data elements.
From https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/p0098r0.pdf.
So I think final fields work without (read) barriers in Hotspot, because (1) there is no Hotspot for Alpha (probably?), and (2) Hotspot knows, that neither Hotspot, nor the Java compiler makes any optimization which would break them. (I think (2) is equivalent to saying that there is a compiler read barrier for final fields.)
Btw., interesting (although long, and at points very confusing) discussion about consume: https://news.ycombinator.com/item?id=36040457
Thanks, I found jcranmer's comments in this thread of the HN post very helpful.
So I think final fields work without (read) barriers in Hotspot, because (1) there is no Hotspot for Alpha (probably?), and (2) Hotspot knows, that neither Hotspot, nor the Java compiler makes any optimization which would break them. (I think (2) is equivalent to saying that there is a compiler read barrier for final fields.)
Okay, now I see what you are saying. I agree.
For the record, I have a hard time seeing what kind of optimizations a compiler could make at the use-site of a final field that would allow it to see uninitialized values without violating the data dependency. Of course the limits of my imagination don't count for anything in practice 😅
Unfortunately this is bad news for Scala Native where we don't have fine control over the optimizations that LLVM makes. Or rather, emitting these barriers is how we control the optimizations.
I'm worried that to correctly implement final Field Semantics it must:

- emit a release fence at the end of the constructor of a class with final fields
- store instances with relaxed ordering
- load instances with acquire ordering

Since so much code is generic, eliding the barriers for stores/loads is difficult unless you know at linking time that no subclasses require final Field Semantics.
- store instances with relaxed ordering
Doesn't scala-native need this anyway? For every access to shared memory, since data races are undefined behavior (on LLVM).
For every access to shared memory, since data races are undefined behavior (on LLVM).
Scala Native uses unordered for this.
It essentially guarantees that races produce somewhat sane results instead of having undefined behavior. ... This is intended to match the Java memory model for shared variables.
Ah, okay. And that's not enough for the situation with finals? Does it really need monotonic (which is apparently the same as relaxed)?
Does it really need monotonic (which is apparently the same as relaxed)?
Well, maybe not, based on what you shared in #3899 (comment).
Instead of X.setRelease(this, v), you can use Opaque mode (or Plain mode if x is primitively bitwise atomic), preceded with a releaseFence: VarHandle.releaseFence(); x = v;
If we can confirm primitive bitwise atomicity of pointers, then I suppose an unordered write would be sufficient.
I suppose an unordered write would be sufficient.
What bothers me is that even though the JDK 9 Memory Order Modes guide suggests this would work, nothing in the C/C++ world says that. If anything, they seem to say the opposite, i.e. that it won't work 😕
It is possible to issue memory barrier without an associated atomic operation ... [but] still requires atomic operations to work as defined
We're getting very much offtopic here, but...
If anything, they seem to say the opposite i.e. that it won't work
Yeah. Except that's C++; what "we" (i.e., scala-native, really :-) care about is LLVM. I think unordered is fine for LLVM:
A fence A which has (at least) release ordering semantics synchronizes with a fence B with (at least) acquire ordering semantics if and only if there exist atomic operations X and Y, both operating on some atomic object M, such that A is sequenced before X, X modifies M (either directly or through some side effect of a sequence headed by X), Y is sequenced before B, and Y observes M.
From https://llvm.org/docs/LangRef.html#id219 (emphasis mine).
So, like C++, LLVM requires atomic operations (i.e., an atomic operation with any ordering constraint). Reading https://llvm.org/docs/LangRef.html#atomic-memory-ordering-constraints, it seems that the weakest one is unordered. So that should be fine.
Why the difference? I think it's simply because the weakest possible atomic operation in C++ is a relaxed one (which corresponds to LLVM monotonic). But LLVM introduces a weaker (but still atomic!) operation: unordered.
I think unordered is fine for LLVM ... LLVM introduces a weaker (but still atomic!) operation: unordered.
Thanks. Yes, I've been pondering the same. If true (and seems like it should be) that's very good ... however I no longer have a good intuition for what atomic actually means 😅 LLVM explains but I haven't really grokked it.
These operations are required to be atomic in the sense that if you use unordered loads and unordered stores, a load cannot see a value which was never stored. ... Some examples of unsafe optimizations are narrowing an assignment into a bitfield, rematerializing a load, and turning loads and stores into a memcpy call.
I just don't have a good enough understanding of how and why compilers apply those sorts of optimizations to unordered loads and stores and furthermore, why that would cause fences to have undefined behavior 🤔
We're getting very much offtopic here, but...
Ha, maybe 😅 I'll invoke All Fields Are Final in an attempt to defend this tangent:
Since memory models are great compromise about what users can expect, and what we are actually able to pull off, any semantics needs a plausible implementation.
i.e. understanding what we can implement, and how, is a prerequisite for defining the Cats Effect memory model. In particular, if it is impossible to efficiently implement final Field Semantics on Scala Native (because we don't control the entire compilation pipeline the way that the JVM/Hotspot does), then we should be more aggressive about designing and implementing CE in a way that does not rely on those semantics.
In any case, thanks for your patient and thoughtful engagement on this subject, much appreciated :)
At this point, I think I'll just say that atomic is whatever the spec says is atomic; and if another part of the spec (e.g., the fence instruction) requires atomic, then anything that's defined to be atomic must be fine. 🤷♂️
But, speculating a little bit, a compiler might want to sometimes generate a 64-bit write as two 32-bit writes (e.g., on a 32-bit platform, or something). This would not be atomic (the "transforms a store into multiple stores" part); and obviously could cause problems with reading values which never were written. And if I'm dereferencing a pointer which never existed... that's obviously very bad. (And I think it's undefined behavior exactly because they want compilers to be able to freely do these kinds of transformations.)
Returning to CE a little bit...
if it is impossible to efficiently implement final Field Semantics on Scala Native (...) then we should be more aggressive about designing and implementing CE in a way that does not rely on those semantics
About designing (i.e., what we spec): it seems to me that the only way we'd really rely on final semantics is the one you mentioned earlier (when I asked), when Ref/Deferred has opaque (i.e., relaxed) semantics. To me that seems way too weak for a general purpose thing like Ref. (I guess it could be useful for some kind of statistical counter, but beyond that...)
The next stronger possible way to spec them seems to be Ref/Deferred having release/consume semantics. If I understand correctly, you say that "idiomatic CE code" should be fine with that. I have some opinions on that, but if we accept it, then said code doesn't rely on final semantics any more. (Since it publishes through Ref/Deferred, it could even have var fields, as long as it only writes them in the constructor. It will be fine due to the release/consume of the Ref/Deferred.)
About implementing: CE is full of intentionally racy internal code which relies on finals. If it can't do that... it would need a lot of volatiles (or AtomicReferences, etc.). I don't think that would make anything faster... (But maybe you meant implementing Ref/Deferred specifically?)
About implementing: CE is full of intentionally racy internal code which relies on finals. If it can't do that... it would need a lot of volatiles (or AtomicReferences, etc.). I don't think that would make anything faster
It might be helpful to talk about some specific examples. Here's one that I know of:
cats-effect/core/shared/src/main/scala/cats/effect/IOFiber.scala
Lines 164 to 165 in 97b0174
Currently this works via a data race because the IO.Pure(outcome) publishes its final field. If it didn't, you could join a completed fiber and read an uninitialized null instead of the outcome.
If IO.Pure(...) no longer publishes its fields, then we will have to mark @volatile var _join. I don't think this should make anything slower:
- previously, there was a release in the Pure ctor and an acquire prior to reading its value field
- now, there is a release when writing _join and an acquire when reading join

Meanwhile, we removed the release/acquire barriers from all other uses of IO.Pure. That should make things faster.
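A minimal JVM sketch of that volatile-based strategy (hypothetical names, not the actual IOFiber code):

```java
// Sketch: instead of relying on final-field publication of the outcome
// object, publish it through a volatile field; the release happens when
// writing `join`, the acquire when reading it.
public class JoinPublication {
    static final class Outcome { final int value; Outcome(int v) { value = v; } }
    static volatile Outcome join;

    public static void main(String[] args) throws InterruptedException {
        Thread fiber = new Thread(() -> join = new Outcome(42));
        fiber.start();
        Outcome o;
        while ((o = join) == null) { } // volatile read: acquire
        System.out.println(o.value);
        fiber.join();
    }
}
```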
The next stronger possible way to spec them seems to be Ref/Deferred having release/consume semantics. If I understand correctly, you say that "idiomatic CE code" should be fine with that. I have some opinions on that
Care to share? :)
It might be helpful to talk about some specific examples
Yeah, you're right:
- The one you linked (_join): this is a good example; however, volatile is stronger than rel/acq. Also, as we previously discussed, on the JVM the acquire for finals is optimized to a nop :-)
- IOFiber#_cancel is basically the same thing.
- IOFiber#captureTrace reads the final field tracingEvents. If it reads null, we get no trace in our liveFiberSnapshot; not the end of the world, but also not nice.
Yeah, ok, I admit this is not that much... I remembered (imagined?) more 🤷♂️
- If we count open PRs, there is also #3781. If I remember correctly, it counts on Node#triggerTime being a val and always being correctly visible.
Not intentional races in CE, but possible races in user code (I guess we could say that these are forbidden...):
- The WSTP is accessible (seen as an `ExecutionContext`) by user code which doesn't run on the WSTP. Its `execute` calls `scheduleExternal`, which reads `val externalQueue`, which then reads `val queues` in `ScalQueue`. Maybe it's not typical to use an executor obtained through a race, but I believe it is expected to work. (I mean... I'd expect executors to be thread-safe...) And the problem could be recursive: I think it is typical to store an executor in a `final` field...
- A more general version of the previous point: are the thread-safe objects in CE expected to be really thread-safe? E.g., if user code obtains an `UnboundedAsyncQueue` through a race, and `final`s "don't work", then its operations could fail with NPEs.
Not final fields, but data race (probably unintentional?):
- `LocalQueue#getTotalFiberCount` and others below it read plain `var`s which are `long`s; this is not atomic, and they can read values which were never written. (This is only used by the `LocalQueueSamplerMBean`, so again, not critical, but still...)
> Care to share? :)
I will, just need time to think it through (and write it down) properly...
So, let's imagine that we say idiomatic CE code should be fine with using `Ref`/`Deferred` and use only "structurally reachable" things. If I understand it correctly, this is approximately saying that `Ref#set` and `Deferred#complete` have release semantics; and `Ref#get` and `Deferred#get` have consume semantics. (And `Ref#modify` and friends have consume/release.)
I believe this would make totally reasonable code like this incorrect. In fact, it would make incorrect the whole idiom of using a Deferred[F, Unit]
to "wake up" someone, who should (re)try doing something with a Ref
. The one which "wakes up" the other, essentially does this:
```scala
state.set(newState)   // release
deferred.complete(()) // release
```
While the other fiber:
```scala
deferred.get() // consume a ()
state.get()    // consume a state; but this line could be reordered before the previous one!
...
```
So after waking up, it could read an "old" state (i.e., it doesn't necessarily see the `newState` which was written by the first fiber).
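To see the required pairing concretely, here is a plain-JVM analogue of the same idiom (a sketch, not CE code), with a `CountDownLatch` playing the role of the `Deferred`: the reader is guaranteed to see the new state only because `countDown` is a release that pairs with a genuine *acquire* in `await`.

```scala
import java.util.concurrent.CountDownLatch

// Plain-JVM analogue of the Ref + Deferred wake-up idiom.
// `state` is deliberately a plain var: its visibility to the reader
// depends entirely on the release (countDown) / acquire (await) pairing,
// just as the Ref read above depends on the ordering of Deferred#get.
object WakeUpDemo {
  @volatile private var observed: String = ""

  def run(): String = {
    var state: String = "old"          // plays the role of the Ref
    val wakeUp = new CountDownLatch(1) // plays the role of the Deferred

    val reader = new Thread(() => {
      wakeUp.await()   // acquire: the next load must not float above this
      observed = state // guaranteed to read "new"
    })
    reader.start()

    state = "new"      // plain store ...
    wakeUp.countDown() // ... published by this release
    reader.join()
    observed
  }
}
```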
> however, `volatile` is stronger than rel/acq. Also, as we previously discussed, on the JVM the acquire for `final`s is optimized to a nop :-)
Yes. Well, let's be more precise about what exactly we are discussing ...
On the JVM, `final` Field Semantics aren't going away. We could define classes with `var`s instead of `val`s, but this might have other implications, and as you just described, `@volatile` (which uses the `seq_cst` ordering) or even `VarHandle` would probably have worse performance.

On Scala Native, `final` Field Semantics will be configurable. Also, the current implementation of those semantics does rely on an `acquire` for reading `final` fields. And lastly, instead of clumsily applying `@volatile` we can actually load/store with the precise orderings that we want.
Suffice to say, it seems less likely that changing our implementation on the JVM will bring any optimizations. But we have a lot more potential to do so on Scala Native.
> `IOFiber#captureTrace` reads the final field `tracingEvents`. If it reads `null`, we get no trace in our liveFiberSnapshot; not the end of the world, but also not nice.
I think this one is different from `_join` and `_cancel`. `tracingEvents` is initialized in the `IOFiber` constructor and its value does not change. To read its value you need a reference to the `IOFiber` itself, and in practice there are two ways you will obtain that reference:

- (indirectly) via a load with `acquire` ordering, which should also publish `tracingEvents`
- via data race, in which case 🤷 there were no visibility guarantees anyway
> If we count open PRs, there is also #3781
Yes, definitely count this one :) I admit this one definitely needs some thinking about.
The `Node#triggerTime` thing is annoying. If we assume that loading the uninitialized value always returns `0`, maybe we could treat that as a special sentinel that's never a valid trigger time. But I'm not sure if that's safe to assume ...
And in fact there is another memory publication issue there that I've been worried about. The `Node#callback` used to resume the `IOFiber` runloop closes over state. It is obscured by the closure, but we actually rely on `final` Field Semantics to get that state published so that the callback works when invoked by a stealing thread.
cats-effect/core/shared/src/main/scala/cats/effect/IOFiber.scala
Lines 628 to 630 in 85d6f3c
Instead of encoding that callback as an anonymous closure, we could implement it as a top-level (static) class that explicitly takes all its state in a constructor. Then in the implementation of `apply` we would code defensively against uninitialized state.
Putting aside those hacks, I think we could definitely make it work like this:

- emit a `release` fence prior to storing the `Node` in the heap
- if reading fields on the owner thread, use an ordinary load
- if reading fields on a stealing thread, emit an `acquire` fence before loading
Yes, I realize this is just replicating `final` Field Semantics 😅 and maybe that's the moral of the story here. It's very difficult to escape synchronization.
> Not intentional races in CE, but possible races in user code (I guess we could say that these are forbidden...)
>
> ...
>
> are the thread-safe objects in CE expected to be really thread-safe?
Tough questions. I don't know. I see what you are saying, esp. about the definition of thread-safety. "Thread-safe but only if published with appropriate fencing" doesn't give the usual cozy thread-safe feeling.
In Scala Native, my argument has been that the number of developers designing algorithms based on data races is small and that the onus should fall to them to know what they are doing.
For Cats Effect users to encounter these sorts of things requires creeping out of the "idiomatic" patterns into a wildly unsafe world. May I cop-out and declare these things as implementation-specific? 😅
> `LocalQueue#getTotalFiberCount` and others below it read plain `var`s which are `long`s; this is not atomic, and they can read values which were never written. (This is only used by the `LocalQueueSamplerMBean`, so again, not critical, but still...)
Mmm. Seems like a legit bug, independent of this discussion.
> I believe this would make totally reasonable code like this incorrect. In fact, it would make incorrect the whole idiom of using a `Deferred[F, Unit]` to "wake up" someone, who should (re)try doing something with a `Ref`.
This is a fantastic observation. Thank you so much for thinking about this :)
I am going to think more about it, but even if hypothetically there is a "better" or "more idiomatic" pattern that could be used instead, I absolutely agree with you that it is very, very reasonable code. Paired with the fact that we cannot implement `consume`-like ordering anyway, we should just spec `acquire` for `Ref`.
> instead of clumsily applying `@volatile` we can actually load/store with the precise orderings that we want.
We can absolutely do this on the JVM too (versions > 8).
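For instance (a sketch with hypothetical names, assuming Java 9+): the access-mode methods on the `java.util.concurrent.atomic` classes (and on `VarHandle`) let us pick exactly a release store and an acquire load, instead of the `seq_cst` ordering that `@volatile` and plain `set()`/`get()` imply.

```scala
import java.util.concurrent.atomic.AtomicReference

// Hypothetical wrapper demonstrating precise orderings on the JVM:
// setRelease is only a release store, getAcquire only an acquire load,
// avoiding the full sequentially-consistent barrier of volatile access.
final class Slot[A <: AnyRef] {
  private val cell = new AtomicReference[A]()

  def publish(a: A): Unit = cell.setRelease(a) // release store only
  def read(): A = cell.getAcquire()            // acquire load only
}
```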
> Suffice to say, it seems less likely that changing our implementation on the JVM will bring any optimizations. But we have a lot more potential to do so on Scala Native.
👍
`IOFiber#captureTrace` reading `tracingEvents`: I think we can read an `IOFiber` through a race. Here we read fibers from `WeakBag`s, and we can see a fiber another thread is writing. `Entry` in `WeakBag` stores the fiber in a `final` field, but if those are not working the way they do on the JVM, we can see a "half-initialized" fiber.

(Besides, saying that there are "no visibility guarantees anyway" when reading through a race is a little strange to me. There are guarantees for `final` fields... we're just discussing relaxing them :-)
#3781: seeing "at least" the initial `0L` is guaranteed (on the JVM, definitely). But the problem is that if it's not `final` (this being a `long`), we can read "half" of it (I think it's called "tearing"?). (Edit: tearing is something else, this is just the lack of access atomicity.)
The closure is a very good point. (I was thinking about closures, but didn't put together the two things.) I expect this is a thing that will be easy to forget. (Btw., how do you use the new scala-native annotation on the fields of a closure?)
> It's very difficult to escape synchronization.
💯
If we try to avoid synchronization for performance in specific cases, like in #3781, and also avoid synchronization in general for `final` fields (like in scala-native 0.5)... that has consequences...
About thread-safety of WSTP and others:
> In Scala Native, my argument has been that the number of developers designing algorithms based on data races is small and that the onus should fall to them to know what they are doing.
Sure. But maybe we're underestimating the number of unintentional data races? I suspect the reason the JVM has more or less "reasonable" semantics even for data races is exactly because they expected them to be common, and wanted to avoid the situation in C/C++, where it's undefined behavior. Whether they're really that common (or if they're common specifically in Scala) is a very good question...
> May I cop-out and declare these things as implementation-specific?
That's basically saying "no" 🙂 (i.e., generic code will need to publish them with synchronization.)
> This is a fantastic observation. Thank you so much for thinking about this :)
Sure thing, it's interesting stuff. (Although now I know more about the C++ memory model than I've ever wanted to...)
(Sorry for slow cadence, got a bit behind on stuff :)
> Besides, saying that there are "no visibility guarantees anyway" when reading through a race
Sorry, let me try to explain again. Your original comment was:
> `IOFiber#captureTrace` reads the final field `tracingEvents`. If it reads `null`, we get no trace in our liveFiberSnapshot; not the end of the world, but also not nice.
If we are reading a reference to a fiber via a race, then we might not even see that fiber. As you say, not the end of the world, but also not nice. Suppose we did read a non-`null` reference to the fiber. Now we would attempt to read its `tracingEvents` field, and as you point out, this could also be `null`. Still not the end of the world, and also not nice, but is it any more not nice? 🤷 We might not have seen this fiber at all anyway.
I think it would be especially not nice if we were guaranteed to see the reference to the fiber, but not guaranteed to see its contents. But when we are reading references to fibers themselves via a race, this is not the case: we are neither guaranteed to see the reference nor its contents. This is what I meant by "no visibility guarantees anyway".
> seeing "at least" the initial `0L` is guaranteed (on the JVM definitely)
I suppose so, but this is less clear to me on Scala Native. What confuses me is that the same addresses may be used over and over again for different objects as stuff gets GCed and allocated. So in the case of a data race, I don't understand what prevents you from reading (stale) values of an old, GCed object, and how you can be confident you are seeing at least the "cleared" values, i.e. `0`s (which is still stale).
> But the problem is, that if it's not `final` (this being a `long`), we can read "half" of it
Mmm, right, you are referring to 17.7. Non-Atomic Treatment of double and long?
> The closure is a very good point. (I was thinking about closures, but didn't put together the two things.) I expect this is a thing that will be easy to forget.
Indeed.
> (Btw., how do you use the new scala-native annotation on the fields of a closure?)
By writing it out as a top-level `class` with the appropriate fields that extends `FunctionN`. Then those fields can be annotated as required.
On Scala JVM and JS, using a closure vs a class generates meaningfully different code (e.g. `invokedynamic`), but in Scala Native it ends up encoding to the same thing, i.e. the closure is encoded as a class. So explicitly writing it out as a `class` shouldn't add any performance penalty.
> But maybe we're underestimating the number of unintentional data races? I suspect the reason the JVM has more or less "reasonable" semantics even for data races is exactly because they expected them to be common
Tough to say. `final` fields are everywhere in functional Scala, which loves immutability, but that's less the case for Java, particularly at the time when the JMM was introduced in Java 1.5 IIRC. So my gut feeling is that `final` field semantics had much less impact on Java code (and any data races that were common/uncommon) than it did for Scala code.
> May I cop-out and declare these things as implementation-specific?
>
> That's basically saying "no" 🙂 (i.e., generic code will need to publish them with synchronization.)
Haha, fair enough. I guess I was casting doubt on whether such code could be generic anyway i.e. if someone has dropped low-level enough that they have to worry about safety of WSTP when published through data race, chances are they are writing code that already cannot be generic across Scala Native and Scala JVM and needs to be split into platform-specific implementations.
But as you say, maybe this "low level" is not quite as low and uncommon as I think it is.
About "no visibility guarantees anyway": thanks, I understand now. I agree.
> So in the case of data race I don't understand what prevents you from reading (stale) values of an old, GCed object
I know (almost) nothing about scala-native, but this indeed seems an important thing to think about; fwiw, I'd consider it a (serious) bug in scala-native, if it's possible to see values from a GC'd object.
> Mmm, right, you are referring to 17.7. Non-Atomic Treatment of double and long?
👍
> By writing it out as a top-level class with the appropriate fields ...
I'd say the fact that we'd have to do this shows that this scala-native annotation is not that well thought through. The annotation itself could be made cross-compilation friendly (by making a stub for JVM/JS), but doing this requires split sources. Which are a pain...
> I guess I was casting doubt on whether such code could be generic anyway
Ehh... I see your point. But I still see value in having certain "inherently" multi-threading-adjacent objects being "unconditionally" thread-safe. E.g., `Thread` objects themselves, `ExecutionContext`s, and so on.
Further reading about final fields (and other related things, like NonAtomic vs. Unordered) in kotlin native: https://github.com/JetBrains-Research/KotlinMemoryModelResearch/blob/master/research/final-fields.md (and also other documents in that repo).
Even more reading: https://weakmemory.github.io/project-theses/studying-compilation-schemes-KN.pdf.