Thread-safety for union types
straight-shoota opened this issue ยท 7 comments
Since the introduction of the multi-threading preview (#8112), we have been aware that the representation of union types is not thread-safe.
Quoting from the introductory blog post:
In Crystal, unions between value-types and reference-types are represented as a tuple of a type id and the value itself. A union of
Int32 | AClass
is guaranteed to not have anil
value. But due to interleaving, the representation of anil
can appear, and a null pointer exception (a segfault in this case) will happen.
[...]
For now you will need to be aware that shared unions that can appear in class variables, instance variables, constants or closured variables are not safe (but will be).
The following example program demonstrates the issue. Parallel access from multiple threads can result in the class variable's type id being set by the first thread (to Foo
), but the value is set by the second thread (the integer 0
). Reading this combination results in a null pointer in a reference type that is not nilable.
Hence the program consistently runs into invalid memory access (within a couple hundred iterations or so). Note: It doesn't matter if the code is compiled with -Dpreview_mt
or without. It uses the non-public Thread
API for explicitness.
class Foo
class_property foo : Int64 | Foo = 0
INSTANCE = Foo.new
@value = 123
end
Thread.new do
loop do
Foo.foo = Foo::INSTANCE
end
end
Thread.new do
loop do
Foo.foo = 0
end
end
loop do
p Foo.foo # Segfault: Invalid memory access (signal 11) at address 0x4
end
It should be clear that crashing the process is unaceptable. The runtime must ensure sound behaviour of union types in a multi-threaded environment.
This topic as also been mentioned in MT schedulers impact (forum.crystal-lang.org). There it has been suggested that doing an atomic compare-and-swap could work for some unions, but not all (especially not for larger structs, @RX14). Another suggested approach is a spin-lock utilizing unused bits of the type id (@funny-falcon).
It should be clear that crashing the process is unaceptable. The runtime must ensure sound behaviour of union types in a multi-threaded environment.
Why is it unacceptable? This is a case of shared memory access and I'd not expect that to be multi thread safe by default, as the performance overhead would be insane. If someone wants that, why is not not reasonable that they handle that themselves?
Also note that this is not in any way unique to union types - this is just as much a problem for big structs that is not part of a union type.
That said - it can perhaps be reasonable to make certain kinds of variables thread safe by default. Not in any way limited to union types. Class properties in particular seems particularly suited to be protected against multi threaded usage due to their global nature.
This is issue is particularly relevant because it's about corrupted metadata in the type system. The type id indicating a different type than the data it describes is at a different level of severity than "just" inconsistent data.
Crystal's type system has a strong promise of safety, and this should hold up under all conditions.
A common idiom in Crystal is assigning the value of an ivar or the result of a method call to a local variable so that the type system can reason about it without outside influences (for example if foo = @foo.as?(Bar)
- now foo
is restricted to Bar
). The very reason this exists is to ensure sound typing in a multi-threaded environment. One requirement for this to work is that reading the instance variable is a sound operation (in terms of the type system). This must be guaranteed for all types.
If we could not uphold this, we'd need to start questioning the understaning of type safety in Crystal.
This is a case of shared memory access and I'd not expect that to be multi thread safe by default, as the performance overhead would be insane. If someone wants that, why is not not reasonable that they handle that themselves?
Safety comes first. Crystal usually tries to make the default behaviour as safe as possible to avoid unexpected footguns.
More performant alternatives with less safety gurantees are usually opt-in, not opt-out.
@yxhuvud we're not talking about "obvious" cases like mutating an Array, I'd remember to protect accesses (usually) but updating one value? That's usually thread safe... until it's not because Crystal decided to create an union? Nasty. That could go unnoticed and lead to modify whatever in a program ๐จ
Performance hit should be taken with a grain of salt (and measured). Globals or even closured data may not be mutated that much that the impact would lead to terrible performance. Static analysis could detect safe cases and skip some safety checks. Heck, look at this article the impact of always checking a generational number every time it dereferences a pointer was only 10% (before static analysis and manual opt-in to skip checks) ๐ฒ
we're not talking about "obvious" cases like mutating an Array, I'd remember to protect accesses (usually) but updating one value? That's usually thread safe... until it's not because Crystal decided to create an union? Nasty. That could go unnoticed and lead to modify whatever in a program
The "usually" part is hard to justify when any sufficiently large aggregate will have the same issue; after all, an Int64 | Foo
is really just an {Int32, Int64[1]}
at the ABI level. IMO both circumstances are equally obvious here; restricting thread-level parallelism to spawn
does not magically relieve the programmer of the responsibility to take care of concurrent accesses, whether they manifest via the type system or not.
That said, MT safety looks justifiable for every use of class_getter
/ class_property
in the standard library, most of them due to #14905. This also goes for types that fit into a single register and aren't polymorphic, like Log.progname : String
. So I wouldn't mind if class_*
is MT-safe by default, as long as the same concurrent control isn't baked into the ABI level, i.e. this will still require manual synchronization by the programmer:
class Foo
@@foo : Int64 | Foo = 0
INSTANCE = Foo.new
@value = 123
Thread.new do
loop do
@@foo = INSTANCE
end
end
Thread.new do
loop do
@@foo = 0
end
end
loop do
p @@foo # this is expected to break
end
end
the intent being that if a faster alternative is available, such as grouping multiple accesses into a critical section, then the programmer should be able to express precisely that without the hidden abstraction costs.
We do lots of expensive bound checks and lots more checks everywhere to avoid undefined behavior as C is plagued with, but union types, which are core to the crystal language, and whose codegen is an internal implementation detail of the compiler, are left to all the developers to take care of ๐
Getters are one thing, but what about a closured union that becomes shared across threads?
We can explore ways to make it safe, ways to opt out of safety (then the developer becomes responsible for the safety) and explore ways to skip the checks, then we can decide how to proceed.
Again - this has nothing to do with union types - this is just as problematic when accessing big structs. If they can change while reading the value then you are out of luck.
Consider the original example but changing foo
property to instead be an instance of type
struct Foo
property foo : Int64 | Foo = 0
end
Would that be type safe under the suggested scheme? I don't see how short of implementing a borrow checker.
Getters are one thing, but what about a closured union that becomes shared across threads?
The only way to reach sanity in the case of shared closures is to a: make sure it is properly initialized before accessing it from other threads and b: Don't change the value. If we can figure out a way to enforce properties like that then that could be a way to achieve safety.
edit: oh, and c: Make sure the value actually is there, in case it refer to a local variable. Unless there is something that make sure it doesn't go out of scope (like structured concurrency) there really isn't anything guaranteeing the originating scope hasn't returned.
This issue has been mentioned on Crystal Forum. There might be relevant details there:
https://forum.crystal-lang.org/t/charting-the-route-to-multi-threading-support/7320/1