On 2023-05-23, John A De Goes tweeted:
Take a wrapper (newtype) structure like: case class Box[A](value: A)

A simple local expression like Box(42).value has much more overhead than you would think:

- Allocation of 12 bytes on the heap for Box object header + 4 more bytes for a reference
- Allocation of 12 bytes on the heap for java.lang.Integer object header + 4 more bytes for an integer
- Invocation of a virtual getter method that Scala quietly generates for the value field
- Invocation of two separate constructors, one for Box and one for java.lang.Integer

The heap allocations put pressure on the garbage collector, while the access through the getter adds indirection overhead (which, thankfully, the JVM can sometimes optimize on its own), and the constructors add jumps. The net effect of these sources of overhead, but especially the size and quantity of heap allocations, is so significant that if you’re doing high-performance work in Scala, you have to think long and hard about the costs of abstraction. If you don’t use abstraction, your code is hard to maintain, low-level, and fragile, and fails to take advantage of the type system of Scala. If you do use abstraction, your code becomes easy to maintain, high-level, and robust, fully taking advantage of Scala’s type system, but suffers from unacceptable overhead in some high-performance scenarios. One of the most refreshing facts about Rust is that...
Sounds reasonable, but the reality is more complex. Neither the Scala, Java, nor JVM specifications mandate 12-byte headers for such structures!
Nothing in the JVM specification requires that the JVM allocate anything on the physical heap, nor that there be any real GC activity.
All of those are implementation details, and while the quoted statement is true for the most commonly used OpenJDK/HotSpot derivatives as of 2023, it is not valid for all JVMs, and may not be valid for future versions of OpenJDK/HotSpot. More importantly, it is not true today (!!!) for GraalVM, a more advanced JVM implementation: the case presented is one that Graal optimized a long time ago.
I will examine the original claim. We will use Scala 3.3 (though the exact version does not matter), OpenJDK (20.0.1), and GraalVM EE (22.3.1, built on JDK 17.0.6).
Let's use JMH, and sbt with the sbt-jmh plugin, to measure the performance.
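For reference, a minimal setup could look like the sketch below. The plugin version and the JMH annotations on the benchmark class are my assumptions (chosen to match the average-time, ns/op results reported later), not taken verbatim from the project:

// project/plugins.sbt - pulls in the sbt-jmh plugin
addSbtPlugin("pl.project13.scala" % "sbt-jmh" % "0.4.4")

// build.sbt - enables JMH code generation for this project
enablePlugins(JmhPlugin)

// src/main/scala/jmh/CaseBoxing.scala - skeleton of the benchmark class
package jmh

import java.util.concurrent.TimeUnit
import org.openjdk.jmh.annotations._

@State(Scope.Thread)
@BenchmarkMode(Array(Mode.AverageTime))
@OutputTimeUnit(TimeUnit.NANOSECONDS)
class CaseBoxing:
  // the @Benchmark methods shown below (longCollatz, boxCollatz,
  // boxCollatzR, opaqueBoxCollatzR) live in this class
  @Benchmark
  def baseline(): Long = 42L // hypothetical placeholder for the real benchmarks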
The popular Collatz conjecture is used for benchmarking in this example:
@Benchmark
def longCollatz(): Long =
  var n = 27L
  while (n > 1) {
    if (n % 2 == 0) {
      n = n / 2
    } else {
      n = 3 * n + 1
    }
  }
  n
This is a very simple imperative version that uses primitive values (longs) and primitive operations (add, mul, div, mod).
The value 27 is chosen because it generates 111 internal iterations, which is enough to get meaningful results.
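If you want to double-check that count, a throwaway snippet like this one (mine, not part of the benchmark suite) does the job:

// counts how many Collatz steps it takes to get from `start` down to 1
def collatzSteps(start: Long): Int =
  var n = start
  var count = 0
  while (n > 1) {
    if (n % 2 == 0) n = n / 2 else n = 3 * n + 1
    count += 1
  }
  count

collatzSteps(27) // 111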
Now, I can implement this using the Box class defined as in John's tweet:

case class Box[A](value: A)
@Benchmark
def boxCollatz(): Long =
  var n = Box(27L)
  while (n.value > 1) {
    if (n.value % 2 == 0) {
      n = Box(n.value / 2)
    } else {
      n = Box(3 * n.value + 1)
    }
  }
  n.value
This code performs multiple boxing-unboxing operations, which makes it a good example.
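To see where the allocations come from, here is my hand-written model of what a single step does at runtime for a Box[Long]; it mirrors (but does not reproduce) the code the compiler actually emits:

// For Box[Long], the type parameter is erased, so the Long cannot be stored
// unboxed: every new Box also forces the primitive through a boxed
// java.lang.Long.
def boxedStep(n: Box[Long]): Box[Long] =
  val raw: Long = n.value // virtual getter call, then an unbox
  val next: Long = if (raw % 2 == 0) raw / 2 else 3 * raw + 1
  // Two heap allocations happen on this line: the implicit boxing of `next`
  // into a java.lang.Long, and the Box wrapper itself.
  Box(next)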
Next, if you are already using wrapper types like this, you are probably not using mutable variables and loops; you are doing functional programming.
That is why I created a purer variation of the code above (using recursion):
@Benchmark
def boxCollatzR(): Long =
  def go(n: Box[Long]): Box[Long] =
    if (n.value > 1) {
      if (n.value % 2 == 0) {
        go(Box(n.value / 2))
      } else {
        go(Box(3 * n.value + 1))
      }
    } else {
      n
    }
  go(Box(27L)).value
Finally, I can also create a variant that uses opaque types (which only work in modern Scala 3.x).
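The benchmark snippets do not show the definition of OpaqueBox, so here is a minimal sketch of what it could look like (the apply constructor and the value extension method are my assumptions, chosen to mirror the Box API):

object opaques:
  // At compile time OpaqueBox[A] is a distinct type; at runtime it is just A,
  // so no wrapper object ever exists.
  opaque type OpaqueBox[A] = A

  object OpaqueBox:
    def apply[A](a: A): OpaqueBox[A] = a

  extension [A](b: OpaqueBox[A])
    def value: A = b

import opaques.*

With such a definition in scope, the benchmark looks like this: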
@Benchmark
def opaqueBoxCollatzR(): Long =
  def go(n: OpaqueBox[Long]): OpaqueBox[Long] =
    if (n.value > 1) {
      if (n.value % 2 == 0) {
        go(OpaqueBox(n.value / 2))
      } else {
        go(OpaqueBox(3 * n.value + 1))
      }
    } else {
      n
    }
  go(OpaqueBox(27L)).value
Now I am ready to do some testing.
The tests were done on my regular desktop (Threadripper CPU).
I did not invest much time into preparing a clean environment, nor did I run the measurements for very long, but the results should be meaningful anyhow.
The exact command I used was:

sbt "jmh:run -prof gc -w 2s -r 5s -wi 3 -i 3 -f 3 jmh.CaseBoxing"

(that is: 2-second warmup iterations, 5-second measurement iterations, 3 warmup iterations, 3 measurement iterations, and 3 forks). I added the -prof gc option (the profiler support in JMH is a powerful tool) in order to show an interesting detail.
This is what I got when I ran those examples under OpenJDK:
openjdk version "20.0.1" 2023-04-18
OpenJDK Runtime Environment (build 20.0.1+9-29)
OpenJDK 64-Bit Server VM (build 20.0.1+9-29, mixed mode, sharing)
[info] Benchmark Mode Cnt Score Error Units
[info] CaseBoxing.boxCollatz avgt 9 669.842 ± 18.420 ns/op
[info] CaseBoxing.boxCollatz:·gc.alloc.rate avgt 9 5285.721 ± 145.687 MB/sec
[info] CaseBoxing.boxCollatz:·gc.alloc.rate.norm avgt 9 3712.000 ± 0.001 B/op
[info] CaseBoxing.boxCollatz:·gc.count avgt 9 292.000 counts
[info] CaseBoxing.boxCollatz:·gc.time avgt 9 721.000 ms
[info] CaseBoxing.boxCollatzR avgt 9 657.624 ± 13.385 ns/op
[info] CaseBoxing.boxCollatzR:·gc.alloc.rate avgt 9 5383.351 ± 109.478 MB/sec
[info] CaseBoxing.boxCollatzR:·gc.alloc.rate.norm avgt 9 3712.000 ± 0.001 B/op
[info] CaseBoxing.boxCollatzR:·gc.count avgt 9 295.000 counts
[info] CaseBoxing.boxCollatzR:·gc.time avgt 9 677.000 ms
[info] CaseBoxing.longCollatz avgt 9 102.574 ± 2.663 ns/op
[info] CaseBoxing.longCollatz:·gc.alloc.rate avgt 9 ≈ 10⁻⁴ MB/sec
[info] CaseBoxing.longCollatz:·gc.alloc.rate.norm avgt 9 ≈ 10⁻⁵ B/op
[info] CaseBoxing.longCollatz:·gc.count avgt 9 ≈ 0 counts
[info] CaseBoxing.opaqueBoxCollatzR avgt 9 103.874 ± 3.162 ns/op
[info] CaseBoxing.opaqueBoxCollatzR:·gc.alloc.rate avgt 9 ≈ 10⁻⁴ MB/sec
[info] CaseBoxing.opaqueBoxCollatzR:·gc.alloc.rate.norm avgt 9 ≈ 10⁻⁵ B/op
[info] CaseBoxing.opaqueBoxCollatzR:·gc.count avgt 9 ≈ 0 counts
This is exactly the result that John De Goes describes. The simple imperative version takes about 102 ns per run, while all the wrapper variants carry a visible cost of roughly 670 ns per operation: over 6 (almost 7) times more. The notable exception is the Scala 3 opaque type, which gives results similar to the imperative code.
It is clearly visible that all the "functional" solutions were putting stress on the garbage collector, allocating more than 5 GB per second.
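As a quick consistency check (my arithmetic, not part of the JMH output): at 3712 B/op and one operation every ~670 ns, the expected allocation rate is 3712 B ÷ 669.842 ns ≈ 5.54 GB/s ≈ 5285 MiB/s, which matches the reported gc.alloc.rate almost exactly. That also works out to about 3712 / 111 ≈ 33 bytes allocated per loop iteration.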
What happens if I run the same benchmarks on Graal (22.3.1, just the version I had at the moment of writing)?
java version "17.0.6" 2023-01-17 LTS
Java(TM) SE Runtime Environment GraalVM EE 22.3.1 (build 17.0.6+9-LTS-jvmci-22.3-b11)
Java HotSpot(TM) 64-Bit Server VM GraalVM EE 22.3.1 (build 17.0.6+9-LTS-jvmci-22.3-b11, mixed mode, sharing)
[info] CaseBoxing.boxCollatz avgt 9 72.133 ± 5.547 ns/op
[info] CaseBoxing.boxCollatz:·gc.alloc.rate avgt 9 0.001 ± 0.002 MB/sec
[info] CaseBoxing.boxCollatz:·gc.alloc.rate.norm avgt 9 ≈ 10⁻⁴ B/op
[info] CaseBoxing.boxCollatz:·gc.count avgt 9 ≈ 0 counts
[info] CaseBoxing.boxCollatzR avgt 9 68.404 ± 1.571 ns/op
[info] CaseBoxing.boxCollatzR:·gc.alloc.rate avgt 9 0.001 ± 0.002 MB/sec
[info] CaseBoxing.boxCollatzR:·gc.alloc.rate.norm avgt 9 ≈ 10⁻⁴ B/op
[info] CaseBoxing.boxCollatzR:·gc.count avgt 9 ≈ 0 counts
[info] CaseBoxing.longCollatz avgt 9 63.048 ± 0.804 ns/op
[info] CaseBoxing.longCollatz:·gc.alloc.rate avgt 9 0.001 ± 0.002 MB/sec
[info] CaseBoxing.longCollatz:·gc.alloc.rate.norm avgt 9 ≈ 10⁻⁴ B/op
[info] CaseBoxing.longCollatz:·gc.count avgt 9 ≈ 0 counts
[info] CaseBoxing.opaqueBoxCollatzR avgt 9 62.978 ± 0.870 ns/op
[info] CaseBoxing.opaqueBoxCollatzR:·gc.alloc.rate avgt 9 0.001 ± 0.002 MB/sec
[info] CaseBoxing.opaqueBoxCollatzR:·gc.alloc.rate.norm avgt 9 ≈ 10⁻⁴ B/op
[info] CaseBoxing.opaqueBoxCollatzR:·gc.count avgt 9 ≈ 0 counts
[success] Total time: 258 s (04:18), completed May 29, 2023, 7:36:51 PM
It is clear that all the methods are more or less equal (and faster than in the OpenJDK case).
The reason is clearly visible in gc.alloc.rate: there were no allocations (!!!).
Yes, Graal was smart enough to figure out that the code does not need any, most likely thanks to its (partial) escape analysis and scalar replacement.
Interestingly, all this happened at the JIT level. The allocations were present in the Scala code and still visible in the bytecode (which is why analysing bytecode alone is quite often misleading), but the JIT (Graal) was still able to eliminate them.
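One way to convince yourself that this is purely a JIT-level effect (my suggested experiment, not one of the original runs) is to rerun the same benchmarks on GraalVM with the Graal JIT switched off, falling back to HotSpot's C2 compiler, passing the standard -XX:-UseJVMCICompiler flag through JMH's -jvmArgsAppend option:

sbt "jmh:run -prof gc -jvmArgsAppend -XX:-UseJVMCICompiler jmh.CaseBoxing"

With Graal out of the picture, the allocations (and the OpenJDK-like timings) should come back.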
I wanted to demonstrate two things:
- Functional code does not have to be slow or CPU-unfriendly; it is only a matter of ever-improving compilers, runtimes, and platforms. I do agree that currently (2023) our compilers (generally) still have a lot of room for improvement. If someone is indeed fighting for cycles, it is probably wise to sometimes rewrite some fragments of code in an imperative style or use lower-level languages. On the other hand, the more FP we do, the more pressure we put on vendors to provide compilers that are able to optimize such code (which is exactly what happened with Graal).
- It is very easy to spread performance myths, just by making statements that seem obvious and are actually true in some contexts. But they are not true in all contexts, and times, machines, compilers, runtimes, and platforms keep changing. The problem is that myths prevail and quite often remain widespread years after they have stopped being valid. So be careful when you state absolute truths about performance, be suspicious when you read them, and sometimes just try to verify.