Measuring print loops

On February 23rd, 2020, the Twitter account @java tweeted out the following:

A simple tip to print the Array or List with index in the front. #Java

http://oracl.info/xu1B50ypCAU

— @java on Twitter
https://twitter.com/java/status/1231443603102294019

While this does of course work, it may not be the best solution (both in terms of readability and in terms of performance).

Other suggestions made in the replies were:

Suggestion by Colin MacLean in this tweet

int i = 0;
for (String name : names) {
   System.out.println(i++ + “: “ + name);
}

and

Suggestion by me in this tweet

for (int i=0; i < names.length; i++) {
  System.out.println(i + ": " + names[i]);
}

Then, Carsten Reckord suggested, that someone writes a microbenchmark to compare the three options. This is that microbenchmark suite.

ℹ️

I decided to add one further alternative, which uses streams and may be more comparable to the other two solutions:

IntStream.range(0, names.length)
        .mapToObj(index -> index + ": " + names[index])
        .forEach(blackhole::consume);

The original stream solution looped through the names twice (once when mapping them and once when printing them). This solution only goes through them once (since forEach is the only terminal stream operation, whereas in the original solution there was also collect.)

This alternative is not contained in the original measurements below - in part because I only thought of this problem after I had already started measuring and in part because every new benchmark increases the runtime considerably. I will add the results for this new benchmark in separate gists.

Implementation

This microbenchmark suite uses JMH, a “Java harness for building, running, and analysing nano/micro/milli/macro benchmarks written in Java and other languages targetting the JVM.” ^[1]

Rather than using System.out.println(..) (which will synchronously output the given string to stdout by default), I decided to use an instance of Blackhole#consume(Object) instead. This will prevent dead-code elimination by the runtime. The blackhole objects in the following code are instances of Blackhole.

With that, we have three implementations:

The original suggestion using Streams

List<String> collect = IntStream.range(0, names.length)
        .mapToObj(index -> index + ": " + names[index])
        .collect(Collectors.toList());

collect.forEach(blackhole::consume);

The suggestion by Coling MacLean using an enhanced for loop

int i = 0;
for (String name : names) {
    blackhole.consume(i++ + ": " + name);
}

My suggestion using a basic for loop with an index

for (int i = 0; i < names.length; i++) {
            blackhole.consume(i + ": " + names[i]);
        }

Running these benchmarks

The benchmarks can be compiled with Maven:

$ ./mvnw clean package -DskipTests

for the projects default Java version (8) to be used or

$ ./mvnw clean package -DskipTests -Djava.version=<java version>

where <java version> can be anything between 8 and 14 (or maybe later), assuming you have a matching JDK available to Maven. To check the Java version available to Maven, you can run:

$ ./mvnw -version
Apache Maven 3.6.3 (cecedd343002696d0abb50b32b541b8a6ba2883f)
Maven home: <home path>/.m2/wrapper/dists/apache-maven-3.6.3-bin/1iopthnavndlasol9gbrbg6bf2/apache-maven-3.6.3
Java version: 13.0.1, vendor: AdoptOpenJDK, runtime: /usr/lib/jvm/adoptopenjdk-13-hotspot-amd64
Default locale: de_DE, platform encoding: UTF-8
OS name: "linux", version: "5.3.0-40-generic", arch: "amd64", family: "unix"

To change the projects default Java version, you can modify the java.version property in the pom.xml:

<properties>
    <!--
        Java version to use for compilation.
        Possible values: 8, 9, 10, 11, 12, 13, 14...
      -->
    <java.version>8</java.version>
<properties>

The benchmarks can then be run with:

$ java -jar target/benchmarks.java

⚠️	Running these benchmarks takes a long time. The total completion time for one call for me (with the original 3 benchmarks) was about 50 minutes. Having added a fourth benchmark (as described here) the time is even longer (estimated: 1:06:40 hours).

Alternatively, a version of the tests with fewer runs can also be run with

$ ./mvnw test

ℹ️

The benchmarks run with ./mvnw test are defined in the MyBenchmarkTest.java class. These tests have the following run parameters:

Options opt = new OptionsBuilder()
        .measurementTime(TimeValue.seconds(1))
        .measurementIterations(3)
        .forks(3)
        .include(MyBenchmark.class.getSimpleName())
        .build();

In comparison, the full test suite has the following definitions:

@Measurement(
        iterations = 5,
        time = 10,
        timeUnit = TimeUnit.SECONDS
)
@Fork(5)

This reduction reduces the time to run all tests to an estimated 21:12 minutes.

CI job

This project is set up to be built in Travis CI. The build job can be found here, including the measurements (using ./mvnw test) for multiple JDK versions.

Measurements

I ran all benchmarks with multiple Java versions. There are two variants (which is what the listVariant output refers to):

FIVE_NAMES

The original five names from the tweet: {"Java", "Node", "JavaScript", "Rust", "Go"}

AUTO_GENERATED_NAMES

Normally, you wouldn’t run this kind of algorithm on just 5 names. So, to make the whole thing a bit more realistic, I had the benchmark generate 1000 names:

// 1000 new names
final String[] firstNames = {"Alice", "Bob", "Charles", "Dora", "Emanuel", "Fabienne", "George", "Hannelore", "Igor", "Janice"};
final String[] middleNames = {"Kim", "Landry", "Maria", "Nikita", "Oakley", "Perry", "Quin", "Robin", "Skyler", "Taylen"};
final String[] surnames = {"Underhill", "Vaccanti", "Wilson", "Xanders", "Yallopp", "Zabawa", "Anderson", "Bell", "Carter", "Diaz"};
names = Arrays.stream(firstNames)
        .flatMap(firstName -> Arrays.stream(middleNames).map(middleName -> firstName + " " + middleName))
        .flatMap(firstAndMiddleName -> Arrays.stream(surnames).map(surname -> firstAndMiddleName + " " + surname))
        .toArray(String[]::new);

This name generation occurs within a function annotated with @Setup(Leve.Trial). This means, that the array is generated before each trial and this generation will not be included in the measurement itself.

With Java 8

Java version information for Java 8

$ java -version
openjdk version "1.8.0_242"
OpenJDK Runtime Environment (build 1.8.0_242-8u242-b08-0ubuntu3~19.10-b08)
OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)

Benchmark results

Benchmark                                   (listVariant)   Mode  Cnt         Score        Error  Units
MyBenchmark.mapAndPrintWithStreams             FIVE_NAMES  thrpt   25   5207121,355 ± 276740,844  ops/s
MyBenchmark.mapAndPrintWithStreams   AUTO_GENERATED_NAMES  thrpt   25     27035,503 ±   2851,299  ops/s
MyBenchmark.printInEnhancedForLoop             FIVE_NAMES  thrpt   25  10380706,822 ± 786963,780  ops/s
MyBenchmark.printInEnhancedForLoop   AUTO_GENERATED_NAMES  thrpt   25     38088,946 ±    348,914  ops/s
MyBenchmark.printInOldSchoolForLoop            FIVE_NAMES  thrpt   25  10453896,992 ± 796088,166  ops/s
MyBenchmark.printInOldSchoolForLoop  AUTO_GENERATED_NAMES  thrpt   25     38599,330 ±    358,688  ops/s

For the original, shorter list, both for loops had about twice the throughput compared to the stream. With the much longer array of names, the enhanced for loop clocked in at about 1.41 times the throughput of the stream solution basic for loop managed about 1.43 times the throughput.

The complete output can be found in this gist.

With Java 11

Java version information for Java 11

$ java -version
openjdk version "11.0.6" 2020-01-14
OpenJDK Runtime Environment (build 11.0.6+10-post-Ubuntu-1ubuntu119.10.1)
OpenJDK 64-Bit Server VM (build 11.0.6+10-post-Ubuntu-1ubuntu119.10.1, mixed mode, sharing)

Benchmark results

Benchmark                                   (listVariant)   Mode  Cnt        Score         Error  Units
MyBenchmark.mapAndPrintWithStreams             FIVE_NAMES  thrpt   25  4296358,352 ±  545014,011  ops/s
MyBenchmark.mapAndPrintWithStreams   AUTO_GENERATED_NAMES  thrpt   25    26803,661 ±    3479,115  ops/s
MyBenchmark.printInEnhancedForLoop             FIVE_NAMES  thrpt   25  9348700,863 ± 1023949,109  ops/s
MyBenchmark.printInEnhancedForLoop   AUTO_GENERATED_NAMES  thrpt   25    37878,684 ±    3026,570  ops/s
MyBenchmark.printInOldSchoolForLoop            FIVE_NAMES  thrpt   25  9819339,718 ± 1230652,392  ops/s
MyBenchmark.printInOldSchoolForLoop  AUTO_GENERATED_NAMES  thrpt   25    39678,611 ±    3593,970  ops/s

For the original, shorter list, the enhanced for loop had about 2.18 the throughput compared to the stream while the basic for loop had about 2.29 times the throughput. With the much longer array of names, the enhanced for loop clocked in at about 1.41 times the throughput of the stream solution basic for loop managed about 1.48 times the throughput.

The complete output can be found in this gist.

With Java 13

Java version information for Java 13

$ java -version
openjdk version "13.0.1" 2019-10-15
OpenJDK Runtime Environment AdoptOpenJDK (build 13.0.1+9)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 13.0.1+9, mixed mode, sharing)

Benchmark results

Benchmark                                   (listVariant)   Mode  Cnt        Score        Error  Units
MyBenchmark.mapAndPrintWithStreams             FIVE_NAMES  thrpt   25  5355473,456 ± 527496,942  ops/s
MyBenchmark.mapAndPrintWithStreams   AUTO_GENERATED_NAMES  thrpt   25    26485,990 ±   2196,793  ops/s
MyBenchmark.printInEnhancedForLoop             FIVE_NAMES  thrpt   25  8839903,376 ± 991895,966  ops/s
MyBenchmark.printInEnhancedForLoop   AUTO_GENERATED_NAMES  thrpt   25    35993,876 ±   2339,770  ops/s
MyBenchmark.printInOldSchoolForLoop            FIVE_NAMES  thrpt   25  9110027,753 ± 678835,086  ops/s
MyBenchmark.printInOldSchoolForLoop  AUTO_GENERATED_NAMES  thrpt   25    35559,584 ±   2999,247  ops/s

For the original, shorter list, the enhanced for loop had about 1.65 the throughput compared to the stream while the basic for loop had about 1.70 times the throughput. With the much longer array of names, the enhanced for loop clocked in at about 1.36 times the throughput of the stream solution basic for loop managed about 1.34 times the throughput.

The complete output can be found in this gist.

With Java 14 (early access)

Java version information for Java 14 (early access)

$ java -version
openjdk version "14-ea" 2020-03-17
OpenJDK Runtime Environment (build 14-ea+27-1339)
OpenJDK 64-Bit Server VM (build 14-ea+27-1339, mixed mode, sharing)

Benchmark results

Benchmark                                   (listVariant)   Mode  Cnt        Score        Error  Units
MyBenchmark.mapAndPrintWithStreams             FIVE_NAMES  thrpt   25  4755692,653 ± 410854,229  ops/s
MyBenchmark.mapAndPrintWithStreams   AUTO_GENERATED_NAMES  thrpt   25    24668,440 ±   2109,834  ops/s
MyBenchmark.printInEnhancedForLoop             FIVE_NAMES  thrpt   25  8991645,685 ± 841362,203  ops/s
MyBenchmark.printInEnhancedForLoop   AUTO_GENERATED_NAMES  thrpt   25    32641,574 ±   3623,476  ops/s
MyBenchmark.printInOldSchoolForLoop            FIVE_NAMES  thrpt   25  8576218,147 ± 790441,689  ops/s
MyBenchmark.printInOldSchoolForLoop  AUTO_GENERATED_NAMES  thrpt   25    27693,151 ±   2840,307  ops/s

For the original, shorter list, the enhanced for loop had about 1.89 the throughput compared to the stream while the basic for loop had about 1.80 times the throughput. With the much longer array of names, the enhanced for loop clocked in at about 1.32 times the throughput of the stream solution basic for loop managed about 1.12 times the throughput.

The complete output can be found in this gist.

Conclusions from those measurements

Both for loops were considerably faster than the stream solution suggested in the original tweet (about twice the throughput for the short list and about 1.4 times the throughput for the much longer list). In both cases, the enhanced for loop was slightly better than the basic for loop.

The exact measurements may vary depending on a number of factors, including but not limited to:

the hardware used to run the benchmarks
the Java version used to run the benchmarks
other processes running at the same time as the benchmarks

The finding, that both for loops are faster than the solution using streams however is very likely to hold up, even taking those considerations into account.

Other considerations

One major point to be considered when writing code is the readability of this code - ideally, code should be easy to understand. Or, to quote Martin Fowler:

Any fool can write code that a computer can understand. Good programmers write code that humans can understand.

— Martin Fowler
Refactoring: Improving the Design of Existing Code, 1999, p. 15

This microbenchmark suite can not (and does not try to) measure readability, in part because this is subjective. However, here are my thoughts on the three implementations.

The original suggestion using Streams

List<String> collect = IntStream.range(0, names.length)
        .mapToObj(index -> index + ": " + names[index])
        .collect(Collectors.toList());

collect.forEach(blackhole::consume);

This has a total of 4 lines of code (not counting the empty line). To understand what it’s doing, you have to understand how the stream is created, which and how many elements are contained in that stream, how the mapping works and how the collection works.

The suggestion by Coling MacLean using an enhanced for loop

int i = 0;
for (String name : names) {
    blackhole.consume(i++ + ": " + name);
}

This also has a total of 4 lines of code. To understand what it’s doing, you have to understand how an enhanced for loop^[2] works and when the index is incremented for i++.

My suggestion using a basic for loop with an index

for (int i = 0; i < names.length; i++) {
    blackhole.consume(i + ": " + names[i]);
}

This has a total of 3 lines of code. To understand what it’s doing, you have to understand how a basic for loop works and when the index is incremented for i++.

I am of the opinion (and you may disagree on this), that the basic for loop is the best choice here for the following reasons:

it has the fewest lines of code while still being easy to understand
it requires only very basic knowledge of the Java language and no APIs (such as the Stream API)
it makes explicit use of the index i, not only to retrieve the item from the array but also to be part of the result

1. The article Avoiding Benchmarking Pitfalls on the JVM from July 2014 may be old, but it explains why using JMH helps avoid common mistakes when writing microbenchmarks.

2. also sometimes called a for each loop

blalasaadri/print-loop-measurements

Measuring print loops

Implementation

Running these benchmarks

CI job

Measurements

With Java 8

With Java 11

With Java 13

With Java 14 (early access)

Conclusions from those measurements

Other considerations