fosskers/scala-benchmarks

Point of ArrayBuilder

mpilquist opened this issue · 12 comments

I’m not sure what the point of ArrayBuilder is.

I believe the point is building an array for which you don't know the final size. Presumably ArrayBuilder will give you better performance than first building something like a mutable.Buffer and then converting to an array (would be nice to confirm this assumption).

Aside: might be worth considering adding benchmarks for the various builders that use sizeHint.

I believe the point is building an array for which you don't know the final size.

Right, looks like Array appears faster here because we know the size ahead of time. Seems like ArrayBuilder boxes its contents? There's an order-of-magnitude slowdown relative to Array, and boxing is the usual culprit in my experience.

Hm, ArrayBuilder appears to avoid boxing based on a quick look at the source, but it looks like you have to call the appropriate constructor directly instead of calling make? Here's what I suspect, though I could be totally misreading the source:

scala> val bldr = new collection.mutable.ArrayBuilder.ofInt()
bldr: collection.mutable.ArrayBuilder.ofInt = ArrayBuilder.ofInt

scala> bldr += 1 // This doesn't box
res6: bldr.type = ArrayBuilder.ofInt

scala> val bldr2 = collection.mutable.ArrayBuilder.make[Int]
bldr2: scala.collection.mutable.ArrayBuilder[Int] = ArrayBuilder.ofInt

scala> bldr2 += 1 // This boxes
res7: bldr2.type = ArrayBuilder.ofInt

This highlights exactly one of my biggest issues with the standard collections: no bias. Users aren't steered toward or away from any particular collection or operator. Why does make exist if we're punished for reasonable use of it (vs ofInt)?

Oh I completely agree. It shouldn't be so easy to write really badly performing code. I'd love a collection library that was designed to make efficient use easy / inefficient use hard. I'd happily take that tradeoff over API consistency.

Amen brother.

@mpilquist I switched to ofInt, but am still seeing the same 5x slowdown compared to an Array of known size.

Hm I think that might bust the boxed/unboxed claim then. For comparison purposes, can you try using .sizeHint(arraySize)? This should hopefully avoid any successive copies as the array grows larger during += calls.

WIth sizeHint, where you hint the exact correct size of the final result:

[info] Benchmark                   Mode  Cnt     Score    Error  Units
[info] VectorBench.abuilder1000    avgt    5     0.601 ±  0.018  us/op
[info] VectorBench.abuilder10000   avgt    5     6.090 ±  0.100  us/op
[info] VectorBench.abuilder100000  avgt    5    64.772 ± 16.078  us/op
[info] VectorBench.array1000       avgt    5     0.591 ±  0.008  us/op
[info] VectorBench.array10000      avgt    5     5.397 ±  0.037  us/op
[info] VectorBench.array100000     avgt    5    53.866 ±  0.884  us/op

It probably pre-allocates a load of space in this case eh.

Awesome, so 5x penalty in the 100k case for not knowing the size upfront and no boxing/unboxing occurring. The 5x penalty is higher than I would have expected. Looks like ArrayBuilder starts at size 16 and then successively doubles, so you end up with 17 array copies for the 100k test.

More data. Here's when you hint 50% smaller than the final size:

[info] Benchmark                   Mode  Cnt     Score    Error  Units
[info] VectorBench.abuilder1000    avgt    5     2.018 ±  0.029  us/op
[info] VectorBench.abuilder10000   avgt    5    23.916 ±  0.193  us/op
[info] VectorBench.abuilder100000  avgt    5   199.556 ±  1.771  us/op
[info] VectorBench.array1000       avgt    5     0.589 ±  0.011  us/op
[info] VectorBench.array10000      avgt    5     5.396 ±  0.035  us/op
[info] VectorBench.array100000     avgt    5    53.998 ±  0.889  us/op

which is only barely faster than not hinting at all.

And when you over-estimate by a factor of 2:

[info] Benchmark                   Mode  Cnt     Score    Error  Units
[info] VectorBench.abuilder1000    avgt    5     1.145 ±  0.032  us/op
[info] VectorBench.abuilder10000   avgt    5    11.406 ±  0.152  us/op
[info] VectorBench.abuilder100000  avgt    5   121.283 ±  6.380  us/op
[info] VectorBench.array1000       avgt    5     0.589 ±  0.005  us/op
[info] VectorBench.array10000      avgt    5     5.389 ±  0.056  us/op
[info] VectorBench.array100000     avgt    5    53.987 ±  0.357  us/op

You still get punished, but only by a factor of 2.

More news: .sizeHint doesn't help when you're working with Arrays of classes.