DoS vulnerability in Scala 2.12 HashMap
In 2011, a vulnerability was reported against Java application servers describing a DoS attack that exploited the predictability of Java String hash codes to force collisions. The problem was so widespread and fundamental that it was fixed not in the application servers but in the JDK's `HashMap` implementation.

Scala's `HashMap` has the same vulnerability. This was brought to our attention through an issue raised against play-json.

The vulnerability doesn't just affect play-json. It affects anything that uses Scala's default map implementation to store String-keyed data where the keys are controlled remotely: HTTP headers, HTTP forms, JSON. Any library that uses Scala's `Map` for any of these is vulnerable, which includes Play, Akka HTTP, and many, many other Scala libraries.
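To make the attack concrete, here is a small illustrative sketch (not from the original report) that builds an arbitrary number of distinct `String` keys sharing one hash code:

```scala
// "Aa" and "BB" have the same String.hashCode (2112), so every concatenation
// of n such two-character blocks yields the same hash: 2^n colliding keys.
def collidingStrings(blocks: Int): Seq[String] =
  (0 until (1 << blocks)).map { i =>
    (0 until blocks).map(b => if (((i >> b) & 1) == 0) "Aa" else "BB").mkString
  }

// In Scala 2.12, all 1024 keys below land in a single collision bucket of the
// immutable HashMap, so lookups and updates degrade to linear scans.
val poisoned: Map[String, Int] = collidingStrings(10).map(_ -> 0).toMap
```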
The fix the JDK made is quite simple: when a bucket in the hash table grows too large due to poor hashing (or malicious collisions), it switches to essentially a `TreeMap` for that bucket, and if the keys are determined (by reflection) to be `Comparable`, it uses their `compareTo` method to compare them.
Currently, we use `ListMap` in the case of collisions:

```scala
// 32-bit hash collision (rare, but not impossible)
new HashMapCollision1(hash, ListMap.empty.updated(this.key, this.value).updated(key, value))
```
Obviously that comment is wrong: if you're under attack, collisions won't be rare at all. I think a simple solution here would be to modify `HashMapCollision1` so that it holds both a `ListMap` and a `TreeMap`; anything that implements `Comparable` can be put into and queried from the `TreeMap`, and everything else goes in the `ListMap`.
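A rough sketch of what that split could look like (the name `CollisionBucket` and its shape are hypothetical, not the actual `HashMapCollision1` internals; it deliberately ignores the mutual-comparability problem raised in the next comment):

```scala
import scala.collection.immutable.{ListMap, TreeMap}

// Comparable keys go into a TreeMap (O(log n) on collisions); everything else
// stays in a ListMap (O(n), as today).
final class CollisionBucket[K, V] private (
    ordered: TreeMap[Comparable[Any], V],
    linear: ListMap[K, V]
) {
  def updated(key: K, value: V): CollisionBucket[K, V] = key match {
    case c: Comparable[_] =>
      new CollisionBucket(ordered.updated(c.asInstanceOf[Comparable[Any]], value), linear)
    case _ =>
      new CollisionBucket(ordered, linear.updated(key, value))
  }

  def get(key: K): Option[V] = key match {
    case c: Comparable[_] => ordered.get(c.asInstanceOf[Comparable[Any]])
    case _                => linear.get(key)
  }
}

object CollisionBucket {
  // Delegate ordering to compareTo; this assumes all Comparable keys in one
  // bucket are mutually comparable (see the ClassCastException objection below).
  private implicit val byCompareTo: Ordering[Comparable[Any]] =
    new Ordering[Comparable[Any]] {
      def compare(x: Comparable[Any], y: Comparable[Any]): Int = x.compareTo(y)
    }

  def empty[K, V]: CollisionBucket[K, V] =
    new CollisionBucket(TreeMap.empty[Comparable[Any], V], ListMap.empty[K, V])
}
```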
I don't think there's much consequence to doing this. The biggest impact would be a potential change in the iteration order of colliding elements when you merge two maps: if the keys implement `Comparable`, the ordering changes from "keys from the first map followed by keys from the second map" to lexical ordering, and if you mix `Comparable` and non-`Comparable` keys, the ordering gets a bit weirder. But it's still stable. And who really depends on ordering in hash maps? It only affects ordering when there are collisions. There's also a slight increase in space used, but again only for collisions; for 99.9999% of the world it will have no impact.
How do you suggest we test whether something is comparable? Even if it is an instance of `Comparable`, one cannot know what the valid arguments to its `compareTo` method are. Invalid arguments will throw `ClassCastException`s, so you can't simply create a `TreeMap`.
Also, that would fix it for Strings, but not for any custom case class that contains a String: those suffer from the same poor hash but aren't Comparable anyway.
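To illustrate the `ClassCastException` concern above: two keys can each be `Comparable` without being comparable to each other.

```scala
// Both are Comparable, but String.compareTo expects a String: comparing them
// through the erased Comparable interface throws at runtime.
val a = "foo".asInstanceOf[Comparable[Any]]
val b = Integer.valueOf(42)
a.compareTo(b) // java.lang.ClassCastException
```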
Another option would be to use a random seed: https://doc.rust-lang.org/std/collections/struct.HashMap.html
(some background on SipHash: http://131002.net/siphash/siphash.pdf, see in particular section 7)
... but I guess any kind of randomness wouldn't play well with sending things across the wire through serialization though
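For illustration, seeded hashing for String keys could look roughly like the sketch below (the wrapper type and seed handling are made up; as noted, the hash is then no longer stable across JVM instances):

```scala
import scala.util.hashing.MurmurHash3

object SeededKey {
  // Per-JVM random seed: collisions can no longer be precomputed offline,
  // but hash codes differ between JVM instances.
  private val seed: Int = new java.security.SecureRandom().nextInt()
}

final case class SeededKey(value: String) {
  override def hashCode: Int = MurmurHash3.stringHash(value, SeededKey.seed)
}

// usage: key the map by SeededKey instead of the raw String
val headers = Map(SeededKey("Content-Type") -> "application/json")
```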
Here's an overview of the current situation (as of 2.13.0-M5): all our hash maps and hash sets are affected (and they all need to be fixed individually).
[info] Benchmark (comparable) (size) Mode Cnt Score Error Units
[info] HashCollisionBenchmark.champHashMap false 10 avgt 10 3087.112 ± 196.344 ns/op
[info] HashCollisionBenchmark.champHashMap false 100 avgt 10 82826.466 ± 1242.476 ns/op
[info] HashCollisionBenchmark.champHashMap false 1000 avgt 10 6008195.291 ± 83021.908 ns/op
[info] HashCollisionBenchmark.champHashMap false 10000 avgt 10 799354669.450 ± 42798033.928 ns/op
[info] HashCollisionBenchmark.champHashMap true 10 avgt 10 3206.333 ± 66.366 ns/op
[info] HashCollisionBenchmark.champHashMap true 100 avgt 10 83229.784 ± 1723.577 ns/op
[info] HashCollisionBenchmark.champHashMap true 1000 avgt 10 6023586.625 ± 99315.284 ns/op
[info] HashCollisionBenchmark.champHashMap true 10000 avgt 10 801143110.200 ± 37457371.571 ns/op
[info] HashCollisionBenchmark.champHashSet false 10 avgt 10 2661.455 ± 34.165 ns/op
[info] HashCollisionBenchmark.champHashSet false 100 avgt 10 47023.016 ± 502.364 ns/op
[info] HashCollisionBenchmark.champHashSet false 1000 avgt 10 4734575.751 ± 49540.034 ns/op
[info] HashCollisionBenchmark.champHashSet false 10000 avgt 10 451249571.900 ± 4053862.848 ns/op
[info] HashCollisionBenchmark.champHashSet true 10 avgt 10 2682.528 ± 59.833 ns/op
[info] HashCollisionBenchmark.champHashSet true 100 avgt 10 47751.985 ± 751.482 ns/op
[info] HashCollisionBenchmark.champHashSet true 1000 avgt 10 4830274.370 ± 87407.950 ns/op
[info] HashCollisionBenchmark.champHashSet true 10000 avgt 10 460022326.500 ± 5370622.228 ns/op
[info] HashCollisionBenchmark.javaHashMap false 10 avgt 10 435.207 ± 4.483 ns/op
[info] HashCollisionBenchmark.javaHashMap false 100 avgt 10 78052.174 ± 1002.036 ns/op
[info] HashCollisionBenchmark.javaHashMap false 1000 avgt 10 6690602.312 ± 132158.499 ns/op
[info] HashCollisionBenchmark.javaHashMap false 10000 avgt 10 1064348588.700 ± 21725697.542 ns/op
[info] HashCollisionBenchmark.javaHashMap true 10 avgt 10 435.106 ± 5.557 ns/op
[info] HashCollisionBenchmark.javaHashMap true 100 avgt 10 10339.026 ± 199.182 ns/op
[info] HashCollisionBenchmark.javaHashMap true 1000 avgt 10 138461.502 ± 5488.686 ns/op
[info] HashCollisionBenchmark.javaHashMap true 10000 avgt 10 1383198.954 ± 19537.305 ns/op
[info] HashCollisionBenchmark.javaHashSet false 10 avgt 10 431.236 ± 5.965 ns/op
[info] HashCollisionBenchmark.javaHashSet false 100 avgt 10 78141.561 ± 1103.317 ns/op
[info] HashCollisionBenchmark.javaHashSet false 1000 avgt 10 6740530.754 ± 90071.057 ns/op
[info] HashCollisionBenchmark.javaHashSet false 10000 avgt 10 1065812482.700 ± 21146822.799 ns/op
[info] HashCollisionBenchmark.javaHashSet true 10 avgt 10 429.406 ± 8.252 ns/op
[info] HashCollisionBenchmark.javaHashSet true 100 avgt 10 10467.322 ± 252.287 ns/op
[info] HashCollisionBenchmark.javaHashSet true 1000 avgt 10 142079.811 ± 3196.623 ns/op
[info] HashCollisionBenchmark.javaHashSet true 10000 avgt 10 1667129.061 ± 413304.829 ns/op
[info] HashCollisionBenchmark.mutableHashMap false 10 avgt 10 322.683 ± 5.309 ns/op
[info] HashCollisionBenchmark.mutableHashMap false 100 avgt 10 19328.574 ± 604.139 ns/op
[info] HashCollisionBenchmark.mutableHashMap false 1000 avgt 10 1335647.124 ± 34504.419 ns/op
[info] HashCollisionBenchmark.mutableHashMap false 10000 avgt 10 130574061.538 ± 2898111.233 ns/op
[info] HashCollisionBenchmark.mutableHashMap true 10 avgt 10 334.757 ± 30.975 ns/op
[info] HashCollisionBenchmark.mutableHashMap true 100 avgt 10 19717.810 ± 880.020 ns/op
[info] HashCollisionBenchmark.mutableHashMap true 1000 avgt 10 1349523.554 ± 61111.205 ns/op
[info] HashCollisionBenchmark.mutableHashMap true 10000 avgt 10 136440902.025 ± 8937760.447 ns/op
[info] HashCollisionBenchmark.mutableHashSet false 10 avgt 10 210.637 ± 10.214 ns/op
[info] HashCollisionBenchmark.mutableHashSet false 100 avgt 10 51278.552 ± 862.355 ns/op
[info] HashCollisionBenchmark.mutableHashSet false 1000 avgt 10 7254599.344 ± 77418.016 ns/op
[info] HashCollisionBenchmark.mutableHashSet false 10000 avgt 10 582626129.000 ± 22068501.128 ns/op
[info] HashCollisionBenchmark.mutableHashSet true 10 avgt 10 207.018 ± 3.522 ns/op
[info] HashCollisionBenchmark.mutableHashSet true 100 avgt 10 50332.322 ± 722.652 ns/op
[info] HashCollisionBenchmark.mutableHashSet true 1000 avgt 10 7361789.516 ± 153332.591 ns/op
[info] HashCollisionBenchmark.mutableHashSet true 10000 avgt 10 581922084.150 ± 10862902.903 ns/op
[info] HashCollisionBenchmark.oldHashMap false 10 avgt 10 1509.231 ± 92.854 ns/op
[info] HashCollisionBenchmark.oldHashMap false 100 avgt 10 59816.860 ± 29110.914 ns/op
[info] HashCollisionBenchmark.oldHashMap false 1000 avgt 10 8008288.752 ± 82106.263 ns/op
[info] HashCollisionBenchmark.oldHashMap false 10000 avgt 10 1180388228.050 ± 554803591.812 ns/op
[info] HashCollisionBenchmark.oldHashMap true 10 avgt 10 1447.969 ± 33.201 ns/op
[info] HashCollisionBenchmark.oldHashMap true 100 avgt 10 59973.289 ± 28878.087 ns/op
[info] HashCollisionBenchmark.oldHashMap true 1000 avgt 10 8123624.173 ± 177455.498 ns/op
[info] HashCollisionBenchmark.oldHashMap true 10000 avgt 10 1117844772.850 ± 476160439.425 ns/op
[info] HashCollisionBenchmark.oldHashSet false 10 avgt 10 341.151 ± 6.199 ns/op
[info] HashCollisionBenchmark.oldHashSet false 100 avgt 10 15054.079 ± 659.371 ns/op
[info] HashCollisionBenchmark.oldHashSet false 1000 avgt 10 4705176.794 ± 92712.245 ns/op
[info] HashCollisionBenchmark.oldHashSet false 10000 avgt 10 469873977.200 ± 8556272.917 ns/op
[info] HashCollisionBenchmark.oldHashSet true 10 avgt 10 336.080 ± 5.549 ns/op
[info] HashCollisionBenchmark.oldHashSet true 100 avgt 10 15147.252 ± 1076.673 ns/op
[info] HashCollisionBenchmark.oldHashSet true 1000 avgt 10 4761012.752 ± 77525.665 ns/op
[info] HashCollisionBenchmark.oldHashSet true 10000 avgt 10 463202453.300 ± 37754187.404 ns/op
BTW, we use different collision resolution schemes in our implementation. I haven't looked at all the immutable maps and sets yet, but I just started working on mutable HashMaps and HashSets (independently of this ticket). `mutable.HashMap` uses separate chaining with linked lists; this could be changed to tree-based chaining while keeping the general performance/memory characteristics. `mutable.HashSet` uses open addressing, where the proposed fix doesn't apply. We'd have to change it to use separate chaining, too.
I think the Java solution was probably expedient for Java, but it isn't inherently the right way to fix the issue. If you want a `HashMap` that is robust to an attack on the hash values of the keys, you can't guarantee success if you put off until runtime the attempt to use `compareTo`.
Instead, we need a new type of map where the keys are orderable but are not necessarily ordered. (Likewise with sets.) Then we have a compile-time guarantee (so long as the ordering is sensible) that the DoS cannot occur. This is far better for people wishing to produce robust systems, I think.
Alternatively, if we want a quick, hacky fix, I'd just intercept the hashing of `String` and use a better hash on it to reduce the risk. Changing our map data structures to include a tree, which won't even work in general and doesn't play nice with things that can be ordered via a typeclass but don't have the `Comparable` trait baked in, is not a direction I think we should go in.
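The API shape of the "orderable but not ordered" map suggested above might look something like this (name and design are purely hypothetical; only the interface matters, and the `TreeMap` backing is just a placeholder for a real ordering-aware hash map):

```scala
import scala.collection.immutable.TreeMap

// Requiring an Ordering at construction time gives a compile-time guarantee
// that collisions can always be resolved by comparison, without promising
// that iteration happens in key order.
final class OrderedKeyMap[K, V] private (underlying: TreeMap[K, V]) {
  def updated(key: K, value: V): OrderedKeyMap[K, V] =
    new OrderedKeyMap(underlying.updated(key, value))
  def get(key: K): Option[V] = underlying.get(key)
}

object OrderedKeyMap {
  def empty[K: Ordering, V]: OrderedKeyMap[K, V] =
    new OrderedKeyMap(TreeMap.empty[K, V])
}
```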
discussion at https://gitter.im/scala/contributors?at=5bc1dbe41e23486b93b70784 ("Do people expect hash maps to be secure against collisions of untrusted data?")
Customizable hashing is another option. If hash collections allowed user-defined hashing methods, you could use a more secure seeded hash for security-critical applications.
And no matter which way we go, 2.12.8 as a target is probably not possible (at least not for all affected collection types). Even with the magical behind-the-scenes hack that Java uses you need to change the internal data structures, which are not actually internal. They are exposed with package-private visibility and actively used by libraries like https://github.com/scala/scala-java8-compat/
If this is only (mostly) about the DOS attack vectors, could we just limit the number of allowed collisions and fail if there's a (configurable) suspicious ratio of collisions?
> ... but I guess any kind of randomness wouldn't play well with sending things across the wire through serialization though

You mean, under the expectation that you can compare hashCodes across JVM instances? Otherwise, you don't really have to transfer hash codes with serialization; you can just recalculate them when deserializing.
I really don't want my hash maps to have arbitrary, implementation-defined failures. If inserting certain elements starts failing in some way then my application is crippled, and this can still cause denial of service.
> You mean, under the expectation that you can compare hashCodes across JVM instances?

Yes, no clue what would break if this assumption is violated.
> I really don't want my hash maps to have arbitrary, implementation-defined failures. If inserting certain elements starts failing in some way then my application is crippled, and this can still cause denial of service.
It's not the same failure mode, though. It's like rejecting inputs that would lead to an OOM preventively with an exception. In this case, it would throw an exception if input was detected which has characteristics that don't fit the assumptions about your data when you chose a HashMap in the first place (i.e. that hashCodes are uniformly distributed). In which way would that cripple the application on innocent inputs?
Imagine a persistent hash map shared between multiple users: if I fill it with enough malicious data, then any attempt by other users to add something to it would throw an exception.
> Imagine a persistent hash map shared between multiple users: if I fill it with enough malicious data, then any attempt by other users to add something to it would throw an exception.

Probably not, because while adding malicious data would push the ratio (of collisions per size) towards the limit, adding any other data would on average move the ratio away from it.
spray-json switched to `TreeMap`: spray/spray-json#277 ("for reasonably sized objects with a size < 100, a TreeMap is only ~6% slower than a HashMap")
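For reference, that collision-proof alternative already exists in 2.12: a `TreeMap` keyed by `String` never touches `hashCode`, so lookups stay O(log n) even on adversarial keys (the field names here are just example data).

```scala
import scala.collection.immutable.TreeMap

// No hashing involved, so crafted keys cannot degrade it past O(log n),
// at the cost of being somewhat slower than a HashMap on benign data.
val fields: TreeMap[String, String] =
  TreeMap("id" -> "42", "name" -> "example")
```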
Summary of our discussions.
2.12.8

- We could change HashMap to reflectively identify keys of classes `C with Comparable[C]` and use trees instead of linked lists for collisions
  - The implementation complexity is high. A `HashMap[AnyRef, ...]` could have a mix of keys, in which case the implementation has to fall back to linked lists. The implementation needs to support lists and trees, and switch between these representations depending on the values added.
  - This would need to be done for many collection types.
  - Updating scala-java8-compat to work with the new implementations would also be a lot of work.
  - The change would cause a significant performance regression in a minor Scala release. (@szeiger is currently working on faster RB trees (mutable and immutable), but for 2.13. Backporting this to 2.12 would be even more work, also for scala-java8-compat.)
  - The scope is limited. It would work for Strings, but Scala-defined types usually don't extend Comparable.
- Alternative to changing existing collections: we could introduce a new collection type that is used for `Map(...)`/`Map.empty` and `Set(...)`/`Set.empty`
  - Again, this has an impact on the performance of existing programs.
  - We could not expose that new type (forwards binary compatibility). Users who want to ensure they use the new implementation have to rely on these factories doing the right thing, but cannot enforce it in the types.
  - Same limited scope.
- @retronym said (I hope not to twist his words) he thinks the issue is well-known, and programmers would select the right data structures in security-sensitive cases.

What would be our recommendations for 2.12?

- Use the Java implementations (with Scala wrappers?) for String keys?
- Should we backport the new implementations requiring an `Ordering` from 2.13 (once they exist) and release them as a separate library?
> Use the Java implementations (with Scala wrappers?) for String keys?
How would you do it in practice? Wrapping a mutable collection would mean copying the whole underlying map on each update?
Did you actually consider adding a limit to the number of collisions? I'd rather have guaranteed performance characteristics than degrading performance (and added complexity) just to cover adverse cases.
I guess it would need a bit of research to find a reasonable limit on the number of collisions.
I didn't find a closed-form formula for the probability of finding at least one k-way collision when sorting 2^31 elements (the maximum collection size) into 2^32 bins (hashCode gives 32-bit values). I sorted 100M random numbers into 200M bins and found a few bins with 8 collisions. Obviously that's not a proper experiment, as Java's hashCodes are notoriously badly distributed. Java's HashMap chose a bucket length of 8 before switching to a TreeMap, but probably more because linear scanning may simply be cheaper up to that point. If we limited it to, say, 100 (even hard-coding that limit), we would have a bit of leeway for badly distributed hashCodes but would still avoid the really expensive cases.
You may find this limit arbitrary but I'd say it's not more arbitrary than the 2^31 size limit or other limits.
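A minimal sketch of that idea (the helper name and the limit of 100 are illustrative, not a proposed API): check a batch of untrusted keys before building a hash map from them.

```scala
// Rejects the input if too many keys share a single hashCode. Colliding keys
// collapse into a single group here, so the check itself stays cheap.
def assertNoExcessiveCollisions(keys: Iterable[String], limit: Int = 100): Unit = {
  val worstBucket =
    keys.groupBy(_.hashCode).valuesIterator.map(_.size).foldLeft(0)(_ max _)
  if (worstBucket > limit)
    throw new IllegalArgumentException(
      s"$worstBucket keys share one hashCode (limit: $limit)")
}
```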
> @retronym said (I hope not to twist his words) he thinks the issue is well-known, and programmers would select the right data structures in security-sensitive cases.
I think that has already been somewhat disproven with just this set of cases.
> How would you do it in practice? Wrapping a mutable collection would mean copying the whole underlying map on each update?

Wrapping the Java implementation is the recommendation for mutable maps. The best we have in 2.12 for immutable maps is `TreeMap`, which may or may not be an acceptable alternative performance-wise.
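Concretely, in 2.12 that could look like the sketch below (the key/value types are just examples):

```scala
import scala.collection.JavaConverters._
import scala.collection.immutable.TreeMap

// Mutable: wrap the JDK HashMap, which since JDK 8 treeifies large buckets
// of Comparable keys, so String keys are safe against collision floods.
val mutableSafe: scala.collection.mutable.Map[String, String] =
  new java.util.HashMap[String, String]().asScala

// Immutable: TreeMap avoids hashing entirely.
val immutableSafe = TreeMap.empty[String, String]
```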
@SethTisue Shouldn't this be kept open until we at least have proper user documentation on which maps are "DoS-safe" and which aren't?
I wouldn't be opposed to re-opening it, but I don't think it's a blocker for 2.12.8. @lrytz?
Could reschedule to 2.12.9.
It's not blocking 2.12.8. I think @szeiger is working on an ordering-based implementation for 2.13?
> I think @szeiger is working on an ordering-based implementation for 2.13?
scala/scala#7633 was merged and included in 2.13.0
Perhaps a pull request backporting `CollisionProofHashMap` to 2.12 and 2.11 would be accepted as an addition to https://github.com/scala/scala-collection-compat
@SethTisue I have created a ticket on your behalf: scala/scala-collection-compat#234
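For anyone already on 2.13, a minimal usage sketch of `CollisionProofHashMap` (it requires an `Ordering` on the keys, which `String` has):

```scala
import scala.collection.mutable.CollisionProofHashMap

// Falls back to a red-black tree inside a bucket when too many keys collide,
// keeping worst-case lookup/insert logarithmic even under a collision attack.
val headers = CollisionProofHashMap.empty[String, String]
headers += "Content-Type" -> "application/json"
```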
Currently any Scala collection is vulnerable when affected maps or sets are used internally, in methods like `toMap`, `toSet`, `keys`, `keySet`, `distinct`, `groupBy`, `--`, etc.
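A small illustration of that point: none of the calls below mention `Map` or `Set` explicitly, yet all of them hash the (here attacker-controlled) strings.

```scala
// Assume headerNames was parsed from a remote request.
val headerNames: Seq[String] = Seq("Accept", "Accept", "Host")

val grouped = headerNames.groupBy(identity) // builds an immutable.HashMap
val unique  = headerNames.distinct          // uses a hash set internally
val asSet   = headerNames.toSet             // immutable.HashSet beyond 4 elements
```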