Worm Hashing - an alternative to Open Addressing [HPPC-176]
Closed this issue · 33 comments
As a hash collision resolution method, we propose an alternative to classical Open Addressing (linear probing): Worm Hashing (a hybrid).
Worm hashing spends more computation during the insertion of an entry (put()) to maintain a chain of hash collisions, stored as an array. It is both primitive-friendly and cache-friendly. The goal is faster entry access (get()) and a higher load factor for the map.
Compared to HPPC IntIntHashMap (below 2M entries):
- put() is 50% slower
- get() is 48% faster
- 11% less average memory usage
Basically this means it is more efficient when there are more get() calls than put() calls, and when there are fewer than 2M entries.
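As a rough illustration of the general idea (not the actual WormMap code: the class name, toy capacity, mixing constant, and free-slot search below are all invented, and the sketch is closer in spirit to classic coalesced hashing), colliding entries can be chained inside the table itself through a parallel byte[] of offsets:

```java
// Toy sketch: chains of colliding entries live inside the table itself,
// linked by a parallel byte[] of slot offsets (0 marks the end of a chain).
// No resizing; assumes the table never fills up.
public class WormSketchMap {
    private static final int CAPACITY = 64;          // toy size, power of two
    private final int[] keys = new int[CAPACITY];
    private final int[] values = new int[CAPACITY];
    private final byte[] next = new byte[CAPACITY];  // offset to next chain entry, 0 = end
    private final boolean[] used = new boolean[CAPACITY];

    private static int bucket(int key) {
        return (key * 0x9E3779B9) >>> 26;            // mix, keep top 6 bits -> [0, 63]
    }

    public void put(int key, int value) {
        int slot = bucket(key);
        if (used[slot]) {
            // Walk the chain; replace the value if the key is already present.
            while (true) {
                if (keys[slot] == key) { values[slot] = value; return; }
                if (next[slot] == 0) break;          // reached the tail
                slot = (slot + next[slot]) % CAPACITY;
            }
            // Link a nearby free slot to the tail (linear search, wrapping).
            int free = slot;
            do { free = (free + 1) % CAPACITY; } while (used[free]);
            next[slot] = (byte) ((free - slot + CAPACITY) % CAPACITY);
            slot = free;
        }
        used[slot] = true;
        keys[slot] = key;
        values[slot] = value;
        next[slot] = 0;
    }

    public int get(int key, int defaultValue) {
        int slot = bucket(key);
        if (!used[slot]) return defaultValue;
        while (true) {
            if (keys[slot] == key) return values[slot];
            if (next[slot] == 0) return defaultValue;
            slot = (slot + next[slot]) % CAPACITY;
        }
    }
}
```

Because the chain is stored as small byte offsets alongside the primitive key array, a lookup touches only compact, contiguous arrays, which is what makes this kind of scheme both primitive- and cache-friendly.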
A presentation is attached with a benchmark comparing various primitive collection libraries (HPPC, FastUtil, Koloboke, Trove).
We have a production-ready implementation of WormMap, with randomized tests, but it is not HPPC compatible. We would like to discuss whether this could be an interesting addition to HPPC (we believe so :)). In that case we could work on making it HPPC compatible.
Issue: HPPC-176 (migrated from JIRA), created by Bruno Roustant (@bruno-roustant), resolved Aug 12 2020
Attachments: Screen Shot 2020-06-22 at 11.00.22.png, Screen Shot 2020-06-22 at 11.00.52.png, set_bench_contains_16_22.csv, Worm Hashing.pdf, WormMap intermediate benchmark.pdf
Comment by Dawid Weiss (@dweiss) (migrated from JIRA)
Hi Bruno. ;) Sure this is interesting! If you have an implementation that could be converted to the current template generator, it could become an alternative implementation of an associative container. I'll take a look at the presentation over the weekend; I recall experimenting with cuckoo hashing but it wasn't too effective, so I'm curious about the improvements you made!
Comment by Dawid Weiss (@dweiss) (migrated from JIRA)
Just to let you know - haven't forgotten about it. Things piled up unexpectedly and pushed off some of the tasks, including this one. I'll get back.
Comment by Dawid Weiss (@dweiss) (migrated from JIRA)
Read the paper - very interesting! I do have some follow-ups:
- Is the load factor provable in adversarial conditions (a stream of keys with constantly conflicting hash values, for example)? Then your chains would just keep growing and growing. I wonder if a random mixing seed would be needed to make this less likely.
- Compared to linear hashing there is an added memory cost of the chaining array (byte[]). While for small hash sizes this is not a problem, for very large sizes it becomes a more significant overhead (for a 75% load capacity of a nearly full int range the memory required would be 25% more than with plain linear addressing). For this reason alone I think this shouldn't be a replacement for the current implementation but an alternative one (so that one can pick it depending on the scenario of use).
- I'd really like to test the current implementation you have in a realistic, non-benchmark setting. Is the implementation available in Elasticsearch somewhere? I'd love to port it to the template but it'd be good to start with some tests before work is put into it.
Comment by Bruno Roustant (@bruno-roustant) (migrated from JIRA)
Is the load factor provable in adversarial conditions?
The load factor depends on the actual stream of keys. If the stream of keys has constantly conflicting hash values, then the probability of not finding a free bucket in the chaining range is higher. So in this case the map capacity will enlarge earlier.
I agree that a random mixing seed would prevent this effect. Also, in the general case a good mixing function helps (maybe less necessary than with regular open addressing, but still important).
Compared to linear hashing there is an added memory cost of the chaining array
Worm hashing uses an additional byte array of the same capacity. This means that for an IntIntHashMap, each slot of an open addressing map needs 8 bytes, while worm hashing needs 9 bytes (+12.5%). However, in the normal case worm hashing enlarges later than at a 75% load factor, rather at above 82%. So we computed the average memory usage as a weighted average over a power-of-two range of sizes: (memory-size-before-enlargement x size-range-before-threshold + memory-size-after-enlargement x size-range-after-threshold) / (size-range-before-threshold + size-range-after-threshold). And the average memory size is slightly in favor of the worm map.
So, yes it is possible to find a specific number of entries for which worm hashing requires more memory, but it is also possible to find the inverse. The average memory size gives a good idea in my opinion.
(I didn't fully understand your example with 25% difference)
The additional byte array has a performance impact for large numbers of entries. In our benchmarks we saw that below 2-3M entries the fast memory (CPU cache) handles a third array seamlessly, and the worm map is effective. Above 2-3M entries we start seeing a degradation, while the single-array implementation Koloboke starts to shine. Interestingly, Koloboke's single array suffers in the 0.7-2M range compared to the others, probably because its array becomes too large to fit in the fast memory.
So yes, I think it is an alternative option, not for memory but rather for performance, due to the 2-3M size barrier.
I'd really like to test the current implementation you have in a realistic, non-benchmark setting.
Sure. For info, our benchmark setting is fully random. We also tried some less randomly distributed streams of keys, with a clustering effect, but the very nice mix method (in FastUtil and Koloboke) makes them actually well distributed. So I don't expect a significant difference.
Aleksandr is currently implementing the template for a WormHashMap. It will be used to run the benchmarks.
Comment by Dawid Weiss (@dweiss) (migrated from JIRA)
The load factor depends on the actual stream of keys. If the stream of keys has constantly conflicting hash values, then the probability of not finding a free bucket in the chaining range is higher.
One particular case when this was a problem in the past was when you copy entries from one map to another - then any hash function produced a series of always-conflicting entries. This is the original idea behind per-instance random remix.
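A minimal sketch of such a per-instance remix (hypothetical code, not HPPC's actual implementation): each container instance draws its own random odd multiplier, so two maps place the same keys in different slot orders, and copying entries from one into the other does not replay the same collision sequence.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical illustration of a per-instance mixing seed. The odd random
// multiplier makes the scrambling a bijection on ints, so no two distinct
// keys ever collide at the mix stage; collisions only appear when the mixed
// value is masked down to a bucket index.
public class PerInstanceMixer {
    private final int seed = ThreadLocalRandom.current().nextInt() | 1; // force odd

    public int mix(int key) {
        int h = key * seed;     // instance-specific scrambling
        return h ^ (h >>> 16);  // fold high bits into low bits before bucket masking
    }
}
```

Two instances created independently will (with overwhelming probability) order the same keys differently, which is the property that defuses the map-copy worst case.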
(I didn't fully understand your example with 25% difference)
I'm talking about overall memory allocation and was thinking of a set of ints (sorry - this was confusing). For a large table, the open addressing one would consume 4*2^N bytes (for a set), whereas this algorithm would require roughly (4 + 1)*2^N (assuming a similar load factor), i.e. 25% more. For an int-int map this overhead goes down proportionally, sure.
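The 25% figure can be checked with back-of-envelope arithmetic (assumptions: 4-byte int keys, a set, the same capacity 2^N for both layouts):

```java
// Back-of-envelope memory comparison: plain open addressing stores only the
// key array, while the chained scheme adds one byte per slot for offsets.
public class MemoryOverhead {
    public static void main(String[] args) {
        long capacity = 1L << 20;                 // N = 20
        long openAddressingBytes = 4 * capacity;  // int keys only
        long wormBytes = (4 + 1) * capacity;      // int keys + byte[] offsets
        long setOverheadPct =
            (wormBytes - openAddressingBytes) * 100 / openAddressingBytes;
        System.out.println(setOverheadPct + "% extra for an int set");
        // For an int-int map each slot already carries 8 payload bytes,
        // so the extra byte is 1/8 = 12.5% instead of 1/4 = 25%.
    }
}
```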
Above 2-3M entries
Yeah - we use much larger associative arrays than this, so I'd be curious to see and report differences on real-world data!
Aleksandr is currently implementing the template for a WormHashMap.
That's awesome! Let me know if you need any help (the code is arguably a bit dated by now).
Comment by Bruno Roustant (@bruno-roustant) (migrated from JIRA)
I like your explanation of the map copy problem in the javadoc of HPPC. It's clear.
real-world data
For the much larger associative arrays, I would definitely use Koloboke based on our benchmarks, or develop a new single-array version of the HPPC map. Although I'd like to try to adapt worm hashing to support this in the future, when I have time...
About real-world data: it does not necessarily mean larger sizes. In my company we mainly see uses of medium-sized maps (less than a couple of million entries) that still need fast lookup because they are accessed intensively.
Comment by Dawid Weiss (@dweiss) (migrated from JIRA)
We (Carrot Search) use HPPC because it's our toy. I know Koloboke and am a big fan but, you know, HPPC is ours. ;)
By real-world data applications I meant Lingo4G - we use large data structures for indexing and analysis of large data sets, so it'd be more telling (to me) how this structure works if I could swap out the default implementation and re-run our software using worm maps.
Comment by Aleksandr Danilin (migrated from JIRA)
Hello Dawid, I am the second person working on WormMap. Currently I am trying to use your template generator to generate all key-value map combinations.
I have run into a problem when I try to generate sources using mvn clean verify. I have also tried the -X and -e flags.
But I keep getting a java.lang.NullPointerException, and the stack trace doesn't have any info about what is causing the exception.
Is there any way to check why this happens? Like logs that tell at what line the error happened and what is wrong with it.
Thank you in advance.
Comment by Dawid Weiss (@dweiss) (migrated from JIRA)
I just tried with JDK 8 and JDK 14 and it worked for me just fine. Can you share that stack trace (or the full Maven debug log)?
Comment by Dawid Weiss (@dweiss) (migrated from JIRA)
Which Java are you using (java -version)? I'm curious because this happens while walking the code tree which we have no control of (antlr generated code).
Comment by Dawid Weiss (@dweiss) (migrated from JIRA)
Or maybe the other way around: can you share that template class you've implemented that causes this exception?
Comment by Aleksandr Danilin (migrated from JIRA)
The JDK version Maven uses is 11.0.7.
And here is a link to the template: https://github.com/Svistoplyaz/hppc/blob/master/hppc/src/main/templates/com/carrotsearch/hppc/KTypeVTypeWormMap.java
Comment by Dawid Weiss (@dweiss) (migrated from JIRA)
I've corrected a bug in the template generator to cater for abstract methods (a missing body caused the NPE). I get other errors when trying to compile the generated code but didn't look into what's causing them – feel free to take a look.
Comment by Dawid Weiss (@dweiss) (migrated from JIRA)
Compiles if I apply these fixes:
diff --git a/hppc/src/main/templates/com/carrotsearch/hppc/KTypeVTypeWormMap.java b/hppc/src/main/templates/com/carrotsearch/hppc/KTypeVTypeWormMap.java
index 7e36b4b..45663d0 100644
--- a/hppc/src/main/templates/com/carrotsearch/hppc/KTypeVTypeWormMap.java
+++ b/hppc/src/main/templates/com/carrotsearch/hppc/KTypeVTypeWormMap.java
@@ -1,15 +1,12 @@
/*! #set($TemplateOptions.ignored = ($TemplateOptions.isKTypeAnyOf("DOUBLE", "FLOAT", "BYTE"))) !*/
package com.carrotsearch.hppc;
-import com.carrotsearch.hppc.cursors.KTypeCursor;
-import com.carrotsearch.hppc.cursors.KTypeVTypeCursor;
-import com.carrotsearch.hppc.predicates.KTypePredicate;
-import com.carrotsearch.hppc.predicates.KTypeVTypePredicate;
-import com.carrotsearch.hppc.procedures.KTypeProcedure;
-import com.carrotsearch.hppc.procedures.KTypeVTypeProcedure;
-
import java.util.*;
+import com.carrotsearch.hppc.cursors.*;
+import com.carrotsearch.hppc.predicates.*;
+import com.carrotsearch.hppc.procedures.*;
+
/**
* A hash map of <code>KType</code> to <code>VType</code>, implemented using open
* addressing with linear probing for collision resolution.
@@ -204,7 +201,7 @@ public class KTypeVTypeWormMap<KType, VType>
}
public VType noValue() {
- /*! #if ($TemplateOptions.KTypeGeneric) !*/
+ /*! #if ($TemplateOptions.VTypeGeneric) !*/
return null;
/*! #else return 0; #end !*/
}
Comment by Aleksandr Danilin (migrated from JIRA)
Thanks, the noValue one is such a lame mistake of mine.
And after I pulled your fix for the template generator, everything generates just fine, thank you :)
Comment by Dawid Weiss (@dweiss) (migrated from JIRA)
No worries at all. It really should be simpler than it is - the Maven-based build in particular is kind of hacky with the generated sources and all. Let me know if you run into any other problems.
Comment by Dawid Weiss (@dweiss) (migrated from JIRA)
Let me know when you have it ready, Aleksandr. I'm curious how it performs (in our code/ use cases).
Comment by Aleksandr Danilin (migrated from JIRA)
Hello, I have made a pull request, so you can check it out.
Comment by Dawid Weiss (@dweiss) (migrated from JIRA)
I've cleaned up the build and it finally works fine for me in IntelliJ. I also pulled in your latest branch and pushed it as a 'wormmap' branch (rebased on top of the Gradle build). Can you take a look and see if this improves your workflow?
I'll be making some more adjustments but my time is really limited, as you can probably tell.
Comment by Aleksandr Danilin (migrated from JIRA)
That's great news. I am going to look at the changes.
And the thing is that Bruno is on vacation for at least this week, so he cannot take a look now. He has made a pull request to my fork with some changes in the map (so there might be some conflicts with your changes) and added a Worm Set.
He found that, for the contains() method, the set generally performs faster than the HPPC, FastUtil, and Koloboke sets. He wants me to review his pull request and add it to the pull request against the original HPPC repository, and I don't know if you need it because I never asked about sets. I have added a file with the benchmark results so you can take a look: set_bench_contains_16_22.csv
Comment by Dawid Weiss (@dweiss) (migrated from JIRA)
Vacation sounds great. :)
If you can create a pull request against the new master (or create a pull request against the "wormmap" branch) then it'd be perfect for me. Both sets and maps are fine, I guess.
Comment by Aleksandr Danilin (migrated from JIRA)
Hello, I finished everything a week ago and I am not sure if you have seen all the changes in the pull request. So I am just notifying you here :)
Comment by Dawid Weiss (@dweiss) (migrated from JIRA)
Branch integrated with master.
Comment by Bruno Roustant (@bruno-roustant) (migrated from JIRA)
Thanks Dawid!
Yes, we are confident in the current implementation (btw I added additional randomized testing in the test class to increase confidence). I hope Aleksandr can share his benchmark report with you at some point so you can get more info.
- I'd like to discuss a couple more points about the random seed for hashing. What is the best medium to discuss that? A new Jira issue?
- Ideally I'd like to add the Set template for this Worm hashing. I'm busy right now (just coming back from vacation) but I'll try to continue on that soon. Do you think it is required for the next release?
Comment by Dawid Weiss (@dweiss) (migrated from JIRA)
No problem at all, my pleasure. As for discussion - Jira is convenient because it has all the history... But feel free to create a github issue if this is easier for you. An implementation of Sets would be welcome too, of course. I don't think it's required for the release but there is no time pressure either...
I'm thinking about ways to somehow trim that 2MB JAR... maybe I'm paranoid though and it doesn't make sense to do it. Will have to give it a thought.
Comment by Bruno Roustant (@bruno-roustant) (migrated from JIRA)
I updated hppc master with a fresh new clone. I tried to open the hppc project with IntelliJ (Ultimate 2020.1.3), selecting the root "hppc" folder, and IntelliJ does not recognize the source classes (it shows the .java files). It is missing the module definitions.
INSTALL.txt says it should open without any additional step. I must be missing something. How should we open the new Gradle project?
Comment by Dawid Weiss (@dweiss) (migrated from JIRA)
I use intellij. Nothing special is needed. Just import as Gradle project and it works for me. Clean and retry? Maybe something went wrong along the way?
Comment by Bruno Roustant (@bruno-roustant) (migrated from JIRA)
Oh, the Gradle plugin in my IntelliJ was disabled. Enabling it fixed the issue.
I can run tests in IntelliJ. However, when I try to run a benchmark I get the following error for JMH: "ERROR: Unable to find the resource: /META-INF/BenchmarkList"
Comment by Dawid Weiss (@dweiss) (migrated from JIRA)
You mean run the benchmark directly from the IDE? I wouldn't do it. JMH does magic things behind the scenes and the IDE will interfere with benchmarking anyway. Better to build the JAR and run it from the console. See INSTALL.txt.
Comment by Bruno Roustant (@bruno-roustant) (migrated from JIRA)
Thanks, it works fine. I like this new Gradle build speed very much, and the doc for running the benchmark jars!
Note: INSTALL.txt says that to build the benchmarks jar you should use
./gradlew benchmarks
but it's
./gradlew benchmark