Benchmark Issue: zipf distribution has itself locality
Closed this issue · 0 comments
JigaoLuo commented
"Locality"
The zipf distribution (https://en.wikipedia.org/wiki/Zipf%27s_law) has a hotspot at the begin of the value range.
Example:
- check the figure in the wiki.
- a generated sequence from the zipf law after sorting would be something like: 1,1,1,1,1,2,2,2,3,3,4,5,6
With Locality, I mean, the zipf distribution is heavily biased with several first numbers of the value range.
Example:
- Values 1,2,3,4 take ~20% of a sequence generated by zipf factor 1 in [1, 1M].
However, it is highly possible, the values 1,2,3,4 are stored in an ART Node4.
The native bais of zipf law brings an additional locality into the ART index.
Easy Fix
/// lookup_keys is declared as: uint64_t* lookup_keys=new uint64_t[n];
/// After the declaration, lookup_keys is filled with zipf distribution
/// An example of lookup_keys: 1,1,2,1,1,1,2,2,1,3,2,3
std::sort(lookup_keys, lookup_keys + n);
std::random_device rd;
std::mt19937 gen(rd());
std::uniform_int_distribution<> uni_distrib(1, n);
for (uint64_t i=0;i<n;) {
const uint64_t before = lookup_keys[i];
const uint64_t after = static_cast<uint64_t>(uni_distrib(gen));
while (lookup_keys[i] == before) {
lookup_keys[i] = after;
++i;
}
}
std::random_shuffle(lookup_keys, lookup_keys + n);
/// An example of lookup_keys: 6,6,4,6,6,6,4,4,6,10,4,10
After the fix, the locality is removed with the zipf hotness kept. In another word, each zipf hotness value is independent.