JigaoLuo/duckdb

Benchmark Issue: zipf distribution has itself locality

Closed this issue · 0 comments

"Locality"

The zipf distribution (https://en.wikipedia.org/wiki/Zipf%27s_law) has a hotspot at the begin of the value range.
Example:

  • check the figure in the wiki.
  • a generated sequence from the zipf law after sorting would be something like: 1,1,1,1,1,2,2,2,3,3,4,5,6

With Locality, I mean, the zipf distribution is heavily biased with several first numbers of the value range.
Example:

  • Values 1,2,3,4 take ~20% of a sequence generated by zipf factor 1 in [1, 1M].

However, it is highly possible, the values 1,2,3,4 are stored in an ART Node4.
The native bais of zipf law brings an additional locality into the ART index.

Easy Fix

    /// lookup_keys is declared as: uint64_t* lookup_keys=new uint64_t[n];
    /// After the declaration, lookup_keys is filled with zipf distribution
    /// An example of lookup_keys: 1,1,2,1,1,1,2,2,1,3,2,3

    std::sort(lookup_keys, lookup_keys + n);
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> uni_distrib(1, n);
    for (uint64_t i=0;i<n;) {
        const uint64_t before = lookup_keys[i];
        const uint64_t after = static_cast<uint64_t>(uni_distrib(gen));
        while (lookup_keys[i] == before) {
            lookup_keys[i] = after;
            ++i;
        }
    }
    std::random_shuffle(lookup_keys, lookup_keys + n);

   /// An example of lookup_keys: 6,6,4,6,6,6,4,4,6,10,4,10

After the fix, the locality is removed with the zipf hotness kept. In another word, each zipf hotness value is independent.