chad-m/head_tail_breaks_algorithm

Why 0.4?

YYsong opened this issue · 1 comment

I have a question after reading the paper and your code.
I can't find a mathematical definition of a heavy-tailed distribution in the paper.
However, the figure on page 3 shows only 10 percent of the data values in the head.
So why did you set the threshold to 0.4?

thx

See page 4 of the article for a discussion about the mathematics of heavy-tailed distributions. If that is not sufficient, see these references:

===========================================
Regarding the 40% rule of thumb: the figure on page 3 that you referenced belongs to a general discussion of heavy-tailed distributions and how they compare to more traditionally used distributions (e.g. Gaussians). It does not describe the implementation details of the algorithm.

The 40% rule is a simplification of the actual algorithm that works well in practice. The algorithm stops when the head group is no longer characterized by a heavy-tailed distribution. So if roughly 40% or more of the data is in the head (60% or less in the tail) after a split, the data are most likely not heavy-tailed. You could certainly make this test much more precise, but it often does not matter in practice.
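To make the stopping rule concrete, here is a minimal sketch of how the 40% check can drive the recursion. It is an illustration, not necessarily the exact code in this repository. It assumes the usual head/tail split, where the head is every value strictly above the mean; the names `head_tail_breaks` and `head_limit` are made up for this example.

```python
def head_tail_breaks(values, head_limit=0.4):
    """Return the means used as class breaks (illustrative sketch only).

    Assumption: the "head" is every value strictly above the mean.
    `head_limit` is a hypothetical parameter name for the 40% rule of thumb.
    """
    breaks = []
    data = list(values)
    while data:
        data_mean = sum(data) / len(data)
        breaks.append(data_mean)
        head = [v for v in data if v > data_mean]
        # Stop when the head is too small to split again, or when it holds
        # about 40% or more of the data (no longer looks heavy-tailed).
        if len(head) <= 1 or len(head) / len(data) >= head_limit:
            break
        data = head  # keep splitting the head
    return breaks


if __name__ == "__main__":
    skewed = [1, 1, 1, 1, 1, 1, 1, 2, 4, 16]  # few large values, many small
    uniform = list(range(1, 11))              # evenly spread values
    print(head_tail_breaks(skewed))
    print(head_tail_breaks(uniform))
```

With the skewed list, the first split puts 20% of the values in the head, so the sketch splits again. With the evenly spread list, half of the values land in the head, so it stops after a single break. Changing `head_limit` shows how sensitive the number of classes is to the threshold.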