Possibly mismatched zipfFactor and avgError in statistics/cmsketch_test.go
czt1999 opened this issue · 1 comments
Hi!
While struggling with TestCMSketch, I noticed that values generated with a lower zipfFactor are more dispersed (and thus lead to more collisions), yet the test pairs them with a lower expected avgErr. The greater dispersion follows from the definition of the Zipf distribution.
I pass case 2 (zipfFactor = 2) and case 3 (zipfFactor = 3), both with an avgErr of 0. I am sure that the hash results for different rows are independent, and I have tried many groups of hash functions. However, every one of these groups fails case 1 (zipfFactor = 1.1) with an avgErr of 10, which really upsets me :)
The numbers of distinct values generated for each zipfFactor are as follows:
zipfFactor: 1.1 len(lMap): 23965
zipfFactor: 2 len(lMap): 420
zipfFactor: 3 len(lMap): 58
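A minimal sketch of how such counts can be reproduced, assuming the values come from Go's math/rand Zipf generator, where P(k) is proportional to (v + k)^(-s). The helper name distinctCount, the seed, the sample size, and the v and imax arguments are illustrative assumptions, not the values used by the actual test, so the printed counts will differ from the ones above.

```go
package main

import (
	"fmt"
	"math/rand"
)

// distinctCount draws n samples from a Zipf distribution with exponent s
// and returns how many distinct values appeared.
// The parameters v = 1 and imax = 100000000 are assumptions for illustration.
func distinctCount(s float64, n int, seed int64) int {
	r := rand.New(rand.NewSource(seed))
	z := rand.NewZipf(r, s, 1, 100000000)
	seen := make(map[uint64]struct{})
	for i := 0; i < n; i++ {
		seen[z.Uint64()] = struct{}{}
	}
	return len(seen)
}

func main() {
	// A smaller exponent gives a heavier tail, hence more distinct values.
	for _, s := range []float64{1.1, 2, 3} {
		fmt.Printf("zipfFactor: %v distinct: %d\n", s, distinctCount(s, 1000000, 1))
	}
}
```

With the same number of samples, the smaller exponent spreads the mass over far more values, which is why the sketch sees many more colliding keys in case 1.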
It seems that a lower zipfFactor should be coupled with a higher avgErr, and the supposedly incorrect avgErr of 10 that I got in case 1 looks reasonable given width = 2048. Is that wrong? I would really appreciate some advice.
Thanks!
My mistake!
TiDB uses Count-Mean-Min Sketch rather than the original CMS, which is mentioned at the end of this blog. So this was a silly question caused by my own oversight :)
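For anyone who hits the same problem, here is a rough sketch of the difference between the two query methods. It is a toy illustration under my own assumptions, not TiDB's code: the type sketch, the slot mixing function, and the names queryMin and queryMeanMin are all hypothetical. The Count-Mean-Min query subtracts the expected collision noise, (total - counter) / (width - 1), from each row's counter, takes the median of the results, and caps it at the plain Count-Min estimate.

```go
package main

import (
	"fmt"
	"sort"
)

// sketch is a toy counter table; all names here are illustrative.
type sketch struct {
	depth, width uint64
	total        uint64     // number of inserted items
	table        [][]uint64 // depth x width counters
}

func newSketch(depth, width uint64) *sketch {
	t := make([][]uint64, depth)
	for i := range t {
		t[i] = make([]uint64, width)
	}
	return &sketch{depth: depth, width: width, table: t}
}

// slot maps (row, value) to a column with a simple 64-bit mixer.
func (s *sketch) slot(row, v uint64) uint64 {
	x := v + (row+1)*0x9E3779B97F4A7C15
	x ^= x >> 30
	x *= 0xBF58476D1CE4E5B9
	x ^= x >> 27
	x *= 0x94D049BB133111EB
	x ^= x >> 31
	return x % s.width
}

func (s *sketch) insert(v uint64) {
	s.total++
	for i := uint64(0); i < s.depth; i++ {
		s.table[i][s.slot(i, v)]++
	}
}

// queryMin is the classic Count-Min estimate: the smallest counter.
func (s *sketch) queryMin(v uint64) uint64 {
	lo := s.table[0][s.slot(0, v)]
	for i := uint64(1); i < s.depth; i++ {
		if c := s.table[i][s.slot(i, v)]; c < lo {
			lo = c
		}
	}
	return lo
}

// queryMeanMin removes the expected noise from each row, takes the
// median across rows, and never exceeds the Count-Min estimate.
func (s *sketch) queryMeanMin(v uint64) uint64 {
	vals := make([]uint64, s.depth)
	for i := uint64(0); i < s.depth; i++ {
		c := s.table[i][s.slot(i, v)]
		noise := (s.total - c) / (s.width - 1)
		if c > noise {
			vals[i] = c - noise
		}
	}
	sort.Slice(vals, func(a, b int) bool { return vals[a] < vals[b] })
	med := vals[(s.depth-1)/2] + (vals[s.depth/2]-vals[(s.depth-1)/2])/2
	if lo := s.queryMin(v); med > lo {
		med = lo
	}
	return med
}

func main() {
	s := newSketch(5, 2048)
	for i := 0; i < 1000; i++ {
		s.insert(1) // one frequent value
	}
	for v := uint64(2); v <= 20001; v++ {
		s.insert(v) // many values that occur once
	}
	fmt.Println("count-min:", s.queryMin(1), s.queryMin(12345))
	fmt.Println("count-mean-min:", s.queryMeanMin(1), s.queryMeanMin(12345))
}
```

Plain Count-Min can only overestimate, so with many distinct values the rare ones pick up collision noise; the noise correction is what lets the expected avgErr stay low even for a small zipfFactor.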