talent-plan/tinysql

Possibly mismatched zipfFactor and avgError in statistics/cmsketch_test.go

czt1999 opened this issue · 1 comments

Hi!

While struggling with TestCMSketch, I noticed something interesting: values generated with a lower zipfFactor are more dispersed (and thus lead to more collisions), yet the test expects a lower avgErr for them. The dispersion follows from the definition of the Zipf distribution.

I have passed case 2 (zipfFactor = 2) and case 3 (zipfFactor = 3), both with an avgErr of 0. I am sure that the hash results for different rows are independent, and I have tried many sets of hash functions. However, every set failed case 1 (zipfFactor = 1.1) with an avgErr of 10, which really upsets me :)

The numbers of distinct values generated for each zipfFactor are as follows:

zipfFactor: 1.1  len(lMap): 23965
zipfFactor: 2  len(lMap): 420
zipfFactor: 3  len(lMap): 58
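
For anyone who wants to reproduce this kind of count, here is a minimal sketch. It assumes the values come from `rand.NewZipf` in Go's `math/rand`; the helper name `countDistinct`, the sample count, the upper bound, and the seed are all made up for illustration and are not taken from the test, so the exact numbers will differ from the ones above.

```go
package main

import (
	"fmt"
	"math/rand"
)

// countDistinct draws n samples from a Zipf distribution with exponent s
// over [0, imax] and returns how many distinct values appeared.
// Hypothetical helper: n, imax, and seed are illustrative choices.
func countDistinct(s float64, n int, imax uint64, seed int64) int {
	r := rand.New(rand.NewSource(seed))
	z := rand.NewZipf(r, s, 1, imax) // NewZipf requires s > 1 and v >= 1
	seen := make(map[uint64]struct{})
	for i := 0; i < n; i++ {
		seen[z.Uint64()] = struct{}{}
	}
	return len(seen)
}

func main() {
	for _, s := range []float64{1.1, 2, 3} {
		fmt.Printf("zipfFactor: %v  distinct: %d\n", s, countDistinct(s, 100000, 1000000, 1))
	}
}
```

A smaller exponent gives the distribution a heavier tail, so more distinct values show up and more of them have to share counters in the sketch.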

It seems that a lower zipfFactor should be coupled with a higher avgErr, and the supposedly incorrect avgErr of 10 I got in case 1 looks reasonable given width = 2048. Is that wrong? I would really appreciate some advice.
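
As a rough back-of-the-envelope for the plain count-min estimate, assume each row's hash is uniform over the $w$ columns. The symbols are my own notation, not the test's: $N$ is the total number of inserted rows, $f_x$ is the true count of value $x$, and $C_i[h_i(x)]$ is the counter that $x$ maps to in row $i$. Every other value lands in the same counter with probability $1/w$, so the expected overcount in a single row is

$$\mathbb{E}\big[C_i[h_i(x)]\big] - f_x = \frac{N - f_x}{w}.$$

Count-min takes the minimum over the rows, so its error is the overcount of the luckiest row. It still grows with $N/w$ when many distinct values share the counters.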

Thanks!

My mistake!

TiDB uses Count-Mean-Min Sketch rather than the original CMS, which is mentioned at the end of this blog. So this was a silly question caused by my oversight :)
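
For later readers, a minimal sketch of how the two estimators differ. This is not TiDB's actual code: the type and method names, the FNV-based row hashing, and the noise term `(total - counter) / (width - 1)` are assumptions for illustration, following the usual description of count-mean-min (subtract each row's estimated noise, take the median, and cap it at the count-min estimate).

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
	"sort"
)

// sketch is a toy count-min table; all names here are illustrative.
type sketch struct {
	depth, width int
	total        uint64     // number of inserted items
	table        [][]uint64 // depth rows of width counters
}

func newSketch(depth, width int) *sketch {
	t := make([][]uint64, depth)
	for i := range t {
		t[i] = make([]uint64, width)
	}
	return &sketch{depth: depth, width: width, table: t}
}

// pos hashes (row, key) with FNV-1a so that each row uses a different mapping.
func (s *sketch) pos(row int, key uint64) int {
	var buf [16]byte
	binary.LittleEndian.PutUint64(buf[:8], uint64(row))
	binary.LittleEndian.PutUint64(buf[8:], key)
	h := fnv.New64a()
	h.Write(buf[:])
	return int(h.Sum64() % uint64(s.width))
}

func (s *sketch) insert(key uint64) {
	s.total++
	for i := 0; i < s.depth; i++ {
		s.table[i][s.pos(i, key)]++
	}
}

// queryMin is the classic count-min estimate: the smallest counter over all rows.
func (s *sketch) queryMin(key uint64) uint64 {
	m := s.table[0][s.pos(0, key)]
	for i := 1; i < s.depth; i++ {
		if c := s.table[i][s.pos(i, key)]; c < m {
			m = c
		}
	}
	return m
}

// queryMeanMin subtracts each row's estimated noise (clamped at zero),
// takes the median over rows, and caps it at the count-min estimate.
// It requires width > 1.
func (s *sketch) queryMeanMin(key uint64) uint64 {
	vals := make([]uint64, s.depth)
	for i := range vals {
		c := s.table[i][s.pos(i, key)]
		noise := (s.total - c) / uint64(s.width-1)
		if c > noise {
			vals[i] = c - noise
		}
	}
	sort.Slice(vals, func(a, b int) bool { return vals[a] < vals[b] })
	lo, hi := vals[(s.depth-1)/2], vals[s.depth/2]
	est := lo + (hi-lo)/2
	if m := s.queryMin(key); est > m {
		est = m
	}
	return est
}

func main() {
	s := newSketch(5, 64)
	for k := uint64(0); k < 10000; k++ {
		s.insert(k) // every key appears exactly once
	}
	fmt.Println("count-min:", s.queryMin(7), "count-mean-min:", s.queryMeanMin(7))
}
```

With many distinct low-frequency values, the count-min estimate is inflated by collisions, while the noise-corrected estimate stays closer to the true count. That would explain why the test can expect a small avgErr even for zipfFactor = 1.1.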