K-S for 2 samples - possible issue?

themoabird opened this issue

Hi

I'm not sure if this is a real issue, but I'm using the K-S test for 2 samples to check whether two samples are compatible.

I've found that if I use identical samples for sample a and sample b, it sometimes tells me the samples are not compatible (i.e., it reports a low probability that both samples were drawn from the same underlying distribution).

I don't know enough about how the K-S test works to have an idea about whether that makes any sense, but it's certainly counterintuitive...

Sorry if I'm just wasting your time by flagging up a non-issue!

Thanks so much for reporting this issue! This may be a real bug, but I'm not entirely sure; I'll need to go back and study some theory, and there's a little follow-up from you that would be helpful.

Computing the 2-sample KS D-statistic involves measuring the maximum distance between two EDFs (https://en.wikipedia.org/wiki/Empirical_distribution_function). Because there is a step discontinuity in the EDF at each data point, there is some ambiguity in how to measure the distance between the two EDFs at those points: should you measure from the bottom or the top of the step? Since the D-statistic is defined as the maximum distance, I wrote the code to always resolve that ambiguity by taking the largest possible distance. Given two identical EDFs, that means we don't get D = 0, but instead D = 1/n, where n is the number of points. I need to go back and study the theory to see whether this is the right choice.
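For concreteness, here is a minimal sketch of that convention. This is an illustration, not the library's actual implementation: at each data point, each EDF is evaluated both just below the step and at the top of the step, and D is the largest discrepancy found anywhere.

```csharp
using System;
using System.Linq;

public static class KsSketch
{
    // Illustration of the "largest possible distance" convention described
    // above; not the Meta.Numerics implementation.
    public static double TwoSampleD(double[] a, double[] b)
    {
        // EDF evaluated just below x (bottom of the step) and at x (top of the step).
        double Below(double[] s, double x) => s.Count(v => v < x) / (double)s.Length;
        double At(double[] s, double x) => s.Count(v => v <= x) / (double)s.Length;

        double d = 0.0;
        foreach (double x in a.Concat(b))
        {
            // Compare each side of one EDF's step against the other EDF,
            // keeping the largest discrepancy seen so far.
            d = Math.Max(d, Math.Abs(Below(a, x) - At(b, x)));
            d = Math.Max(d, Math.Abs(At(a, x) - Below(b, x)));
            d = Math.Max(d, Math.Abs(At(a, x) - At(b, x)));
        }
        return d;
    }
}
```

For two identical samples of n distinct values this gives D = 1/n, as described above; with tied values the EDF step height is (tie count)/n, so identical samples with repeats give a correspondingly larger D.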

Considering this, it's to be expected that you get D > 0 for two identical samples, but I am still surprised that you ever get a small P. I haven't been able to construct any example which yields a small P. Could you send me a repro?

Even if this does turn out to be a bug in the identical-sample case, I wouldn't worry about the reliability of the method for real data. This behavior appears to be a (perhaps undesirable) side effect of the distance definition in the corner case of identical data, but it should have no impact on real, continuous data from separate samples.

Hi - Thanks for responding.

Try this dataset as both Sample A & Sample B.

18,15,18,16,17,15,14,14,14,15,15,14,15,14,22,18,21,21,10,10

It gives me D = 0.25, calling it like this:

var ksTest = Univariate.KolmogorovSmirnovTest(List1, List2);

where both lists are the same (obviously).
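For reference, here's a self-contained version of that repro. The call is exactly as above; the property names used to read out D and P are an assumption, since different Meta.Numerics versions expose the test result slightly differently:

```csharp
using System;
using System.Collections.Generic;
using Meta.Numerics.Statistics;

public static class Repro
{
    public static void Main()
    {
        double[] data = { 18, 15, 18, 16, 17, 15, 14, 14, 14, 15,
                          15, 14, 15, 14, 22, 18, 21, 21, 10, 10 };

        // Two separate lists containing identical values.
        var list1 = new List<double>(data);
        var list2 = new List<double>(data);

        var ksTest = Univariate.KolmogorovSmirnovTest(list1, list2);

        // Assumed property names; check against your installed version.
        Console.WriteLine($"D = {ksTest.Statistic}, P = {ksTest.Probability}");
    }
}
```

Given the five-fold ties at 14 and 15 in this dataset, the step convention sketched earlier would give D = 5/20 = 0.25 for identical samples, consistent with the value reported here.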

If you then duplicate the dataset (i.e., the same numbers duplicated in both samples), p gets smaller (as expected, I'd guess: a bigger sample size means less randomness, and D doesn't change). Duplicate it again, and you eventually end up with a significant p (i.e., the test reports that the two identical samples are incompatible).

Isn't that inevitable if D > 0 (i.e., there would be some sample size at which even a very small D becomes significant, given that in this case D isn't changing)?
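That intuition can be checked against the standard large-sample approximation, in which D is scaled by sqrt(n*m/(n+m)) and compared to the asymptotic Kolmogorov distribution. (This may not be exactly what Meta.Numerics does internally; it may use an exact finite-sample distribution.) Holding D = 0.25 fixed while doubling both samples drives P toward zero:

```csharp
using System;

public static class KsAsymptotics
{
    // Complementary CDF of the asymptotic Kolmogorov distribution:
    // Q(lambda) = 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 lambda^2).
    public static double KolmogorovQ(double lambda)
    {
        double sum = 0.0;
        for (int k = 1; k <= 100; k++)
        {
            double term = Math.Exp(-2.0 * k * k * lambda * lambda);
            sum += (k % 2 == 1) ? term : -term;
            if (term < 1e-16) break;
        }
        return 2.0 * sum;
    }

    public static void Main()
    {
        const double d = 0.25; // fixed D, as in the repro above
        for (int n = 20; n <= 320; n *= 2)
        {
            // Two equal samples of size n: sqrt(n*n/(n+n)) = sqrt(n/2).
            double lambda = d * Math.Sqrt(n / 2.0);
            Console.WriteLine($"n = {n,4}: approx P = {KolmogorovQ(lambda):G3}");
        }
    }
}
```

So under that approximation, any fixed D > 0 does eventually become significant as the sample sizes grow.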

I need to caveat all this by saying that I really don't know what I'm doing: I have a very basic understanding of statistics, and it's also possible I messed up my programming and D isn't 0.25 at all! :)

Thanks again!

Meta.Numerics is a very, very cool thing. :)