Fragile test of SeedSpecification

Question

Fragile test of SeedSpecification

ashawley opened this issue 4 years ago · 3 comments

I saw a curious failure on GitHub Actions that I hadn't seen before:

[info] failing seed for Seed.longs are evenly-distributed is -aQhCwV8oBUYlWwee46hAk7cDCysDENMKQKEw-FnwEC=
[info] ! Seed.longs are evenly-distributed: Falsified after 368 passed tests.
[info] > ARG_0: Seed.fromBase64("mngJYod-N9mPAhd861EV4YJQcckrt-vaCc8Eabb013H=")
[info] > ARG_1: Base(53)
[error] Failed: Total 8, Failed 1, Errors 0, Passed 7
[error] Failed tests:
[error]  org.scalacheck.rng.SeedSpecification

Answer 1 · 2020-11-06T17:21:25.000Z

Draft PR #711 shows that the seed fails.

I believe this is a probabilistic property test, so it's probably acceptable to fail rarely.

Answer 2 · 2020-11-07T04:59:37.000Z

I re-read the test. I think the predicate has a random 1-in-1.7M chance of failing each time we run it, but since we evaluate the forAll 100 times, we really have a 1-in-1700 chance of failing each time we run SeedSpecification. Since we tend to run CI over a lot of different targets this means we have more like a 1-in-170 (or so) chance of failing.

We can reduce the chances of false positives at the expense of test accuracy: right now we expect the random value we produce to be within 5 standard deviations of the expected result (https://en.wikipedia.org/wiki/Standard_deviation#Rules_for_normally_distributed_data).

If we used 6 standard deviations instead of 5, we'd expect the test to randomly fail only 1-in-506.8M (about 300x less often). We could also do two separate tests and require both to fail: requiring two consecutive failures at 4 standard deviations has a 1-in-249M chance of failing (1-in-15,787 x 1-in-15787).

Do you think we should change it? I'm fine closing your PR and this issue, or making that change (which could obscure real bugs but will also cut down on noise).

Answer 3 · 2020-11-10T14:19:03.000Z

I was just wondering if your probability test had a flaw or if the the issue was with the random number generator. A failure did seem to be super rare over the last 4 years as you described. I'm fine with closing this.