pescadores/pescador

Guaranteeing Mux statistics / behavior

Closed this issue · 17 comments

tl;dr: I think our Mux tests need to be more rigorous to make sure we're getting back the statistics one would expect from the parameters used to initialize the object.

I was looking into #79, and decided to test that this issue was resolved in the midst of a broader refactor. While things mostly worked, I observed some behavior I didn't immediately understand. Basically, if I ended up with a bad RNG state, I'd end up with two duplicate streams of one sub-mux, and nothing from the other.

Okay, user error -- increasing the oversampling (k=2 -> 10, max_iter=10->100) "fixes" it, but this is a dumb hack and so I tried to do the correct thing, i.e. follow the docstring:

If with_replacement is False, setting revive=True will re-insert previously exhausted streams into the candidate set. This configuration allows a stream to be active at most once at any time.

However, this also didn't give me the behavior I wanted (at least, not always). Checking the tests, it seems that test_mux_revive doesn't inspect the stream it gets back, and I'm suspicious that Mux might not always be doing the thing we expect.
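For concreteness, here's a hypothetical sketch of the kind of stricter check I have in mind -- the test name is made up, and it just mirrors the Mux configuration style used elsewhere in this thread:

import collections
import pescador

def test_mux_revive_checks_samples():
    # Two finite streamers; with revive=True and with_replacement=False,
    # each stream should be active at most once at any time, but both
    # should keep contributing samples as they are revived.
    ab = pescador.Streamer('ab')
    cd = pescador.Streamer('cd')
    mux = pescador.Mux([ab, cd], k=2, rate=None,
                       with_replacement=False, revive=True,
                       prune_empty_streams=False)

    counts = collections.Counter(mux.iterate(max_iter=100))

    # Inspect what actually came back, not just that iteration finished.
    assert set(counts) == set('abcd')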

I'm going to explore with more testing, but in parallel -- is this perhaps surprising or consistent with others' observations? @bmcfee, @cjacoby?

Okay, a few quick pre-bedtime findings:

  1. Mux-of-Mux problems seem to be an issue of repeat iteration, i.e. iterating a sub-mux before nesting it in another mux.

For example, this causes problems:

import pescador

abc = pescador.Streamer('abc')
xyz = pescador.Streamer('xyz')
mux1 = pescador.Mux([abc, xyz], k=2, rate=None, revive=True,
                    with_replacement=False, prune_empty_streams=False)
samples1 = list(mux1.iterate(max_iter=1000))  # Fine

n123 = pescador.Streamer('123')
n456 = pescador.Streamer('456')
mux2 = pescador.Mux([n123, n456], k=2, rate=None, revive=True,
                    with_replacement=False, prune_empty_streams=False)
samples2 = list(mux2.iterate(max_iter=1000))  # Fine

mux3 = pescador.Mux([mux1, mux2], k=2, rate=None,
                    with_replacement=False, revive=True,
                    prune_empty_streams=False)
samples3 = list(mux3.iterate(max_iter=1000))  # ...hangs?

But this is fine:

abc = pescador.Streamer('abc')
xyz = pescador.Streamer('xyz')
mux1 = pescador.Mux([abc, xyz], k=2, rate=None, revive=True,
                    with_replacement=False, prune_empty_streams=False)

n123 = pescador.Streamer('123')
n456 = pescador.Streamer('456')
mux2 = pescador.Mux([n123, n456], k=2, rate=None, revive=True,
                    with_replacement=False, prune_empty_streams=False)

mux3 = pescador.Mux([mux1, mux2], k=2, rate=None,
                    with_replacement=False, revive=True,
                    prune_empty_streams=False)
samples3 = list(mux3.iterate(max_iter=1000000))  # success!

In lieu of fixing this (or identifying what's up), it seems like best practice is "don't iterate over streamers prematurely."

  2. Global distributions seem fine.
>>> import collections
>>> print(collections.Counter(samples3))
Counter({'5': 83404, '4': 83404, '6': 83403, 'z': 83343, 'y': 83343, 'x': 83343, 
'a': 83305, 'b': 83305, 'c': 83304, '1': 83282, '3': 83282, '2': 83282})
>>> print(samples3[:10])
['1', '4', 'x', 'a', '2', 'y', 'z', '5', '3', '6']

There are within-bag correlations, but note that this is because there's no shuffling in the iterables used to initialize the underlying streamers.
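(For illustration, a shuffled variant of the underlying streamers would look something like this -- _shuffled is just a hypothetical helper, not part of pescador:)

import random
import pescador

def _shuffled(vals):
    # Yield the values in a fresh random order on each pass, so that
    # consecutive samples within one stream are decorrelated.
    vals = list(vals)
    while True:
        random.shuffle(vals)
        for v in vals:
            yield v

abc = pescador.Streamer(_shuffled, 'abc')
xyz = pescador.Streamer(_shuffled, 'xyz')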

  3. Mux is pretty flexible, and we'll want to have some good examples of how to use / not use the parameters. I (mostly) know what I'm doing, and it seems easy to configure the mux with statistically bad settings.

Basically, if I ended up with a bad RNG state, I'd end up with two duplicate streams of one sub-mux,

This should only happen if you have with_replacement=True, in which case, this is expected behavior, not bad state.

As for your second post, it's gonna take me a while to digest.

I definitely have seen problems like this in the past month or so. 👍 to identifying the problem with tests and then fixing the behavior.

Ditto what @bmcfee said on digesting. Gonna have to review that a couple times.

Continuing on with my inquiry into the behavior of Mux: I've added a test in a PR, but I'll reproduce the guts of it here for ease of reference:

import collections
import random

import pescador

def _choice(vals):
    while True:
        yield random.choice(vals)

ab = pescador.Streamer(_choice, 'ab')
cd = pescador.Streamer(_choice, 'cd')
ef = pescador.Streamer(_choice, 'ef')
mux1 = pescador.Mux([ab, cd, ef], k=1, rate=2,
                    with_replacement=False, revive=True)

gh = pescador.Streamer(_choice, 'gh')
ij = pescador.Streamer(_choice, 'ij')
kl = pescador.Streamer(_choice, 'kl')

mux2 = pescador.Mux([gh, ij, kl], k=1, rate=2,
                    with_replacement=False, revive=True)

mux3 = pescador.Mux([mux1, mux2], k=2, rate=None,
                    with_replacement=False, revive=True)
samples = list(mux3.iterate(max_iter=10000))
count = collections.Counter(samples)
max_count, min_count = max(count.values()), min(count.values())
assert (max_count - min_count) / max_count < 0.2
print(count)
assert set('abcdefghijkl') == set(count.keys())

Interestingly, the counter after 10k iterations looks like this:

Counter({'c': 880, 'g': 875, 'j': 874, 'h': 854, 'e': 842, 'l': 837, 
         'i': 835, 'k': 814, 'f': 809, 'b': 804, 'd': 791, 'a': 785})

which has a max discrepancy ((max - min) / max) of just over 10%. As we'd hope, the sampling becomes more uniform as max_iter increases: empirically, the discrepancy drops to roughly 3% and 1% for 100k and 1M iterations, respectively.
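(For concreteness, that discrepancy is just arithmetic on the counts printed above:)

max_count, min_count = 880, 785  # 'c' and 'a' from the Counter above
print((max_count - min_count) / max_count)  # ~0.108, i.e. just over 10%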

I'm feeling more confident about this, but curious what folks think?

👍

A while back (a month or two ago), I saw some weird behavior wrt this, but then I lost it and haven't run into it since. I made some experimental tests much like this at the time and saw basically the same result, but generally speaking I'd be pro having tests like this in the codebase (possibly with fixed random seeds for reproducibility; see the sketch below).

Not sure if I'm helping with your question.
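(To sketch what I mean by seeding -- this assumes Mux's random_state parameter behaves the way I remember, so treat it as an assumption rather than a spec:)

import pescador

ab = pescador.Streamer('ab')
cd = pescador.Streamer('cd')

# With a fixed random_state (assuming Mux accepts one), the sampled
# sequence should be deterministic, so a test can assert on exact output
# rather than loose tolerances.
mux = pescador.Mux([ab, cd], k=2, rate=None,
                   with_replacement=False, revive=True,
                   prune_empty_streams=False,
                   random_state=20170622)
samples = list(mux.iterate(max_iter=50))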

So... is the consensus then that we're fine here? I don't quite have the bandwidth to think through all the details, but it seems good to me.

EDIT: for comparison purposes, it would be handy to have a baseline that uses a flat mux, all seeds active, no replacement, and no rate parameter, so that it's really just sampling uniformly from the entire pool. That way we can check the discrepancy in the convergence to uniform between the hierarchical mux and the flat mux.
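Roughly, I'm picturing a baseline like this (just a sketch, reusing the _choice helper from the test above):

import collections
import random
import pescador

def _choice(vals):
    while True:
        yield random.choice(vals)

# Flat baseline: all six seeds active at once, no replacement, no rate,
# so the mux is effectively sampling uniformly over the whole pool.
streamers = [pescador.Streamer(_choice, pair)
             for pair in ['ab', 'cd', 'ef', 'gh', 'ij', 'kl']]
flat = pescador.Mux(streamers, k=len(streamers), rate=None,
                    with_replacement=False, revive=True)

counts = collections.Counter(flat.iterate(max_iter=10000))
max_count, min_count = max(counts.values()), min(counts.values())
print((max_count - min_count) / max_count)  # compare against the hierarchical mux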

I would say we're mostly there as of #88, but we're missing the test you describe (which I don't fully grok?) and one that runs multiple streams to exhaustion, verifying that all samples are recovered.

Oh, I wasn't proposing a test, just a simplified comparison like the one you described above but with fewer moving parts.

"Definition of done" includes state-based testing, and inspecting internal mux states

After merging #100 , is there anything left for this wrt internal states? Or is that blocked by #94 ?

Ya, I think we've gotten as far as we can without #94.

Ok, so should we close this out? Punt it up to 2.0?

Made the executive decision to punt the issue upward.

Update wrt #106: I cooked up a notebook here to empirically measure the statistics (entropy) produced by a poissonmux. Everything seems to behave as expected -- I don't think we necessarily need to add unit tests here. What do y'all think?
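(Roughly the kind of measurement the notebook does -- sketched here with a plain Mux and made-up parameters, not the exact notebook code:)

import collections
import math
import pescador

streamers = [pescador.Streamer(s) for s in ['ab', 'cd', 'ef']]
mux = pescador.Mux(streamers, k=2, rate=None,
                   with_replacement=False, revive=True,
                   prune_empty_streams=False)

# Empirical entropy of the sample distribution vs. the uniform maximum.
counts = collections.Counter(mux.iterate(max_iter=10000))
total = sum(counts.values())
entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
print(entropy, math.log2(len(counts)))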

Perhaps we could create another repo in the pescadores team to contain things like this (regression tests, kinda)? Perhaps have it save actual numbers to a .json file as well, so you can compare current vs saved.
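(Something along these lines, say -- the file name and tolerance here are just placeholders:)

import json

def compare_to_saved(counts, path='mux_reference_counts.json', tol=0.05):
    # Compare current per-symbol frequencies against a saved reference run
    # and report any symbol whose frequency drifted by more than `tol`.
    with open(path) as fh:
        reference = json.load(fh)
    total = sum(counts.values())
    ref_total = sum(reference.values())
    drift = {k: abs(counts.get(k, 0) / total - v / ref_total)
             for k, v in reference.items()}
    return {k: d for k, d in drift.items() if d > tol}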

Since @ejhumphrey has chimed in on #94, and clarified that the desired functionality is really about modifying the weights in-flight on StochasticMux, I don't think #94 is a blocker on this anymore.

Revisiting why this issue is still open: I think the paper in progress (#32) and the tests merged with #100 sufficiently address the points. The tests could always be expanded -- I don't think we need/want a separate repo for that -- but I don't think that should block this issue. Any objection to closing?

No objection