Repeated multiprocess simulate produces the same answers every time
Closed this issue · 11 comments
Test program:
import struct
import bayeslite
seed = struct.pack('<QQQQ', 0, 0, 0, 2)
bdb = bayeslite.bayesdb_open("foo.bdb", seed=seed)
bdb.sql_execute("CREATE TABLE IF NOT EXISTS test_t (x)")
bdb.sql_execute("INSERT INTO test_t VALUES (1)")
bdb.execute("CREATE POPULATION test_p FOR test_t WITH SCHEMA (MODEL x AS NUMERICAL)")
bdb.execute("CREATE METAMODEL test_m FOR test_p WITH BASELINE crosscat")
bdb.execute("INITIALIZE 1 MODELS FOR test_m")
for _ in range(10):
print bdb.execute("SIMULATE x FROM test_p GIVEN rowid = 1 LIMIT 3").fetchall()
Doing this produces output like this:
[(3.0227014789924476,), (-0.5771918878776363,), (-0.059509389533395185,)]
[(3.0227014789924476,), (-0.059509389533395185,), (-0.5771918878776363,)]
[(3.0227014789924476,), (-0.5771918878776363,), (-0.059509389533395185,)]
[(-0.059509389533395185,), (-0.5771918878776363,), (3.0227014789924476,)]
[(-0.059509389533395185,), (3.0227014789924476,), (-0.5771918878776363,)]
[(-0.059509389533395185,), (-0.5771918878776363,), (3.0227014789924476,)]
[(-0.059509389533395185,), (3.0227014789924476,), (-0.5771918878776363,)]
[(3.0227014789924476,), (-0.059509389533395185,), (-0.5771918878776363,)]
[(-0.5771918878776363,), (3.0227014789924476,), (-0.059509389533395185,)]
[(-0.059509389533395185,), (3.0227014789924476,), (-0.5771918878776363,)]
Note that, while the order of the outputs varies, the actual numbers present are always the same. I would expect repeated simulate queries to produce different answers.
Useful further information:
- Changing the initial seed changes what the numbers are, but does not change the pattern that they are always the same
- If one gets rid of the default metamodels and instead installs a non-multiprocess version of CGPM, the simulation results become different from one another, so that constitutes a workaround
- By adding debug prints, I narrowed the problem down to
cgpm.crosscat.engine.Engine.simulate
: the_set_seeds()
method appears to produce different seeds on each invocation, as would be expected, but then the returned samples are still all the same (and, at this point, in a consistent order, so their reordering must occur elsewhere).
I'll bet if you turn off multiprocessing with bdb.metamodels['cgpm'].set_multiprocess(False)
the samples will no longer be identical.
Cause of the isseu: _set_seeds
sets the seed of each crosscat.State in the Engine, but the Dim and View objects inside the state do not have their rngs updated recursively (a bug). When calling engine.simulate
with multiprocess, the rngs of the crosscat.State objects in the child process will have their rngs modified, but this rng is not pushed back back up to the master process, only the samples themselves are returned (which was the original intention of doing the expensive set_seeds
operation, so that each crosscat.State has a fresh rng on calling simulate).
Solution 1: Rather than update the state.rng
attribute here call a function state.set_rng
which will take care to propogate the rng through all the nested cgpms that comprise the state.
Solution 2: Make engine._evaluate
return a tuple of the answer plus the new rng, to avoid creating rngs in the Engine over and over.
Stupid question: Why do the Dim and View objects store their own rngs at all?
Separately from that, solution 3 would be to refactor State, Dim, and View to each store not an rng, but a reference to a mutable box containing an rng, such that they are all shared and can be efficiently mutated in O(1) by _set_seeds. Down side: shared mutable state.
Stupid question: Why do the Dim and View objects store their own rngs at all?
It's not a stupid question. Basically Dim and View objects may (and often do) exist in isolation, without the assumption that Dim is embedded in View, and View is embedded in State (which holds the rng). I had two choices when deciding how to supply entropy to Dim/View (and even State); one option was to give each method which needs randomness an explicit rng
or seed
parameter, and the second option was to supply an rng
object in __init__
. I selected latter option, since it made the interface easier and less expectations on the client to give an rng on a per-call basis.
Interesting. I regularly find myself making the opposite choice for the same reason -- including the rng in the interface makes it clear which methods might be random, and eliminates a piece of mutable state. Which mutable state is the cause of the present difficulty.
I agree, and I probably would have done the same as your choice implementing rng as a parameter had I been making this choice a second time round.
A third option is to provide each CGPM with a reference to an rng e.g. in a box.
I highly recommend passing independent seeds, rather than RNG boxes, into the interfaces in question. No need to coordinate state (which may be tricky if subprocesses are involved!); subtrees can be recomputed independently; simpler interface itself, at the cost of a modicum of extra work at either side of the interface to draw a seed and seed an RNG.
Alright, how should the function use the seed? Certainly it should not create a new rng object using that seed? I guess it would set the seed of an embedded rng using
Is setting the seed an expensive operation?
In a good RNG like https://mumble.net/~campbell/python/weakprng.py (which we use in some places) setting the seed is practically free.
In numpy.random it's not free, but unless it's observed to be a bottleneck we just do that anyway. (Mostly it has not been a bottleneck, but there were a couple cases where we just deferred creating the RNG until we proved we needed it, and that made the bottleneck go away.) Can't imagine it costs anywhere near what fork does anyway, if you're doing it across a process boundary.
The fix for this issue #217 is therefore to change
https://github.com/probcomp/cgpm/blob/20170124-fsaad-engine-seed/src/crosscat/engine.py#L243
to state.rng.seed(n)
, which will propagate down to all sub-cgpms recursively without any additional work since the rng is a shared object and essentially a box itself.
A separate ticket should be filed for refactoring the interface to accept seed arguments