fraenkel-lab/OmicsIntegrator

Forest reproducibility with seed

Closed this issue · 2 comments

Running Forest can unexpectedly produce different results on different hardware, even when a seed for the pseudo-random number generator is specified and msgsteiner is run with a single thread. At this time, the root cause seems to be a combination of two factors:

  • Forest sends input to msgsteiner in an arbitrary order. When run with the same seed on the same hardware, the order in which the edge and prize node dictionaries are iterated over tends to be the same. However, running Forest on different machines changes the iteration order.
  • msgsteiner is sensitive to the order of the input data. I have not verified this in the code, but most likely the order of the input data affects how the graphical model is constructed and the order in which messages are passed. Our hypothesis is that msgsteiner will produce identical output if the seeds are set, the input data order is fixed, and a single thread is used.

@jpgulliver has a branch with a potential fix (c92f721), but we need to test it further before creating a pull request. The change sorts the msgsteiner input file so that its order is deterministic, and has Forest sort its output when writing files for Cytoscape.
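
For anyone following along, here is a minimal sketch of the idea behind the fix: iterate over the edge and prize dictionaries in sorted order rather than in whatever order the dictionaries happen to yield. This is not the code from c92f721. The function name `build_msgsteiner_input`, the `E`/`W`/`R` line prefixes, and the `DUMMY` root name are assumptions made for illustration only.

```python
def build_msgsteiner_input(edges, prizes, root="DUMMY"):
    """Return msgsteiner-style input text in a deterministic order.

    edges:  dict mapping (node_a, node_b) tuples to edge costs
    prizes: dict mapping node names to prize values
    The line prefixes and the root name are illustrative assumptions,
    not taken from Forest's actual code.
    """
    lines = []
    # Sorting the items removes any dependence on dictionary iteration order.
    for (node_a, node_b), cost in sorted(edges.items()):
        lines.append("E %s %s %s" % (node_a, node_b, cost))
    for node, prize in sorted(prizes.items()):
        lines.append("W %s %s" % (node, prize))
    lines.append("R %s" % root)
    return "\n".join(lines) + "\n"


# The same data inserted in two different orders gives identical text.
edges_first = {("B", "C"): 0.5, ("A", "B"): 0.2}
edges_second = {("A", "B"): 0.2, ("B", "C"): 0.5}
prizes_first = {"C": 1.0, "A": 2.0}
prizes_second = {"A": 2.0, "C": 1.0}

text_first = build_msgsteiner_input(edges_first, prizes_first)
text_second = build_msgsteiner_input(edges_second, prizes_second)
print(text_first == text_second)
```

Sorting only fixes the order of the input. Whether msgsteiner then produces identical output is still the hypothesis described above, and it also requires a fixed seed and a single thread.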

The tests pass on additional hardware so we will prepare a pull request soon. This will also partially address issue #1 by introducing a test framework for Forest.

Pull request #9 fixes this issue.