beware of recapitation and missing data
petrelharp opened this issue · 1 comments
A thing that caught me by surprise: now, if you simulate a chunk of a contig, you'll get 'missing data' outside that contig. This is represented as, basically, "no trees there". However, this will have ts.first().num_roots == ts.num_samples
(and this is the reason for the root_threshold
argument). This is fine and expected, if confusing - so maybe needs some documentation.
A particular gotcha is that if someone notices the ts doesn't look recapitated (although it is!), and recapitates again with pyslim.recapitate(ts, ...)
, then this will wrongly simulate coalescent history on the bits of chromosome that should be missing. I'm not sure if there's anything we can do to prevent this besides some documentation.
Example:
import stdpopsim
L = 1e6
engine = stdpopsim.get_engine("slim")
species = stdpopsim.get_species("CanFam")
contig = species.get_contig('1', left=1e7, right=1e7 + L)
model = stdpopsim.PiecewiseConstantSize(species.population_size, (100, 100))
samples = {"pop_0": 100}
ts = engine.simulate(model, contig, samples, slim_burn_in=100)
print([t.num_roots for t in ts.trees()])
gets
$ python test.py
[200, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 200]