Should post-SLiM simplify have `keep_input_roots=True`?
After we run SLiM, we simplify down to the requested samples, here:
https://github.com/popsim-consortium/stdpopsim/blob/main/stdpopsim/slim_engine.py#L1764
We do not pass in the `keep_input_roots` argument (see the sketch below). This means that:
- any neutral substitutions that happened during the SLiM portion of the simulation will not be present (neutral mutations are added by msprime post-simplification, and the branches above each tree's MRCA, where those substitutions would fall, are gone)
- the tree sequence is maybe less useful for some downstream applications? This is not so much of an issue because we already recapitate, so I can't actually think of any concrete applications.
Numbers of substitutions are sometimes important to know, but these would be substitutions relative to the time we started running SLiM from, which might not be the right reference time. So it's not so clear to me.
Cons: this will make the resulting tree sequence bigger, and confuse people (since now the "roots" of the resulting trees won't actually be the roots-where-everything-coalesces).
Given the last point I'm inclined to say "no", or maybe make it optional.
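For reference, a minimal sketch of the difference (variable names like `ts` and `sample_nodes` are illustrative here, not the actual ones in `slim_engine.py`):

```python
# ts: the tree sequence produced by SLiM
# sample_nodes: node IDs of the requested samples

# Current behaviour: everything above each tree's MRCA of the samples is pruned away.
simplified = ts.simplify(samples=sample_nodes)

# With keep_input_roots=True, the unary path from each MRCA up to the original
# (input) roots is retained, so msprime's neutral mutations can also land on those
# branches, producing substitutions from the SLiM portion of the simulation.
simplified_with_roots = ts.simplify(samples=sample_nodes, keep_input_roots=True)
```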
for context: @silastittes and I ran into this while trying to debug DFE inference results, which required the numbers of neutral and non-neutral substitutions
This is for my own understanding, but does that make the root length = total sim time minus each tree's TMRCA?
> This is for my own understanding, but does that make the root length = total sim time minus each tree's TMRCA?
Yup, that should be the case. So you could retroactively calculate the expected number of neutral substitutions given the total simulation time + the simplified trees. This would be something like:
```python
exp_neutral_subs = 0.0
for t in ts.trees():
    if t.num_edges > 0:
        # branch length missing above this tree's root within the SLiM time window
        missing_branch = max(0, slim_simulation_time - t.time(t.root))
        exp_neutral_subs += missing_branch * t.span * neutral_mutation_rate
```
though this would need to be modified slightly if the neutral mutation rate varied across the sequence, I guess. Assuming that the mutation rates are in an `msprime.RateMap`, you'd replace `t.span * neutral_mutation_rate` with `neutral_ratemap.get_cumulative_mass(t.interval.right) - neutral_ratemap.get_cumulative_mass(t.interval.left)`.
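Concretely, a rough sketch of that variant (the `neutral_ratemap` here is just an illustrative, made-up `msprime.RateMap`; in practice you'd use whatever rate map the simulation actually used):

```python
import msprime

# a made-up two-epoch neutral rate map covering the whole sequence
neutral_ratemap = msprime.RateMap(
    position=[0, ts.sequence_length / 2, ts.sequence_length],
    rate=[1e-8, 2e-8],
)

exp_neutral_subs = 0.0
for t in ts.trees():
    if t.num_edges > 0:
        missing_branch = max(0, slim_simulation_time - t.time(t.root))
        # cumulative mutation mass over this tree's interval replaces span * rate
        tree_mass = neutral_ratemap.get_cumulative_mass(
            t.interval.right
        ) - neutral_ratemap.get_cumulative_mass(t.interval.left)
        exp_neutral_subs += missing_branch * tree_mass
```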
> Given the last point I'm inclined to say "no"
That's my vote, too: for one-off debugging it seems easier to modify a fork to get whatever's needed than to introduce another option and even more SLiM tests ...
It's probably easiest for @silastittes and me to just calculate the expected value within the analysis2 pipeline for now, but I do think `keep_input_roots` would be a good option to have.
Now that I understand the application, I think it's a good idea to be able to get the number of substitutions on the branch to an arbitrarily distant outgroup somehow. But I'm not sure that a `keep_input_roots` argument is the way to do this, because it couples the SLiM burn-in and the divergence time of the outgroup. For example, if there's selection, then we'd want some burn-in before the separation from the outgroup, because otherwise the stochastic process generating substitutions will take some time to reach stationarity.
The ideal way would be to have the outgroup as a population in the model. But that's not possible to enforce generally and seems really restrictive, since the desired divergence time to the outgroup might change depending on the application. An alternative would be an option to add an outgroup at an arbitrary time (e.g. a population with no migration to the other populations, a fixed size, and a divergence time that exceeds any of the demographic events in the model).
So for example,
```python
model = HomSap.get_demographic_model("whatever")
model.add_outgroup(name="some_hominid", time=long_long_ago, size=not_very_big)
samples = {..., "some_hominid": 1}
engine.simulate(demographic_model=model, samples=samples, ...)
```
The downside is that this increases computational overhead? But that could be largely avoided by setting the outgroup population size to 1.
this is probably the way to go, but it would need a pretty major refactor of the code. I think the thing to do is to shelve this for the coming paper/release and revisit it at a later time.
I think we'd just need to add a method to `DemographicModel`, which wouldn't have that big a footprint, but it'd definitely require a lot of testing and back-and-forth (and I'm sure there are issues I'm not thinking of); so yeah, not going to happen quickly.
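To make that concrete, here is a very rough sketch of what such a method might do on the underlying msprime demography; `add_outgroup`, its arguments, and the assumption that the stdpopsim model exposes its `msprime.Demography` are all hypothetical, not existing API:

```python
import msprime


def add_outgroup(demography, root_population, name, divergence_time, size):
    """Hypothetical helper: splice an outgroup into an msprime.Demography.

    Assumes `divergence_time` predates every existing event in the model and
    that `root_population` is the model's most ancestral population.
    """
    # the outgroup itself: constant size, no migration with anything else
    demography.add_population(name=name, initial_size=size)
    # a new ancestral population for the outgroup/ingroup split
    ancestral = f"anc_{name}"
    demography.add_population(name=ancestral, initial_size=size)
    demography.add_population_split(
        time=divergence_time, derived=[root_population, name], ancestral=ancestral
    )
    # events must be in time order before simulating
    demography.sort_events()
    return demography
```

The `model.add_outgroup(...)` call in the example above would then just forward to something like this on whatever attribute of `DemographicModel` holds the msprime demography.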
My naive (and almost certainly wrong) take on this is that the arbitrary decision of where to place the root time is all that matters. The length of that branch is what dictates the number of substitutions, right? Does there need to be any information about the outgroup samples at all for this purpose? I agree that being forced to use the SLiM burn-in time doesn't make sense, and is demonstrably too short in some cases.