Should post-SLiM simplify have `keep_input_roots=True`?
After we run SLiM, we simplify down to the requested samples, here:
https://github.com/popsim-consortium/stdpopsim/blob/main/stdpopsim/slim_engine.py#L1764
We do not pass in the `keep_input_roots` argument (see the sketch below). This means that:
- any neutral substitutions that happened during the SLiM portion of the simulation will not be present (neutral mutations are added by msprime post-simplification, and the branches above each tree's MRCA, where those substitutions would fall, are gone)
- the tree sequence is maybe less useful for some downstream applications? This is not so much of an issue because we already recapitate, so I can't actually think of any concrete applications.
Numbers of substitutions are sometimes important to know, but these would be substitutions relative to the time we started running SLiM from, which might not be the right reference time. So it's not so clear to me.
Cons: this will make the resulting tree sequence bigger, and confuse people (since now the "roots" of the resulting trees won't actually be the roots-where-everything-coalesces).
Given the last point I'm inclined to say "no", or maybe make it optional.
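For reference, a minimal sketch of the difference (variable names like `ts` and `sample_nodes` are illustrative here, not the actual ones in `slim_engine.py`):

```python
# ts: the tree sequence produced by SLiM
# sample_nodes: node IDs of the requested samples

# Current behaviour: everything above each tree's MRCA of the samples is pruned away.
simplified = ts.simplify(samples=sample_nodes)

# With keep_input_roots=True, the unary path from each MRCA up to the original
# (input) roots is retained, so msprime's neutral mutations can also land on those
# branches, producing substitutions from the SLiM portion of the simulation.
simplified_with_roots = ts.simplify(samples=sample_nodes, keep_input_roots=True)
```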
for context: @silastittes and I ran into this while trying to debug DFE inference results, which required the numbers of neutral and non-neutral substitutions
This is for my own understanding, but does that make the root length = total sim time minus each tree's TMRCA?
> This is for my own understanding, but does that make the root length = total sim time minus each tree's TMRCA?
Yup, that should be the case. So you could retroactively calculate the expected number of neutral substitutions given the total simulation time + the simplified trees. This would be something like:
```python
exp_neutral_subs = 0.0
for t in ts.trees():
    if t.num_edges > 0:
        # branch length missing above this tree's root within the SLiM time window
        missing_branch = max(0, slim_simulation_time - t.time(t.root))
        exp_neutral_subs += missing_branch * t.span * neutral_mutation_rate
```
though this would need to be modified slightly if the neutral mutation rate varied across the sequence, I guess. Assuming that the mutation rates are in an `msprime.RateMap`, you'd replace `t.span * neutral_mutation_rate` with `neutral_ratemap.get_cumulative_mass(t.interval.right) - neutral_ratemap.get_cumulative_mass(t.interval.left)`.
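Concretely, a rough sketch of that variant (the `neutral_ratemap` here is just an illustrative, made-up `msprime.RateMap`; in practice you'd use whatever rate map the simulation actually used):

```python
import msprime

# a made-up two-epoch neutral rate map covering the whole sequence
neutral_ratemap = msprime.RateMap(
    position=[0, ts.sequence_length / 2, ts.sequence_length],
    rate=[1e-8, 2e-8],
)

exp_neutral_subs = 0.0
for t in ts.trees():
    if t.num_edges > 0:
        missing_branch = max(0, slim_simulation_time - t.time(t.root))
        # cumulative mutation mass over this tree's interval replaces span * rate
        tree_mass = neutral_ratemap.get_cumulative_mass(
            t.interval.right
        ) - neutral_ratemap.get_cumulative_mass(t.interval.left)
        exp_neutral_subs += missing_branch * tree_mass
```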
> Given the last point I'm inclined to say "no"
That's my vote, too: for one-off debugging it seems easier to modify a fork to get whatever's needed than to introduce another option and even more SLiM tests ...
It's probably easiest for @silastittes and me to just calculate the expected value within the analysis2 pipeline for now, but I do think `keep_input_roots` would be a good option to have.
Now that I understand the application, I think it's a good idea to be able to get the number of substitutions on the branch to an arbitrarily distant outgroup somehow. But I'm not sure that a `keep_input_roots` argument is the way to do this, because it couples the SLiM burn-in and the divergence time of the outgroup. For example, if there's selection, then we'd want some burn-in before the separation from the outgroup, because otherwise the stochastic process generating substitutions will take some time to reach stationarity.
The ideal way would be to have the outgroup as a population in the model. But that's not possible to enforce generally and seems really restrictive, since the desired divergence time to the outgroup might change depending on the application. An alternative would be an option to add an outgroup at an arbitrary time (e.g. a population with no migration to the other populations, a fixed size, and a divergence time that exceeds any of the demographic events in the model).
So for example,
```python
model = HomSap.get_demographic_model("whatever")
model.add_outgroup(name="some_hominid", time=long_long_ago, size=not_very_big)
samples = {..., "some_hominid": 1}
engine.simulate(demographic_model=model, samples=samples, ...)
```
The downside is that this increases computational overhead? But that could be largely avoided by setting the outgroup population size to 1.
this is probably the way to go, but it would need a pretty major refactor of the code. I think the thing to do is to shelve this for the coming paper/release and revisit it at a later time.
I think we'd just need to add a method to `DemographicModel`, which wouldn't have that big a footprint, but it'd definitely require a lot of testing and back-and-forth (and I'm sure there are issues I'm not thinking of); so yeah, not going to happen quickly.
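To make that concrete, here is a very rough sketch of what such a method might do on the underlying msprime demography; `add_outgroup`, its arguments, and the assumption that the stdpopsim model exposes its `msprime.Demography` are all hypothetical, not existing API:

```python
import msprime


def add_outgroup(demography, root_population, name, divergence_time, size):
    """Hypothetical helper: splice an outgroup into an msprime.Demography.

    Assumes `divergence_time` predates every existing event in the model and
    that `root_population` is the model's most ancestral population.
    """
    # the outgroup itself: constant size, no migration with anything else
    demography.add_population(name=name, initial_size=size)
    # a new ancestral population for the outgroup/ingroup split
    ancestral = f"anc_{name}"
    demography.add_population(name=ancestral, initial_size=size)
    demography.add_population_split(
        time=divergence_time, derived=[root_population, name], ancestral=ancestral
    )
    # events must be in time order before simulating
    demography.sort_events()
    return demography
```

The `model.add_outgroup(...)` call in the example above would then just forward to something like this on whatever attribute of `DemographicModel` holds the msprime demography.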
My naive (and almost certainly wrong) take on this is that the arbitrary decision of where to place the root time is all that matters. The length of that branch is what dictates the number of substitutions, right? Does there need to be any information about the outgroup samples at all for this purpose? I agree that being forced to use the SLiM burn-in time doesn't make sense, and is demonstrably too short in some cases.