luntergroup/analysis

Pull Request Draft

Closed this issue · 0 comments

Hello everyone,

I'm a PhD student working with Gerton Lunter, and here I've added our algorithm smcsmc to the analysis pipeline. As the input format is similar to msmc, there is not a lot of additional overhead needed.

Details about the algorithm may be found here

I have been able to successfully run through the whole pipeline with five replicates of the Gutenkunst model, with comparable accuracy to msmc. Here's a plot with the Gutenkunst model in black, looking at the 2nd (European-acting) population. I ran this through for both 2 and 4 haploids samples.

image

There's a major caveat here that we have not yet publicly released our source code for smcsmc so I do not expect this to be able to be verified and merged until we do. However, this code is functional with our (private) package. The smcsmc code is packaged in conda so the executables needed (and the python frontend used here) are installed together such that the required dependencies are always available, so once we do release it, which should be in the very near future, it should be easy to run an identical analysis.

Changes Summary

  • Created a new set of rules relating to smcsmc essentially duplicating the rules for the msmc analysis with different scripts. The structure is the same where the overall rule is compound_smcsmc which creates a plot. I've added a guide to this plot, which is slightly different than msmc.
    • I've accordingly added fields to the config.json and put functional ones in the n_t/README.md
  • Added plotting scripts to n_t/plots.py for plotting replicates of smcsmc.
  • (Untested) Add smcsmc plots to plots.plot_all_ne_estimates. I haven't had a chance to run the entire pipeline so I haven't had a chance to test this yet.

Other

  • I've created a separate submodule within our python package for PopSim related code. This is probably best packaged here, but I haven't figured out the best way to do this yet. One of these functions is found in the Snakefile and the submodule is just smcsmc.popsim. Right now it's pretty basic.
  • Working with a list of haplotype numbers works fine (like MSMC) as smcsmc can do any number of haplotypes (though performance wise, 8 is a soft cap) but
  • I haven't tried this with any non-human simulations, nor have I adapted it for the 2 population case yet. These are on my TODO list.
  • I also fixed a very small bug in the Snakefile where the output was being put into the config file directory rather than the specified output directory in the config file. Just a typo.

As I said before, we aren't expecting this to be merged now, but we hope that submitting it now will give everyone some time to check it out. We hope to join in the next conference call and discuss this a little bit.

Thank you everyone,

Chris