The `posseleff_simulations` repository is designed to assess the impact of
positive selection on Identity by Descent (IBD)-based inferences, leveraging
population genetic simulation and true IBD methodologies. The pipeline begins by running
simulations to generate tree sequences, allowing for both single and multiple
population models for various analyses. The selection simulation can be tailored
with parameters such as the selection coefficient, number of origins of the
favored mutation, selection starting time, and high-relatedness simulation
options. From there, the pipeline involves calling true IBD segments from the
tree sequence, followed by IBD processing for generating input files, selection
correction, and calling IBDNe and Infomap for Ne and population structure
inference. The pipeline is highly configurable, accommodating different
scenarios and requirements, and produces detailed output for further analysis.
Installation and execution instructions are provided, as well as options for
customization, making it a versatile tool for many IBD-related analyses.
- Run simulation and generate tree sequence
  - Two demographic models
    - Single population model for IBD distribution and Ne analysis, and
    - Multiple population model for population structure analysis
  - Selection simulation:
    - Each of the above models allows for positive selection simulation
    - Tunable simulation parameters include
      - selection coefficient
      - number of origins of the favored mutation
      - selection starting time (generations)
  - High-relatedness/inbreeding simulations
    - support inbreeding modeling via three strategies:
      - shrinking population size
      - positive assortative mating
      - selfing
    - see `simulations/Readme.md` for examples.
- Call true IBD segments from the tree sequence
- IBD processing and selection correction
  - IBD processing for generating input files for IBDNe (Ne estimation)
  - IBD processing for generating IBD for calling Infomap (population structure)
  - Identify and validate IBD peaks (due to selection)
  - Remove IBD within validated IBD peak regions and generate a selection-corrected version of IBD for calling IBDNe and Infomap
- Call IBDNe and Infomap for Ne and population structure inference
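
The selection-correction step listed above (removing IBD that falls within validated peak regions) can be illustrated with a minimal Python sketch. It is not the repository's implementation: the segment tuple layout, the function names `overlaps` and `remove_peak_segments`, and the rule of dropping any segment that overlaps a peak (rather than trimming it) are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical layouts, chosen only for this sketch:
Segment = Tuple[str, str, int, int]  # (sample_a, sample_b, start_bp, end_bp)
Region = Tuple[int, int]             # (start_bp, end_bp), half-open interval


def overlaps(seg_start: int, seg_end: int, region: Region) -> bool:
    """Return True if [seg_start, seg_end) intersects the half-open region."""
    region_start, region_end = region
    return seg_start < region_end and region_start < seg_end


def remove_peak_segments(segments: List[Segment], peaks: List[Region]) -> List[Segment]:
    """Keep only segments that do not overlap any validated peak region.

    Assumption: an overlapping segment is dropped entirely; the actual
    pipeline may handle such segments differently.
    """
    return [
        seg for seg in segments
        if not any(overlaps(seg[2], seg[3], peak) for peak in peaks)
    ]


# Toy data: three IBD segments on one chromosome and one validated peak.
segments = [
    ("s1", "s2", 0, 100_000),
    ("s1", "s3", 450_000, 700_000),    # overlaps the peak, so it is removed
    ("s2", "s3", 900_000, 1_200_000),
]
peaks = [(500_000, 600_000)]

corrected = remove_peak_segments(segments, peaks)
```

In this toy example, `corrected` keeps the first and third segments, giving a selection-corrected IBD set of the kind that would be passed on to IBDNe and Infomap.
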
The pipeline has been tested on Linux and can be adapted to macOS with minor changes. Software dependencies and their version numbers are specified in the `./env.yaml` Conda recipe. Additional dependencies that are not available from Conda are specified in the installation instructions below. The overall installation time is about 5-15 minutes.
To create the software environment:
- Install Nextflow (see the Nextflow documentation).
- Install Conda if you have not already done so.
- Install the software: `python3 ./init.py`
- Activate the `simulation` environment: `conda activate simulation`
- Run the pipeline: `nextflow ./main.nf -profile sge --num_reps 30 -resume`
  - For large datasets, using a cluster such as SGE is recommended. An example `sge` profile is provided in the `nextflow.config` file and should be adjusted to fit your cluster system. If running on a local computer, remove the `-profile sge` option from the above command.
  - The pipeline can be reconfigured in the following files:
    - Pipeline parameters can be found in the top lines of `main.nf`
    - More simulation parameters can be found in the definitions of the `sp_defaults`, `mp_defaults`, `sp_sets` and `mp_sets` dictionaries within the `main.nf` file
    - For more complicated or large-scale simulations, the `--sp_sets_json` and `--mp_sets_json` arguments and the `-entry` option are recommended. See `simulations/Readme.md` for examples.
  - No input or test data is needed. If desired, the pipeline can be reconfigured as mentioned above. The pipeline can be tested by commenting out all but one entry in `sp_sets` or `mp_sets`.
  - Output files/folders:
    - each subfolder of `resdir` represents simulations for a set of chromosomes
    - within each subfolder:
      - `ifm_output` contains Infomap results
      - `ne_output` contains IBDNe estimates
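
To browse the results after a run, the output layout described above can be walked with a short Python sketch. Only the folder names `ifm_output` and `ne_output` come from this README; the helper name `collect_outputs`, the location of the result directory, and the assumption that each folder holds regular files are hypothetical.

```python
from pathlib import Path
from typing import Dict, List

# Folder names taken from the README; everything else is an assumption.
OUTPUT_FOLDERS = ("ifm_output", "ne_output")


def collect_outputs(resdir: Path) -> Dict[str, Dict[str, List[Path]]]:
    """Map each simulation subfolder of `resdir` to the files found in its
    Infomap (`ifm_output`) and IBDNe (`ne_output`) folders.

    A missing output folder yields an empty list instead of an error.
    """
    results: Dict[str, Dict[str, List[Path]]] = {}
    for subfolder in sorted(p for p in resdir.iterdir() if p.is_dir()):
        entry: Dict[str, List[Path]] = {}
        for name in OUTPUT_FOLDERS:
            folder = subfolder / name
            if folder.is_dir():
                entry[name] = sorted(f for f in folder.rglob("*") if f.is_file())
            else:
                entry[name] = []
        results[subfolder.name] = entry
    return results


# Usage (the path is a placeholder for wherever your run wrote its results):
# for sim, outputs in collect_outputs(Path("resdir")).items():
#     print(sim, len(outputs["ifm_output"]), len(outputs["ne_output"]))
```

Returning plain lists of paths keeps the sketch independent of the file formats written by Infomap and IBDNe, which are not described here.
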
If you find this repository useful, please cite our preprint:
Guo, B., Borda, V., Laboulaye, R., Spring, M. D., Wojnarski, M., Vesely, B. A., Silva, J. C., Waters, N. C., O'Connor, T. D., & Takala-Harrison, S. (2023). Strong Positive Selection Biases Identity-By-Descent-Based Inferences of Recent Demography and Population Structure in Plasmodium falciparum. bioRxiv : the preprint server for biology, 2023.07.14.549114. https://doi.org/10.1101/2023.07.14.549114
Other citations:
- iHS statistics and calculation:
  - iHS calculation via scikit-allel: Miles, A. et al. cggh/scikit-allel: v1.3.7. (2023) doi:10.5281/ZENODO.8326460.
  - iHS statistics: Voight, B. F., Kudaravalli, S., Wen, X. & Pritchard, J. K. A map of recent positive selection in the human genome. PLoS Biol. 4, e72 (2006).
- IBDNe: Browning, S. R., & Browning, B. L. (2015). Accurate Non-parametric Estimation of Recent Effective Population Size from Segments of Identity by Descent. American journal of human genetics, 97(3), 404–418. https://doi.org/10.1016/j.ajhg.2015.07.012
- Infomap algorithm: Rosvall, M., & Bergstrom, C. T. (2008). Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences of the United States of America, 105(4), 1118–1123. https://doi.org/10.1073/pnas.0706851105