Untangling human migration through simulated admixture
Collection of python and shell scripts for performing msprime simulations relevant for evaluating admixture and back migration events.
All required packages are listed in environment.txt.
To create a conda environment for the project on Della, first activate anaconda3:
module load anaconda3
The conda environment is then created with:
conda create -n msprime_scripts --file msprime.yml
Prior to running on Della, you must activate anaconda3:
module load anaconda3
The conda environment is then activated with:
conda activate msprime_scripts
With the correct environment, an admixture simulation is started simply with
python Admixture_Simulation.py
to use all default values. Several options are available and can be displayed with the -h flag. Most affect parameters of the msprime simulation. Output options include the debug output, options listing, haplotype call file, ILS call file, vcf and f4dstat information. By default, the debug information will be printed to stdout and the script will exit. Flags can activate other file types including the output file names. For debug, options, haplo and ils, excluding a file name will cause them to output to stdout. Finally, specifying only an output directory will generate all file types with preset names.
The unit tests for msprime_scripts are run with pytest
from anywhere in the directory.
Slurm jobs are submitted using the corresponding submit script within the slurmJobs directory. The behavior of the submission is customized within the submit script which performs some basic operations before submitting the slurm array job.
A snakemake workflow is available for performing numerous simulations and analyzing the results with S star and match p value. Two configurations are provided for the Princeton clusters della and gencomp1. Execution is controlled through config files in the snakefiles directory. The run scripts will begin a workflow execution. The loop_run script gives an example of how to run several instances with different parameters in a sequential mode.
Dependencies of the snakemake workflow contain bugs which should be corrected for best performance.
The archaic match __main__.py has a bug where Nonetype callsets are not handled properly. After the conda environments are created, find the offending file and change line 390 to:
if callset is None or 'calldata/GT' not in callset:
Snakemake, as of version 5.4.0, has errors when attempting to restart group jobs which are partially completed. A key error is generated in the dag.py file which can be corrected by changing line 818 to:
stop = lambda j: j.group != job.group or j not in self.needrun_jobs
TBD