- update scripts for model choice
- add 3- and 4-population pipelines
- remove unnecessary scripts
A whole pipeline to perform coalescent simulations and ABC inferences.

This pipeline was used to reconstruct the demographic history of Atlantic salmon (Salmo salar) but works for any SNP/sequence data to perform demographic modelling.

I also provide additional scripts to reproduce the major analyses from the paper on reconstructing the history of Atlantic salmon (Salmo salar) using ABC, now published in Evolution and available here.
- compile the coalescent simulator and stats calculator
- prepare your input file
- choose the model you want to run and the appropriate priors
- run the coalescent sims
- reshape the data
- run the ABC analysis for:
    - (a) model selection
    - (b) robustness computation
    - (c) parameter estimation and goodness of fit
- (eventually run the pipeline using only barrier loci)
- (eventually draw a neutral envelope to perform outlier detection tests - I will add this later)
Depending on your architecture there are more or less efficient ways to run the code; icc compilation will make the code run faster. Command lines for compilation are provided in:

- compile_msnsam.sh, located in the src/msnsam_code folder
- compile_mscalc.sh, located in the src/mscalc_code folder
- Alternatively, for the coalescent simulator (msnsam) you can type:

```bash
gcc -O2 -o msnsam msnsam.c rand1.c streec.c -lm
gcc -o samplestats tajd.c sample_stats.c -lm
gcc -o mean_std stats.c -lm
```

If you have access to an icc compiler you can speed up the code:

```bash
icc -O3 -xHost -o ../../bin/msnsam msnsam.c rand1.c streec.c -lm
```
- Then, for the summary statistics computer (mscalc):

```bash
gcc *.c -o mscalc -lm
```

Again, with icc you can speed up the code:

```bash
icc -O3 -xHost *.c -o ../../bin/mscalc -lm
```
The data I used for the ABC analysis of Salmo salar are in a companion github repository. Download that repository, store the data in 01-salmon_data, and run:

```bash
./00-scripts/utility_scripts/prepare_abc.sh 01-salmon_data/salmon.data.ped 01-salmon_data/salmon.data.map
```

All the data will be created in different folders.
Now these data are ready for coalescent simulations. You can copy the folders 00-scripts/, bin/, results/ and src/ into every pairwise folder where you want to run simulations, as sketched below.
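For instance, a minimal sketch of that copy step (the pop*_pop* folder pattern is hypothetical; adapt it to your own layout):

```bash
# copy the pipeline folders into each pairwise working directory
# (the pop*_pop* pattern is a placeholder - use your own folder names)
for pair in pop*_pop*; do
    cp -r 00-scripts bin results src "$pair"/
done
```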
The original data are available on Dryad: doi:10.5061/dryad.gm367 and doi:10.5061/dryad.sb601. If you want, you can download them, merge them and filter them in plink using the options:

- --geno 0.05
- --mind 0.05
- --noweb

This will result in a dataset of ~5000 SNPs and 2035 individuals, as provided in the sister github repository.
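For reference, that filtering step could look like the following (the input and output file names here are placeholders, not the actual ones):

```bash
# drop SNPs with >5% missing genotypes and individuals with >5% missing data
# (salmon.merged / salmon.filtered are hypothetical file prefixes)
plink --file salmon.merged --geno 0.05 --mind 0.05 --noweb --recode --out salmon.filtered
```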
If you are too lazy to prepare the data, I also provide a ready-to-use example in the folder 00-example_input_abc of the sister github repository.
Then you'll have to choose the models and the priors associated with each model.
Four classical models are implemented:
- Strict Isolation (SI)
- Isolation with Migration (IM)
- Secondary Contact (SC)
- Ancient Migration (AM)
All models were used for the salmon data. In addition, the following models were implemented for some other projects I may publish one day:
- panmixia
- bottleneck
- equilibrium
The four classic models have different alternatives (see Roux et al. 2013 MBE, Roux et al. 2014 JEB, and Roux et al. 2016 PLoS Biol for more info). These are:

- NhomoMhomo (same migration rate M and same Ne among loci)
- NhomoMhetero (same Ne among loci but different M among loci)
- NheteroMhomo (same M but different Ne among loci)
- NheteroMhetero (different M and different Ne among loci)
For the SC models there is also the possibility to test for:

- Periodic Secondary Contacts (PSC), where populations have undergone two phases of split and admixture during the divergence process
- a post-divergence bottleneck in the daughter populations
In ms (and msnsam) all sizes are scaled by the reference population size through θ = 4*Nref*µ*L, where:

- Nref = size of a reference population
- µ = mutation rate
- L = length of the locus

so that the scaled size of population 1 is θ1/θ = (4*N1*µ*L)/(4*Nref*µ*L) = N1/Nref. Similarly, split times are expressed in units of 4*Nref generations, i.e. Τ = Tsplit/(4*Nref), and migration rates as M = 4*Nref*m. For more details please look at the msdoc.pdf manual before running your inferences.
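As an illustration, here is a minimal two-population command using these scaled parameters (msnsam keeps the ms command-line syntax; all values below are made up for the example, not taken from the pipeline scripts):

```bash
# 40 chromosomes, 1000 replicates, theta = 4*Nref*mu*L = 10;
# two demes of 20 samples each exchanging M = 4*Nref*m = 1 migrants;
# deme 1 has relative size N1/Nref = 0.5;
# the two demes merge T = Tsplit/(4*Nref) = 0.75 time units in the past
./bin/msnsam 40 1000 -t 10 -I 2 20 20 1 -n 1 0.5 -ej 0.75 2 1
```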
Once you are okay with the ms coalescent parameters, you may want to modify the priors: all priors on N1, N2, Nancestral, migration rates, split times, etc. can be edited in the 00-scripts/models/model.*.sh files, for instance as in the sketch below.
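A uniform prior draw inside one of these scripts could look like the following sketch (variable names and bounds are hypothetical, not the ones actually used in the scripts):

```bash
# draw the relative size N1/Nref uniformly in [0, 20]
# (bounds and variable names are placeholders)
N1_MIN=0
N1_MAX=20
N1=$(awk -v min="$N1_MIN" -v max="$N1_MAX" \
    'BEGIN{srand(); printf "%f\n", min+rand()*(max-min)}')
echo "drew N1 = $N1"
```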
Coalescent models are found in the 00-scripts/models/model.*.sh files. Each number corresponds to a different model as follows:
- model.1.sh = SI heterogeneous Ne
- model.2.sh = SI homogeneous Ne
- model.3.sh = AM heterogeneous Ne and heterogeneous M
- model.4.sh = AM homogeneous Ne and heterogeneous M
- model.5.sh = AM heterogeneous Ne and homogeneous M
- model.6.sh = AM homogeneous Ne and homogeneous M
- model.7.sh = SC heterogeneous Ne and heterogeneous M
- model.8.sh = SC homogeneous Ne and heterogeneous M
- model.9.sh = SC heterogeneous Ne and homogeneous M
- model.10.sh = IM heterogeneous Ne and heterogeneous M
- model.11.sh = IM homogeneous Ne and heterogeneous M
- model.12.sh = IM heterogeneous Ne and homogeneous M
- model.13.sh = SC homogeneous Ne and homogeneous M
- model.14.sh = IM homogeneous Ne and homogeneous M
- model.15.sh = SCb heterogeneous Ne and heterogeneous M
- model.16.sh = PSC heterogeneous Ne and heterogeneous M
- model.17.sh = PSCb heterogeneous Ne and heterogeneous M
- model.18.sh = PSC bottleneck heterogeneous Ne and heterogeneous M
- model.19.sh = EQ heterogeneous Ne and homogeneous M
- model.20.sh = EQ homogeneous Ne and heterogeneous M
- model.21.sh = EQ heterogeneous Ne and heterogeneous M
- model.22.sh = EQ homogeneous Ne and homogeneous M
- model.00.sh = PAN homogeneous Ne
- model.0.sh = PAN heterogeneous Ne
I ran them using the Moab scheduler on a cluster where I could launch several jobs in parallel. This is the purpose of the 00-scripts/models/model_job_array*.sh scripts, which you'll have to customize according to your own cluster.
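For reference, a hypothetical skeleton of such a job-array submission script (the directives, resources, array name and chosen model are all placeholders; Moab installations differ, so adapt everything to your cluster):

```bash
#!/bin/bash
# hypothetical Moab/Torque job-array header - adjust names and resources
#MSUB -t abc_sims[1-100]
#MSUB -l walltime=24:00:00
#MSUB -l nodes=1:ppn=1

cd "$PBS_O_WORKDIR"                 # start from the submission directory
./00-scripts/models/model.7.sh      # e.g. the SC NheteroMhetero model
```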
Then run the script:

```bash
./00-scripts/00.reshape.sh
```

This will merge all parallel simulations into one prior file and one summary-statistics file with one million rows (1 row per simulation).
You can now run the ABC analyses:

- (a) model selection
- (b) robustness computation
- (c) parameter estimation and goodness of fit
Dependencies:

- the R software with the abc package (more info here)
##### model selection

Run:

```bash
./00-scripts/01.model.selection.sh
```

You'll need some ABC background (see the reference list below).
##### robustness computation
```bash
cd compute_robustness.abc
./01-scripts/01_robustess_job.sh arg
```

where arg is the name of the model: one of si.1, si.3, am.1, am.2, am.3, am.4, im.1, im.2, im.3, im.4, sc.1, sc.2, sc.3, sc.4. This is very time consuming.
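If you want to launch them all, a simple sequential loop works (assuming each job can be started from the same directory; on a cluster you would rather submit them in parallel):

```bash
# compute robustness for every model variant in turn
for model in si.1 si.3 am.1 am.2 am.3 am.4 im.1 im.2 im.3 im.4 \
             sc.1 sc.2 sc.3 sc.4; do
    ./01-scripts/01_robustess_job.sh "$model"
done
```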
##### parameter estimation and goodness of fit

```bash
./00-scripts/03.paramestim.im.all.sh
./00-scripts/03.paramestim.sc.all.sh
./00-scripts/03.paramestim.si.all.sh
./00-scripts/03.paramestim.am.all.sh
./00-scripts/06.goodness_fit.sh    # for SC only!
./00-scripts/runall_abc_job.sh     # for Moab only
```
To run the additional analyses you will need:

- treemix, available here (relevant references here and there)
- gsnap, available here
- samtools, available here

These programs must be in your PATH.
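A quick sanity check that they are indeed visible from your shell:

```bash
# verify each required program can be found in the PATH
for prog in treemix gsnap samtools; do
    command -v "$prog" >/dev/null || echo "warning: $prog not found in PATH"
done
```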
- download the Salmo salar reference genome here and store the data in 02-additional_analysis/00-genome_alignment/01-input
- merge all chromosomes into a single fasta file (I used both the masked and unmasked versions; it did not change the results much) by running simple commands like:
```bash
# decompress the downloaded chromosome files
for i in 00-genome_alignment/01-input/*gz; do
    gunzip "$i"
done
# concatenate all chromosomes into a single fasta file
for i in 00-genome_alignment/01-input/*chrssa*fa; do
    cat "$i" >> 00-genome_alignment/01-input/genome_fa
done
```
Then run gsnap:

- build the genome:

```bash
./02-additional_analysis/00-scripts/01.gsnap_build.sh
```

- align the sequences:

```bash
./02-additional_analysis/00-scripts/02.gsnap_align.sh
```

- then remove hard clips and soft clips and order the ped files according to the marker positions in the genome (I'll document that soon)
I provide the reference fasta sequence in the sister repository
Run Treemix and f3-tests. As always, the data (ped files) to reproduce the example are located in the sister repository in 02-treemix-data. You can download them, uncompress them and store them in 02-additional_analysis/02-treemix/.
```bash
# if you have already used treemix then you'll surely have your own script
# to run the software; in this case simply prepare the data as follows
datafolder=02-additional_analysis/02-treemix
plink --file "$datafolder"/salmon.ordered.V2 --noweb --missing --freq --double-id --allow-extra-chr --chr-set 29 --within cluster.dat --out "$datafolder"/plink
gzip "$datafolder"/plink.frq.strat
02-additional_analysis/00-scripts/treemix/plink2treemix.py "$datafolder"/plink.frq.strat.gz "$datafolder"/treemix.frq.gz
# then you'll know what to do.

# alternatively, to prepare the input from the ped files and run the
# f3-tests, use the following commands:
./02-additional_analysis/00-scripts/treemix/prepare_treemix.sh   # prepares the treemix data and performs f3-tests
# then to run treemix use a script like this one:
./02-additional_analysis/00-scripts/treemix/run_treemix.sh
# note that this script is a very simplified version; I used a more
# elaborate script to run in parallel mode.
```
You can download the data in the sister repository in 03-spacemix-data and store them in 02-additional_analysis/03-spacemix/01-data.

```bash
cd 02-additional_analysis/03-spacemix
# create the input files:
./00-scripts/00.ped_to_matrice.R path/to/input/input.ped        # input.ped = the name of the ped file (with its path)
./00-scripts/01.make.allele.freq.cov.2.R path/to/input/matrix   # matrix = the matrix name (with its path)
# run spacemix (here I simply follow the guidelines from Bradburd's github)
./00-scripts/02.run.spacemix.R sample.covariance.matrix sample.size x_y.coordinates
# this takes some time and needs to be run on a cluster
./00-scripts/03.chek.spacemix.model.R arg1 arg2 arg3
# then plot the results; this is very straightforward if you follow the
# examples from the spacemix manual
```
Reviews about ABC:

- Beaumont MA, Zhang W, Balding DJ (2002) Approximate Bayesian Computation in Population Genetics. Genetics, 162, 2025–2035.
- Csilléry K, Blum MGB, Gaggiotti OE, François O (2010) Approximate Bayesian Computation (ABC) in practice. Trends in Ecology and Evolution, 25, 410–418.
Must Reads about Genome-Wide Heterogeneity:
Programs:
Salmon paper: