The code in this repository analyzes empirical WGS (Whole Genome Sequencing) data to assess the effect of positive selection on IBD (Identity by Descent)-based inference. This analysis process encompasses variant and sample filtering, multiplicity of infection inference, deconvolution and imputation (for both monoclonal and a subset of polyclonal samples), IBD calling, and IBD-based inferences of effective population size (Ne) and population structure. These functions are detailed and implemented across five modules, as described in the "Pipeline outline" section below.
The pipeline has been tested on Linux Operation system and can be easily adapted to MacOS with simple changes. Software dependencies and the version numbers are specified in the './env.yaml' Conda recipe. Additional depencies that are not available from Conda are specified in the installation instruction below. The overall installation time is about 5-30 minutes. The process is relatively long due to the compliation of the R package 'moimix' and its dependencies.
- Install Nextflow
- Install Conda
- Create a Conda evironment from a provided recipe:
conda env create -f ./env.ymal
- Activate conda environment:
conda activate empirical
- Install
moimix
package: RunR
and in R console run:
install.packages("BiocManager")
BiocManager::install("bahlolab/moimix", build_vignettes = TRUE)
- Modify (recombination rate) and install
hmmibd
:
git clone https://github.com/glipsnort/hmmIBD.git; cd hmmIBD
# checkout a specific version for reproducibility
git checkout a2f796ef8122d7f6b983ae9ac4c6fba35afcd3aa
sed -i -e 's/const double rec_rate = 7.4e-7/const double rec_rate = 6.67e-7/' hmmIBD.c
# use the complier from the `empirical` conda environtment
x86_64-conda_cos6-linux-gnu-gcc -o hmmIBD -O3 -lm -Wall hmmIBD.c
cd ..
cp hmmIBD/hmmIBD bin/hmmIBD
rm -rf hmmIBD
The entire pipeline is organized into five modules, which include:
01_filter_vcf.nf
: Filters VCF files by per-sample and per-site missingness and minor allele frequency.- This involves two rounds of filtering to help retain more samples/sites.
02_calc_pop_fws.nf
: CalculatesFws
statistics using themoimix
R package.- Data is organized into populations (Pfv6 classification).
- Within each population,
Fws
is inferred for individual samples.
03_phase_impute_monoclonal.nf
:- Filters samples and retains only monoclonal samples.
- Deconvolutes each monoclonal sample (dEploid).
- Imputes phased genotype data within each population (Beagle).
- Merges single-sample genotypes back to multiple-sample, phased & imputed VCF files.
04_phase_impute_polyclonal.nf
:- Uses a subset of phased & imputed monoclonal samples to build a deconvolution panel for polyclonal samples within each population.
- Filters samples and retains only polyclonal samples.
- Deconvolutes genotype data for each polyclonal sample using a population-specific panel (dEploid-IBD).
- Filters polyclonal samples, retaining only those with a dominant clone/genome.
- Combines phased polyclonal samples with the above monoclonal samples.
- Imputes the combined genotype data (monoclonal/polyclonal samples) within each population (Beagle).
05_ibd_ne_ifm.nf main.nf
:- Calls IBD using
hmmIBD
. - Processes IBD to prepare input (with or without selection correction) for IBDNe analysis.
- Processes IBD to prepare input (with or without selection correction) for Infomap analysis.
- Runs IBDNE to infer effective population size.
- Runs Infomap to infer population structure.
- Calls IBD using
The five modules will run sequentially if using main.nf
as the the entry point.
- Thresholds can be configured in the top lines of each module file.
- The pipeline is designed to run in parallel for large datasets.
- Each module is independent of the others, provided that valid inputs are supplied.
After install the pipeline, first prepare the input files and then run the nextflow pipeline with the input files.
The VCF file table is organized in a two-column format:
- Header: None
- Delimiter: Tab
- Column 1: Represents the genome interval
- Format for genome interval:
[Chromosome]:[Start]-[End]
- Format for genome interval:
- Column 2: Provides the full path to the joint-call VCF file corresponding to
the specified genome interval
- It's essential that VCF files for different genome intervals include the same set of samples.
- The VCF file should contain annotations for the hard filter/vqsr filter
within the FILTER columns. Refer to the output from the
snp_call_nf
pipeline for more information. - For those interested in reproducing the VCF files via
snp_call_nf
pipeline, you can find the accession numbers of WGS data for the samples used in this study within thesamples
folder.
Example file interval_vcf_table.tsv
Pf3D7_01_v3:1-459121 /full/path/to/filePf3D7_01_v3:1-459121.vqsrfilt.SNP.vcf.gz
Pf3D7_01_v3:459122-640851 /full/path/to/filePf3D7_01_v3:459122-640851.vqsrfilt.SNP.vcf.gz
Pf3D7_02_v3:1-448875 /full/path/to/filePf3D7_02_v3:1-448875.vqsrfilt.SNP.vcf.gz
Pf3D7_02_v3:448876-947102 /full/path/to/filePf3D7_02_v3:448876-947102.vqsrfilt.SNP.vcf.gz
...
Pf3D7_14_v3:2121871-2706733 /full/path/to/filePf3D7_14_v3:2121871-2706733.vqsrfilt.SNP.vcf.gz
Pf3D7_14_v3:2706734-3291936 /full/path/to/filePf3D7_14_v3:2706734-3291936.vqsrfilt.SNP.vcf.gz
Pf3D7_14_v3:568025-1073306 /full/path/to/filePf3D7_14_v3:568025-1073306.vqsrfilt.SNP.vcf.tz
If the VCF file is not in 'gz' format, use bgzip xx.vcf
to convert it to *.vcf.gz
format.
The population table is a two-column table
- Header: none
- Delimiter: tab
- The first column contains the population name
- The second column specifies the path to a file that contains all samples from the given population
- In the list file, each sample name appears on a separate line
- The sample names should be the same as in the input VCF files
- All samples in the VCF files should be contained in one of the population sample list files
Example file subpops_table.tsv
CAF subpops/CAF.txt
EAF subpops/EAF.txt
ESEA subpops/ESEA.txt
Lab subpops/Lab.txt
OCE subpops/OCE.txt
SAM subpops/SAM.txt
SAS subpops/SAS.txt
WAF subpops/WAF.txt
WSEA subpops/WSEA.txt
Example file subpops/ESEA.txt
Sample1
Sample2
Sample3
...
Samplen
The same format is shared maong all populaton sample files such as
subpops/EAF.txt
(East Africa), subpops/ESEA.txt
(Eastern Southeast Asia),
subpops/Lab.txt
(Laboratory Strains), subpops/OCE.txt
(Oceania),
subpops/SAM.txt
(South America), subpops/SAS.txt
(South Asia),
subpops/WAF.txt
(West Africa), and subpops/WSEA.txt
(Western Southeast
Asia).
conda activate empirical
nextflow ./main.nf -profile -resume -profile sge \
--input_vcf_fn ./interval_vcf_table.tsv \
--input_subpop_fn ./subpops_table.tsv \
--outdir results
For large datasets, using a cluster such as SGE is recommended. An example sge
profile is provided in the nextflow.config
file and should be adjusted to fit
your cluster system.
If run on a local computer, please remove the -profile sge
option from the
above command.
Test data for module 1-4 are not directly provided as they are extremely large but
can be easily generated by running the ena_test
test in the [snp_call_nf
pipeline']
(https://github.com/bguo068/snp_call_nf).
A small dataset is provided, however, for module 5. The test of module 5 can be run with the following command:
nextflow 05_ibd_ne_ifm.nf -profile sge --vcf 'test_data/*.vcf.gz'
- The
results/01_filt_vcf/08_maf_bcf
folder contains the final result files for module 1, i.e., VCF files filtered for samples or sites with high missingness and sites with low minor allele frequency. - The
results/02_pop_fws/04_moimix_fws
folder contains the final results for module 2, i.e., a table ofFws
statistics for each sample. Each population will have a separate table. - The
results/04_phase_impute_polyclonal/05_phased_imputed_polyclonal_monoclonal
folder contains the final result files for modules 3 and 4, i.e., combined VCF files that have been deconvoluted and imputed. Each population will have a separate VCF file. - The
results/05_ibdanalysis/02_callibd/hmmibd
folder contains the IBD results obtained usinghmmIBD
. - The
results/05_ibdanalysis/{population}/ne_output
folder contains the results of IBDNe (effective population size). - The
results/05_ibdanalysis/{population}/ifm_output
folder contains the results of Infomap (population structure).
If you find this repository useful, please cite our preprint:
Guo, B., Borda, V., Laboulaye, R., Spring, M. D., Wojnarski, M., Vesely, B. A., Silva, J. C., Waters, N. C., O'Connor, T. D., & Takala-Harrison, S. (2023). Strong Positive Selection Biases Identity-By-Descent-Based Inferences of Recent Demography and Population Structure in Plasmodium falciparum. bioRxiv : the preprint server for biology, 2023.07.14.549114. https://doi.org/10.1101/2023.07.14.549114
Other citations:
IBDNe
Browning, S. R., & Browning, B. L. (2015). Accurate Non-parametric Estimation of Recent Effective Population Size from Segments of Identity by Descent. American journal of human genetics, 97(3), 404–418. https://doi.org/10.1016/j.ajhg.2015.07.012
Infomap
algorithm
Rosvall, M., & Bergstrom, C. T. (2008). Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences of the United States of America, 105(4), 1118–1123. https://doi.org/10.1073/pnas.0706851105
dEploid
:
Zhu, S. J., Almagro-Garcia, J., & McVean, G. (2018). Deconvolution of multiple infections in Plasmodium falciparum from high throughput sequencing data. Bioinformatics (Oxford, England), 34(1), 9–15. https://doi.org/10.1093/bioinformatics/btx530
dEploid-IBD
:
Zhu SJ, Hendry JA, Almagro-Garcia J, Pearson RD, Amato R, Miles A, Weiss DJ, Lucas TC, Nguyen M, Gething PW, Kwiatkowski D, McVean G; Pf3k Project. The origins and relatedness structure of mixed infections vary with local prevalence of P. falciparum malaria. Elife. 2019 Jul 12;8:e40845. doi: 10.7554/eLife.40845. PMID: 31298657; PMCID: PMC6684230.
hmmIBD
Schaffner SF, Taylor AR, Wong W, Wirth DF, Neafsey DE. hmmIBD: software to infer pairwise identity by descent between haploid genotypes. Malar J. 2018 May 15;17(1):196. doi: 10.1186/s12936-018-2349-7. PMID: 29764422; PMCID: PMC5952413.
Beagle
Browning, B. L., Zhou, Y., & Browning, S. R. (2018). A One-Penny Imputed Genome from Next-Generation Reference Panels. American journal of human genetics, 103(3), 338–348. https://doi.org/10.1016/j.ajhg.2018.07.015
scikit-allel
(calculate iHS statistics)
https://zenodo.org/records/8326460
https://github.com/cggh/scikit-allel
For internal use only, refer to the following link: https://github.com/bguo068/posseleff_empirical_arxiv.