Helminth Population Genomic Analysis Pipeline (HPGAP) overcomes the difficulties in variant calling and enables the exploration of key aspects of helminth population genetics, including defining population structure, estimating reproductive modes and identifying loci under local adaptation.
HPGAP automates six key bioinformatic components of population genomic analysis using customised R, Perl, Python and Unix shell scripts. These scripts and associated programs are packaged into a platform-independent image so that an identical system environment can be set up anywhere.
This repo contains the implementation of HPGAP, along with its associated support scripts, libraries and example population genomic data. You can also check the installed programs and packages in the Dockerfile.
The workflow for analysis using HPGAP consists of five basic parts: 1) variant calling; 2) sample filtering; 3) inferring genetic relationships; 4) intra-population analysis; 5) identifying loci under natural selection.
- git v2.20.1
- wget
- miniconda3
- udocker v1.1.1
To install the pipeline, clone the git repository and install miniconda and udocker with the following instructions.
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
echo -e "\nyes\n\nyes\n" | bash Miniconda3-latest-Linux-x86_64.sh
export PATH=~/miniconda3/bin:$PATH
conda update -y -n base conda
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
echo "########################"
echo "# Installing git ..."
conda install -c conda-forge git=2.20.1
git clone https://github.com/vetscience/HPGAP.git # We will refer to the path that you clone the git repository as HPGAP path (eg. /home/foobar/HPGAP)
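For convenience in later commands, you may record the clone location in a shell variable (an optional helper, not required by the pipeline; the path below is hypothetical):
export HPGAP_PATH=/home/foobar/HPGAP # adjust to the actual directory you cloned into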
udocker requires either pycurl or curl to download the binaries and/or pull containers from repositories, and tar to unpack binaries and libraries. Please check https://github.com/indigo-dc/udocker/blob/master/doc/installation_manual.md for more details on installing udocker.
echo "########################"
echo "# Installing udocker ..."
curl https://download.ncg.ingrid.pt/webdav/udocker/udocker-1.1.2.tar.gz > udocker-tarball.tgz
export UDOCKER_TARBALL=$(pwd)/udocker-tarball.tgz
tar xzvf $UDOCKER_TARBALL udocker
./udocker install
mv ./udocker $HOME/bin/ # move the executable to your preferred location for binaries
wget https://github.com/proot-me/proot-static-build/raw/master/static/proot-x86_64 -P $HOME/bin
wget https://github.com/proot-me/proot-static-build/raw/master/static/proot-x86 -P $HOME/bin
chmod uog+x $HOME/bin/proot*
echo "###########################"
echo "# Setting up container ..."
udocker pull walkboy3000/hpgap:latest
udocker create --name=HPGAP_c1 walkboy3000/hpgap:latest
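To confirm that the image was pulled and the container was created, you can list them with udocker:
udocker images # should list walkboy3000/hpgap:latest
udocker ps # should list the container named HPGAP_c1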
After you create a udocker container and decide on your preferred reference genome, you can run the following commands to install the SnpEff database into the container, and revise the path of the GFF file for the following steps. For more details, please check http://snpeff.sourceforge.net/SnpEff_manual.html#databases.
- First, use any text editor to create a bash script (e.g. run.sh), then copy, paste and edit the following commands into this file. Please make sure the source files (FASTA and GFF files) are not soft links.
cd /root/miniconda3/share/snpeff-4.3k-0
# dbname is the preferred name of the SnpEff database.
dbname="Clonorchis_sinensis_henan" # example
mkdir -p "data/$dbname"
cp <the path of genome fasta file> data/$dbname/sequences.fa
cp <the path of gff file> data/$dbname/genes.gff
echo "$dbname.genome : $dbname" >> snpEff.config
snpEff build -gff3 -v $dbname
cd -
- Second, run the bash script (e.g. run.sh) with udocker to create the database in the container. Please move your source files (FASTA and GFF files) to your working directory; if they are not located in your current working directory, please mount their directory as well.
udocker run -v <your host working directory>:<your host working directory> -v <the directory of the source files>:<the directory of the source files> --env="PATH=/usr/local/bin/:/usr/bin:/bin/" HPGAP_c1 /bin/bash -c 'sh run.sh'
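For illustration, a filled-in version of the command above might look like this, assuming /home/foobar/work as the host working directory and /home/foobar/genomes as the directory holding the FASTA and GFF source files (both paths are hypothetical):
udocker run -v /home/foobar/work:/home/foobar/work -v /home/foobar/genomes:/home/foobar/genomes --env="PATH=/usr/local/bin/:/usr/bin:/bin/" HPGAP_c1 /bin/bash -c 'sh run.sh'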
To run the pipeline, first prepare the configuration file (see "setting the configuration file"). After that, you can use the following command to generate "HPGAP.main.sh", which contains the specific command lines for running each step of the analysis.
Warning: HPGAP.main.sh and allcfg.yml should be regenerated if you change the configuration file.
udocker run -v <HPGAP path>:<HPGAP path> -v <your host working directory>:<your host working directory> --env="PATH=<HPGAP path>:/usr/local/bin/:/usr/bin:/bin/" HPGAP_c1 /bin/bash -c 'HPGAP.pl --config <the path to your configuration file>'
This command will generate an HPGAP.main.sh in your working directory. You should first try to run HPGAP.main.sh step by step.
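For example, with the repository cloned to /home/foobar/HPGAP and /home/foobar/work as the working directory containing config.yml (hypothetical paths), the generation command would be:
udocker run -v /home/foobar/HPGAP:/home/foobar/HPGAP -v /home/foobar/work:/home/foobar/work --env="PATH=/home/foobar/HPGAP:/usr/local/bin/:/usr/bin:/bin/" HPGAP_c1 /bin/bash -c 'HPGAP.pl --config /home/foobar/work/config.yml'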
Usage
HPGAP.pl --config <path to the .yml config file> --run <String> [-options]
--run <String> use this option to choose one of the steps below (this option should not be used together with --step)
step0_indexing
step1_read_filtering
step1_read_mapping
step1_read_report
step1_recalibration
step1_variant_calling
step1_combine_calling
step1_variant_filtering
step2_relatedness
step3_phylogeny
step3_admixture
step4_homozygosity
step4_roh (runs of homozygosity)
step4_ld (linkage disequilibrium)
step4_slidingwindow
step4_sfs (snp frequency spectrum)
--config path to the .yml config file
--step <String> specify steps separated by semicolons (;). The names of the analyses within each step are separated by commas (,)
(e.g. "0:indexing;1:read_filtering,read_mapping,read_report,recalibration,variant_calling,combine_calling,variant_filtering;3:phylogeny,admixture;4:homozygosity,roh,ld,slidingwindow,sfs").
All the available analyses in each step:
0:indexing;
1:read_filtering,read_mapping,recalibration,variant_calling,combine_calling,variant_filtering;
2:relatedness;
3:phylogeny,admixture;
4:homozygosity,roh,ld,slidingwindow,sfs
--skipsh use this to skip running bash inside the pipeline
--help
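Example invocations (a sketch; config.yml stands for the absolute path to your own configuration file):
HPGAP.pl --config config.yml --run step1_read_mapping # run a single step
HPGAP.pl --config config.yml --step "0:indexing;1:read_filtering,read_mapping,read_report,recalibration,variant_calling,combine_calling,variant_filtering" # run a custom set of steps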
Note
1. Do not use symbolic links to files outside the mounted volume; all the data files themselves should be located within the mounted volume.
2. For each pair of FASTQ files, the second column (library or flowcell code) should always be unique.
3. All paths need to be absolute paths (a quick way to obtain one is shown below this list).
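readlink -f resolves a relative path to an absolute one, which you can then paste into the configuration file:
readlink -f Example/Input/Data/1-1_r1.fastq.gz # prints e.g. /home/foobar/HPGAP/Example/Input/Data/1-1_r1.fastq.gz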
Example
To be done
- FASTQ files of each sample
- FASTA file of all the reference sequences. All reference FASTA files should end with .fa.
- HPGAP generated or custom VCF file (including variant information of all the samples)
- GFF file of the reference genome
Here is an example configuration file. You can edit the paths and values within the file and use it for the example analysis.
Warning: YAML files are sensitive to indentation, and the colon after each key should be followed by one space. Please follow the indentation and spacing of the example configuration file carefully.
---
args:
  container: HPGAP_c1 # Specify the name of the HPGAP container
  env: PATH=/root/admixture_linux-1.3.0/:/root/gatk:/root/miniconda3/bin:/usr/local/sbin:/usr/local/bin/:/usr/sbin:/usr/bin:/sbin:/bin:<HPGAP path>:<HPGAP path>/lib:<HPGAP path>/Tools # Set up the $PATH in the container, including the paths to the cloned git directory and its lib and Tools directories.
  mount: # Map your host directories to directories in the container. The directory containing the raw data should be mounted as well. We recommend creating a local tmp directory (e.g. /home/foobar/tmp) and mounting it to the container.
    -
      host_tmp: <your tmp directory>
    -
      host_path: <HPGAP path>
    -
      host_path: <your working directory>
  outdir: <your output directory> # This directory should be located within the working directory.
  ploidy: '2' # Set 1 for haploid; set 2 for diploid.
  threads: 40 # Set the number of threads
fqdata: # Set up the sample FASTQ data
  1-1: # Sample ID
    rawdata:
      CL1-1: # Library ID. When you have more than one library, please put them together under the same sample ID.
        Flag: PE # PE (paired-end) or SE (single-end)
        PL: BGISEQ500 # Sequencing platform
        Phred: '33' # Phred scoring system of the FASTQ data
        fq1: <HPGAP path>/Example/Input/Data/1-1_r1.fastq.gz # Edit the path here so that it points to the FASTQ file (e.g. /home/foobar/Example/Input/Data/1-1_r1.fastq.gz)
        fq2: <HPGAP path>/Example/Input/Data/1-1_r2.fastq.gz
  1-2:
    rawdata:
      CL1-2:
        Flag: PE
        PL: BGISEQ500
        Phred: '33'
        fq1: <HPGAP path>/Example/Input/Data/1-2_r1.fastq.gz
        fq2: <HPGAP path>/Example/Input/Data/1-2_r2.fastq.gz
  14-1:
    rawdata:
      CL14-1:
        Flag: PE
        PL: BGISEQ500
        Phred: '33'
        fq1: <HPGAP path>/Example/Input/Data/14-1_r1.fastq.gz
        fq2: <HPGAP path>/Example/Input/Data/14-1_r2.fastq.gz
  17-1:
    rawdata:
      CL17-1:
        Flag: PE
        PL: BGISEQ500
        Phred: '33'
        fq1: <HPGAP path>/Example/Input/Data/17-1_r1.fastq.gz
        fq2: <HPGAP path>/Example/Input/Data/17-1_r2.fastq.gz
  26-1:
    rawdata:
      CL26-1:
        Flag: PE
        PL: BGISEQ500
        Phred: '33'
        fq1: <HPGAP path>/Example/Input/Data/26-1_r1.fastq.gz
        fq2: <HPGAP path>/Example/Input/Data/26-1_r2.fastq.gz
  32-1:
    rawdata:
      CL32-1:
        Flag: PE
        PL: BGISEQ500
        Phred: '33'
        fq1: <HPGAP path>/Example/Input/Data/32-1_r1.fastq.gz
        fq2: <HPGAP path>/Example/Input/Data/32-1_r2.fastq.gz
  35-1:
    rawdata:
      CL35-1:
        Flag: PE
        PL: BGISEQ500
        Phred: '33'
        fq1: <HPGAP path>/Example/Input/Data/35-1_r1.fastq.gz
        fq2: <HPGAP path>/Example/Input/Data/35-1_r2.fastq.gz
  44-1:
    rawdata:
      CL44-1:
        Flag: PE
        PL: BGISEQ500
        Phred: '33'
        fq1: <HPGAP path>/Example/Input/Data/44-1_r1.fastq.gz
        fq2: <HPGAP path>/Example/Input/Data/44-1_r2.fastq.gz
  49-1:
    rawdata:
      CL49-1:
        Flag: PE
        PL: BGISEQ500
        Phred: '33'
        fq1: <HPGAP path>/Example/Input/Data/49-1_r1.fastq.gz
        fq2: <HPGAP path>/Example/Input/Data/49-1_r2.fastq.gz
  56-1:
    rawdata:
      CL56-1:
        Flag: PE
        PL: BGISEQ500
        Phred: '33'
        fq1: <HPGAP path>/Example/Input/Data/56-1_r1.fastq.gz
        fq2: <HPGAP path>/Example/Input/Data/56-1_r2.fastq.gz
  56-2:
    rawdata:
      CL56-2:
        Flag: PE
        PL: BGISEQ500
        Phred: '33'
        fq1: <HPGAP path>/Example/Input/Data/56-2_r1.fastq.gz
        fq2: <HPGAP path>/Example/Input/Data/56-2_r2.fastq.gz
population:
  1-1: # Sample ID
    'presumed population': South # Assign population labels to your samples here (e.g. South and North populations)
  1-2:
    'presumed population': South
  14-1:
    'presumed population': South
  17-1:
    'presumed population': South
  26-1:
    'presumed population': South
  32-1:
    'presumed population': South
  35-1:
    'presumed population': South
  44-1:
    'presumed population': South
  49-1:
    'presumed population': South
  56-1:
    'presumed population': North
  56-2:
    'presumed population': North
ref:
  choose: Cs-c1 # Choose the reference you prefer for the following analysis
  db:
    Cs-c1: # Label of reference 1
      name: Cs-c1
      path: <HPGAP path>/Example/Input/Cs-c1.example.fa
    Cs-k2: # Label of reference 2
      name: Cs-k2
      path: <HPGAP path>/Example/Input/Cs-k2.example.fa
step1: # Parameter settings for step1
  variant_calling_mode: 'fast' # If fast mode is selected, GATK base recalibration is skipped to speed up variant calling; otherwise, two rounds of base recalibration are applied.
  variant_filtering:
    indel: 'QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0' # The settings for indel filtering. To modify them, please check the GATK4 manual.
    ldcutoff: '0.3' # Cutoff for filtering SNVs in high LD
    ldwindowsize: '50' # Window size for filtering SNVs in high LD
    ldwindowstep: '10' # Window step for filtering SNVs in high LD
    scaffold_length_cutoff: '0' # Only analyse scaffolds longer than this cutoff
    scaffold_number_limit: '2' # The maximum number of scaffolds
    snp: 'QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0' # The settings for SNV filtering. To modify them, please check the GATK4 manual.
step4: # Parameter settings for step4
  discoal:
    hard_simulation_times: '100' # The number of simulations for generating simulated hard sweep data
    neut_simulation_times: '100' # The number of simulations for generating simulated neutral data
    soft_simulation_times: '100' # The number of simulations for generating simulated soft sweep data
  slidingwindow:
    gff: <HPGAP path>/Example/Input/clonorchis_sinensis.example.gff # The GFF file for the species
    scaffold_length_cutoff: '5000' # Only analyse scaffolds longer than this cutoff
    scaffold_number_limit: '10000' # The maximum number of scaffolds
    snpeff_species: Clonorchis_sinensis_henan # The species name in the SnpEff database
    windowsize: '5000' # Window size for scanning the whole genome in a sliding-window manner
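Because the pipeline reads this file with a YAML parser, a quick sanity check of the indentation before launching can save a failed run. A minimal sketch, assuming Python with the PyYAML package is available on the host and your file is named config.yml:
python -c "import yaml,sys; yaml.safe_load(open(sys.argv[1])); print('config OK')" config.yml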
00.INDEXING
├── [Reference genome ID].dict #Dictionary file of the genome
├── [Reference genome ID].fa #FASTA file of the genome
├── [Reference genome ID].fa.amb #text file recording the "N" in the reference genome
├── [Reference genome ID].fa.ann #text file recording reference sequences, name, length
├── [Reference genome ID].fa.bwt #binary file of the Burrows-Wheeler transformed sequence
├── [Reference genome ID].fa.fai #fasta index file
├── [Reference genome ID].fa.pac #binary file of packaged sequence
└── [Reference genome ID].fa.sa #binary file of suffix array index.
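For orientation only: HPGAP generates these indices automatically during step0_indexing. They correspond to the standard indices produced by bwa, samtools and GATK4 (assuming those tools, as installed in the container):
bwa index [Reference genome ID].fa # produces .amb, .ann, .bwt, .pac and .sa
samtools faidx [Reference genome ID].fa # produces .fai
gatk CreateSequenceDictionary -R [Reference genome ID].fa # produces .dict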
01.QualityControl
├── Combined
│ ├── high_confidence_prunned.vcf.gz #VCF file of SNPs with low LD
│ ├── high_confidence.vcf.gz #VCF file of the SNPs with complete information on all the samples
│ ├── PASS.SNP.DP.vcf.gz #VCF file of filtered SNPs with enough read depth
│ ├── singletons.list #List of SNP singletons
│ └── snv.summary.txt #Summary of variants
├── read_filtering
│ ├── [Sample ID]
│ │ ├── [Library ID + Sample ID]_1.filt.fq.gz # filtered read 1
│ │ └── [Library ID + Sample ID]_2.filt.fq.gz # filtered read 2
└── read_mapping.[reference genome ID]
  └── [Sample ID]
    ├── [Sample ID].sorted.markdup.bam #Duplicate-marked BAM file of mapped reads
    └── bam.stats.txt #Summary of the BAM file
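To get a quick feel for these outputs, you can count the SNPs retained at successive filtering stages (a minimal sketch, run from within 01.QualityControl; grep -vc '^#' simply counts non-header VCF lines):
zcat Combined/PASS.SNP.DP.vcf.gz | grep -vc '^#' # SNPs passing the depth filter
zcat Combined/high_confidence.vcf.gz | grep -vc '^#' # high-confidence SNPs
zcat Combined/high_confidence_prunned.vcf.gz | grep -vc '^#' # SNPs remaining after LD pruning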
03.GeneticRelationships
├── Admixture
│ ├── CV.error.txt # Cross-validation error for each K value
│ ├── high_confidence_prunned.vcf.gz # VCF file used for this analysis
│ └── K[K value].png #Admixture graph
└── Phylogeny
  ├── plink_data.fasta # FASTA file of concatenated SNPs
  └── result2.tre # The output phylogeny
05.IntraPopulation
├── LD #Linkage disequilibrium
│ ├── South.GLD.png # LD decay graph
│ ├── South.scf00001.GLD_window_1M.summary #Average LD at each distance
│ └── South.SNP.vcf.gz #VCF file of the SNPs used for this analysis
├── ROH #Runs of homozygosity
│ └── South.hom.indiv #Average length of homozygous regions of each individual
├── SFS
│ └── [Population name]
│   ├── EstimatedNe.list #Estimated population size of each run
│   ├── [nonsyn_sfs|syn_sfs|total_sfs]
│   │ └── fastsimcoal2
│   │   └── South_MSFS.obs #SNP frequency spectrum
│   ├── observedFVFiles
│   │ └── [Scaffold ID].[Population ID].SNP.preds #Predicted loci under natural selection
│   └── trainingSets
│     ├── hard.fvec #simulated data of hard sweep loci
│     ├── linkedHard.fvec #simulated data of regions linked to hard sweep loci
│     ├── linkedSoft.fvec #simulated data of regions linked to soft sweep loci
│     ├── neut.fvec #simulated data of neutral loci
│     └── soft.fvec #simulated data of soft sweep loci
└── Slidingwindow
  ├── final.[population ID].allstat.txt # Statistics of each sliding window
  ├── snpEff.nonsyn.vcf.gz #VCF file of nonsynonymous SNPs
  ├── snpEff.syn.vcf.gz #VCF file of synonymous SNPs
  ├── snpEff.vcf.gz #Functionally annotated VCF file
  ├── [population ID].snpEff.nonsyn.pi.list #Nucleotide diversity (pi) of nonsynonymous SNPs
  └── [population ID].snpEff.syn.pi.list #Nucleotide diversity (pi) of synonymous SNPs
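As a rough sanity check on the Slidingwindow outputs, the counts of nonsynonymous and synonymous SNVs can be compared directly from the annotated VCFs (a sketch; paths are relative to 05.IntraPopulation, and grep -vc '^#' just counts non-header VCF lines):
nonsyn=$(zcat Slidingwindow/snpEff.nonsyn.vcf.gz | grep -vc '^#')
syn=$(zcat Slidingwindow/snpEff.syn.vcf.gz | grep -vc '^#')
echo "nonsynonymous=$nonsyn synonymous=$syn ratio=$(echo "scale=3; $nonsyn/$syn" | bc)"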