All information obtained and/or inferred with this software is provided without any implied warranty of fitness for any purpose or use whatsoever.
Use at your own risk. I cannot provide support.
Know that everybody has written some version of this pipeline. Here's mine. I'm working on a similar workflow that will use Snakemake to scatter/gather GATK4 HaplotypeCaller.
This Snakemake workflow takes paired-end short reads (fastq format) and produces raw SNP, indel, and short-haplotype calls in VCF format.
Note that raw variants should not be the final product of your analysis. You can and should apply post-hoc quality filters to the raw variants according to your needs.
- Build DM1-3 v4.04 reference genome, including DM1-3 chloroplast and mitochondrion sequences
- Download reads from NCBI SRA (sra-tools 2.8.2)
- Trim reads (cutadapt 1.15)
- Align to reference (bwa mem 0.7.12-r1039)
- Remove PCR duplicates (Picard 2.18.27-0)
- Filter out read pairs with mates on different chromosomes
- Soft clip one mate of an overlapping mate pair (bamUtil 1.0.14)
- Merge reads by biological sample (samtools 1.9)
- Define indel realign targets (GATK 3.8 RealignerTargetCreator)
- Realign indels (GATK 3.8 IndelRealigner)
- Call variants (freebayes 1.2)
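The alignment step above runs once per sequencing unit, tagging each unit with a read group so that units from the same biological sample can be merged later. A minimal sketch of how such a read-group string is built (the sample and unit names below are hypothetical; the actual commands live in the Snakemake rules):

```shell
# Hypothetical names: in the workflow, sample and unit come from a units.tsv row.
sample=Atlantic
unit=SRR123456

# bwa mem -R expects literal "\t" separators in the read-group string.
# ID identifies the unit; SM ties it back to the biological sample,
# which is what samtools merge and GATK use to group reads.
rg=$(printf '@RG\\tID:%s\\tSM:%s' "$unit" "$sample")
echo "$rg"

# The alignment itself would then look roughly like:
# bwa mem -R "$rg" ref.fa reads_1.fastq.gz reads_2.fastq.gz | samtools sort -o "$unit".sorted.bam
```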
- Clone this repository
git clone https://github.com/kramundson/fastq_2_vcf
- Install miniconda3
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
# review license
# accept license
# accept or change home location
# yes to placing it in your path
source $HOME/.bashrc
# change location of envs_dirs
# default is home directory but conda crashes if home folder is not writable
conda config --add envs_dirs ./.conda/envs
# change location of pkgs_dirs
# default is home directory but conda crashes if home folder is not writable
conda config --add pkgs_dirs ./.conda/pkgs
- Install dependencies using the included file environment.yaml
# TODO: include gatk into environment
# conda install -c bioconda gatk # then this becomes unnecessary
conda env create --name fastq2vcf -f environment.yaml
# follow prompts to finish environment build
# activate environment
source activate fastq2vcf
- Install GATK 3.8
On its own, the conda recipe for GATK3 will not let you run GATK3. Per the conda instructions, you must download a licensed version of GATK from the Broad Institute to run gatk with conda.
For this workflow, download GATK 3.8-1-0-gf15c1c3ef from https://software.broadinstitute.org/gatk/download/archive to this folder.
Then, register the downloaded tarball with your conda environment:
# to add gatk executable to path via conda
gatk3-register GenomeAnalysisTK-3.8-1-0-gf15c1c3ef.tar.bz2
Here, I've specified the workflow to run on a few samples that are publicly available in NCBI SRA. To keep run time low, we'll only use a very small subset of reads from each sample. I've included these subsets in this repo, under the subdirectory data/reads
Each sample is WGS of a different potato cultivar. To illustrate how different sequencing libraries from the same cultivars are handled, multiple libraries of cultivars Atlantic and Superior are included here.
snakemake -s 1_init_files.snakes --cores 8 > test_init.out 2> test_init.err
snakemake -s 2_fastq_to_vcf.snakes --cores 8 > test_fq2vcf.out 2> test_fq2vcf.err
Here, we'll walk through an example of how to add new samples. In this case, we're going to add more reads from NCBI SRA: SRR2069932, which is reference genotype DM1-3.
# make a backup copy of units.tsv just in case
cp units.tsv units.tsv.bak
# add sample info for DM1-3
printf "2x_DM1_3\tSRR2069932\tSRR2069932_1.fastq.gz\tSRR2069932_2.fastq.gz\n" >> units.tsv
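As a quick sanity check after editing, you can verify that every row of the units file has exactly four tab-separated fields. A sketch (the file name units.example.tsv is a stand-in for your real units.tsv):

```shell
# Build a one-row example units file.
printf '2x_DM1_3\tSRR2069932\tSRR2069932_1.fastq.gz\tSRR2069932_2.fastq.gz\n' > units.example.tsv

# awk exits nonzero if any row does not have exactly 4 tab-separated fields.
awk -F'\t' 'NF != 4 {bad=1} END {exit bad}' units.example.tsv && echo "units OK"
```

Run the awk line against your real units.tsv before launching the workflow; a malformed row here is a common cause of confusing Snakemake errors later.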
We also need to add the per-sample ploidy to the file freebayes-cnv-map.bed. Here, I made a backup of the current CNV map, substituted the sample name, and appended the modified block to the existing CNV map.
# make a backup copy of freebayes-cnv-map.bed just in case
cp freebayes-cnv-map.bed freebayes-cnv-map.bed.bak
# Get the rows of a diploid sample, substitute sample name IVP101 with DM1_3, and append to file
head -n 14 freebayes-cnv-map.bed.bak | sed -e 's/IVP101/DM1_3/g' >> freebayes-cnv-map.bed
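To see what that substitution does, here is a sketch on a single CNV-map line. The chromosome name, coordinates, and ploidy below are made up for illustration; the real lines come from freebayes-cnv-map.bed.bak:

```shell
# One hypothetical region line: chrom, start, end, sample, ploidy
line=$(printf 'chr01\t0\t88663952\tIVP101\t2')

# sed rewrites only the sample name; coordinates and ploidy are untouched.
echo "$line" | sed -e 's/IVP101/DM1_3/g'
```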
Download a very tiny subset of SRR2069932 reads (spots 1,000,000 through 1,001,000; -N and -X set the minimum and maximum spot IDs) to the subdirectory data/reads/
fastq-dump --gzip -B -N 1000000 -X 1001000 --split-files -O data/reads SRR2069932
Then, rerun the test workflow:
snakemake -s 2_fastq_to_vcf.snakes --cores 8 > test2_fq2vcf.out 2> test2_fq2vcf.err
The workflow should run without errors and produce the final output file data/calls/all-calls.vcf.
To add your own reads that aren't on SRA, put the raw reads (uninterleaved fastq) in the folder data/reads/, add your sample to freebayes-cnv-map.bed, and fill out a new row in units.tsv for each set of paired end reads.
The columns should be filled out as follows:
- sample: the biological sample (will be merged with other reads originating from the same biological sample)
- unit: unique combination of biological sample, sequencing library, flow cell, and lane. Each set of paired end reads should have a unique identifier.
- fq1: name of forward reads file
- fq2: name of reverse reads file
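For example, two libraries of the same cultivar share a sample value but get distinct unit values, so their alignments are merged downstream. The file name units.demo.tsv and the Atlantic library names here are hypothetical:

```shell
# Two units of the same biological sample: same "sample", different "unit".
printf 'sample\tunit\tfq1\tfq2\n' > units.demo.tsv
printf 'Atlantic\tAtlantic_lib1\tAtlantic_lib1_1.fastq.gz\tAtlantic_lib1_2.fastq.gz\n' >> units.demo.tsv
printf 'Atlantic\tAtlantic_lib2\tAtlantic_lib2_1.fastq.gz\tAtlantic_lib2_2.fastq.gz\n' >> units.demo.tsv

# Count distinct samples and distinct units (skipping the header row).
awk -F'\t' 'NR > 1 {s[$1]; u[$2]} END {ns = 0; for (k in s) ns++; nu = 0; for (k in u) nu++; print ns, nu}' units.demo.tsv
```

Here the count is 1 sample and 2 units: both fastq pairs are trimmed, aligned, and deduplicated independently, then merged into one Atlantic BAM before variant calling.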
- Make a more realistic test case.
- Workflow for GATK4 HaplotypeCaller