This session is meant to guide the user through alignment of sequence reads to a reference followed by reference-based genotyping with GATK. Why align to a reference? Shafer et al. (2016) compared de novo and reference-based pipelines for RADSeq data and found that using reference-based approaches with a closely-related genome led to summary statistics closest to the expected values.
Why GATK? Many folks choose to follow the GATK “Best Practices Pipeline”, for many reasons: 1) GATK is freely available and widely used, 2) GATK is under continuous improvement by a team of bioinformaticians at MIT’s Broad Institute, and 3) GATK is well-documented and has a moderated Q&A forum.
Caveats of GATK? Like many software, there's a learning curve. GATK is not designed with non-model organisms in mind, so not all tools function as intended, and you have to make sure you're using it appropriately for your specific study. If you have very low sequencing coverage, another genotyper might be preferable.
By the end of this session, you should know how to do the following:
- adequately trim and filter raw sequence reads prior to alignment
- prepare a reference genome for the pipeline (e.g. by indexing)
- align sequence reads to a reference genome using BWA
- call genotypes using the GATK pipeline
- decide on appropriate hard filtering of genotypes
As you'll hear throughout ConGen, there is no single pipeline that works for everyone. I put together this particular set of commands after thinking carefully about my particular data set and downstream analysis plans, and after reading carefully through other resources. If someone has shared their scripts with you, make sure those scripts work for your data! There are tons of available sources, such as the GATK Best Practice workflows and various blogs (example pipeline for non-model organism), as well as other workflows on github.
Log on to the R Server (see instructions here). Starting from our home directory, let's make a new folder for our genotyping pipeline then change to it. Substitute in your own user ID number, e.g. userN.data
cd ~
mkdir user10.data/genotyping_pipeline
cd user10.data/genotyping_pipeline
When I start a project, I like to make a series of directories to organize my data at various stages. This may not apply to us today, but I might also make directories to hold all of my scripts for the project, and other directories to hold bed files or analytical results.
mkdir reference
mkdir raw_fastq
mkdir pass_qc_fastq
mkdir bam
mkdir bqsr
mkdir raw_genotypes_HC
mkdir called_genotypes_HC
mkdir filtered_genotypes_HC
mkdir quality_metrics
Today, we're going to be mapping a subset of the deer mouse exome (the same reads we used for last week's Intro to Command Line session). I have downsampled the data to contain 250,000 paired-end reads each for two individuals (S137 and S144, from a low-altitude and high-altitude population, respectively). Let's copy these data and other reference files to your genotyping_pipeline directory in RStudio.
cp ~/instructor_materials/Rena_Schweizer/2023/data_for_gatk_handson/* .
Let's see what's in the directory now.
ls
Output:
The NW_006501067.1.fa file is your reference genome in fasta format. The fastq.gz files are the raw sequence reads. One of the vcf files will be used as a set of "known" variants for the GATK Base Quality Score Recalibration tool, and the other vcf file will be used at the end of the pipeline to practice filtering variants.
Let's finish geting organized by moving our files to our directories.
mv S1* raw_fastq
mv pman_n10_known* reference
mv NW_006501067.1.fa reference
mv pman_n10_congen* raw_genotypes_HC
We already ran FASTQC on two of these files during our previous session, so we know that overall the quality is high. If you'll remember, we had some small amount of adapter contamination in S144, which we can remove now. To be safe, we can also remove any adapter contamination from the other sample, S137. We're going to be defining variables in the shell then using commands with variables.
Remember we can use variable=value
to define a new shell variable, to which we can then refer by using ${variable}
in our command.
I'll show you how you can run these commands for two samples, but think about how you might be able to put these commands into a script and iterate the analysis over hundreds to thousands of samples.
Note: the \ tells the command line interpreter that your command is continuing on the next line. Writing commands this way allows us to display them more clearly.
You can learn more about trim_galore here: https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/
LANE_NUM=6
for SAMPLE_ID in S137 S144
do trim_galore --paired raw_fastq/${SAMPLE_ID}_L00${LANE_NUM}_R1_sub.fastq.gz \
raw_fastq/${SAMPLE_ID}_L00${LANE_NUM}_R2_sub.fastq.gz \
--phred33 \
--adapter AGATCGGAAGAGC \
--quality 20 \
--output_dir pass_qc_fastq/
done
trim_galore outputs to the terminal some useful information on the parameters and the number and sizes of sequences that are trimmed or removed.
Task: What files does trim_galore create?
Task: What percent of reads were retained for R1 and R2 in S144?
If we want, we can run these trimmed fastq files back through FASTQC to check that the quality is higher and the adapter sequences have been removed.
In our reference
directory, we have downloaded the reference genome from NCBI. This is the v1 of the genome for Peromyscus maniculatus and is in multiple scaffolds. We'll only be working with one scaffold (NW_006501067.1) today so that everything runs more quickly. You also have access to a set of 'known' variants for our gatk BQSR analysis later on (pman_n10_known_sites_NW_006501067.1.recode.vcf). Let's change to our reference directory and create the necessary indices that bwa, samtools, and gatk use for quicker access to the genome. This only needs to be done once per project.
cd reference
Task: How long is scaffold NW_006501067.1? How many "known" SNPs do we have in our vcf file?
We can use grep, as follows:
grep -v ">" NW_006501067.1.fa | wc
grep -v "#" pman_n10_known_sites_NW_006501067.1.recode.vcf | wc -l
Our NW_006501067.1 scaffold is 11362355 bp, and we have 57663 characterized SNPs on this scaffold.
bwa index NW_006501067.1.fa
samtools faidx NW_006501067.1.fa
gatk CreateSequenceDictionary -R NW_006501067.1.fa
gatk IndexFeatureFile -I pman_n10_known_sites_NW_006501067.1.recode.vcf
You can learn more about bwa here: http://bio-bwa.sourceforge.net/bwa.shtml or in Li and Durbin 2010.
samtools
is another utility tool that is designed to handle SAM and BAM format files: http://www.htslib.org/doc/samtools.html
If we check the contents of our directory, we will now see multiple new index files.
Let's return to our working directory and map our reads to the reference NW_006501067.1.
cd ../
Today we're going to be using bwa mem
to map, but there are lots of other options out there. Pick the software that is most appropriate for your specific project!
We can set additional variables for the reference genome path, as well as for the number of threads to tell bwa
to use. We will only use two threads today, but for your own work you should figure out the available resources for multi-threading.
REF=`(pwd)`/reference/NW_006501067.1.fa
THREADS=2
Note: If you specify a COMMAND within parenthesis and the `, the shell will interpret and run that COMMAND before it does the rest of whatever the command specifies. So, above, the "(pwd)
" tells unix to print the working directory and thus provides the complete path for the reference genome file.
In the command below, we are telling bwa mem
to map each of our filtered R1 and R2 to our reference contig using 2 threads, and to redirect the output to a SAM file. The -M
flag is necessary for compatibility with another program we'll use. Let's run our commands now.
for SAMPLE_ID in S137 S144
do echo "mapping $SAMPLE_ID"; bwa mem -M -t ${THREADS} \
${REF} \
pass_qc_fastq/${SAMPLE_ID}_L00${LANE_NUM}_R1_sub_val_1.fq.gz \
pass_qc_fastq/${SAMPLE_ID}_L00${LANE_NUM}_R2_sub_val_2.fq.gz \
> bam/${SAMPLE_ID}_L00${LANE_NUM}_aln-pe.sam
done
This will take a couple of minutes to run. Take a quick stretch break if you need it, then, familiarize yourself with the SAM and BAM formats here: https://samtools.github.io/hts-specs/SAMv1.pdf
At the end of this step, we should have a SAM file of aligned, paired end reads for both samples. We can take a look at one of the files using samtools view
.
samtools view -h bam/S144_L006_aln-pe.sam | less
Note that when running your own workflow, you may opt to pipe your output to a .bam file to save the space and time of generating an intermediate file.
Once we are satisfied with the mapping, we will convert the SAM files to the compressed BAM format, and then order the BAM by coordinates for downstream analysis.
We run the command below to convert SAM to BAM
for SAMPLE_ID in S137 S144
do echo "converting $SAMPLE_ID"; samtools view -b -h -S bam/${SAMPLE_ID}_L00${LANE_NUM}_aln-pe.sam \
> bam/${SAMPLE_ID}_L00${LANE_NUM}_aln-pe.bam
done
Task: While that's running, let's look at what the b, h, and S flags indicate.
Task: We know the compressed format saves space, but how much less space does the BAM file take up?
We can use our handy ls -lh
to tell us.
ls -lht bam/
From this we learn that the .sam files are 134 MB each, while the .bam files are only 32 MB. Imagine how much space this might save over an entire sequencing project!
Note: You can streamline some of the previous steps to avoid generating intermediate files, but I chose to keep the individual steps for the purposes of showing a fully detailed workflow.
To remove PCR duplicates in the next step, we need to make sure the mate pair information and insert sizes are correct in our BAM using samtools fixmate
(sometimes previous filtering steps may remove one, but not both, paired reads). Software such as GATK requires the BAM file to be sorted by coordinates. Many programs that generate BAM files will automatically sort the output file, but it's good practice to double check this. We'll sort our files using samtools sort
.
for SAMPLE_ID in S137 S144
do echo "fixing $SAMPLE_ID"; samtools fixmate -m bam/${SAMPLE_ID}_L00${LANE_NUM}_aln-pe.bam bam/${SAMPLE_ID}_L00${LANE_NUM}_aln-pe.fix.bam
samtools sort -@ ${THREADS} -o bam/${SAMPLE_ID}_L00${LANE_NUM}_aln-pe.s.bam bam/${SAMPLE_ID}_L00${LANE_NUM}_aln-pe.fix.bam
done
As we did with FASTQC for our raw fastq reads, we can similarly check the quality of our mapped reads using Qualimap: http://qualimap.conesalab.org/. Before we run our command, we will have to change the default display of RStudio so that it outputs an .html file. This is not something you should have to do on your own computer if you run these commands in the future!
unset DISPLAY
qualimap bamqc -nt ${THREADS} -bam bam/${SAMPLE_ID}_L00${LANE_NUM}_aln-pe.s.bam -outdir quality_metrics
In the output, we see a striking statistic that only 7% of our reads aligned to this reference! Remember that these sequence reads represent an entire exome array, but we have only mapped the reads to a small fraction of the entire genome!
Finally, we will remove PCR duplicates using samtools markdup
and the -r
flag.
for SAMPLE_ID in S137 S144
do echo "removing dups from $SAMPLE_ID"; samtools markdup -r bam/${SAMPLE_ID}_L00${LANE_NUM}_aln-pe.s.bam \
bam/${SAMPLE_ID}_L00${LANE_NUM}_aln-pe.s.u.bam
done
Task: As a sanity check, let's look at the sizes of the files we have made in the last few steps. Which file do we expect to be the smallest?
BQSR relies on accurate read group information in order to discern technical issues related to quality scores in your sequence data. We can use shell variables to put together the different types of metadata, then add those to our BAM file.
We will need to define the following:
- RGID=String Read Group ID
- RGLB=String Read Group Library
- RGPL=String Read Group platform (e.g. illumina, solid)
- RGPU=String Read Group platform unit (eg. run barcode of run Id + lane number)
- RGSM=String Read Group sample name
Now that we know the specifications that BQSR needs, we can add those to our BAM files using picard
as implemented in gatk.
for SAMPLE_ID in S137 S144
do RUN_ID=170503_100PE_HS4KA
RGID_specify=${RUN_ID}_${LANE_NUM}
RGLB_specify=${SAMPLE_ID}
RGPL_specify=ILLUMINA
RGPU_specify=${RUN_ID}_${LANE_NUM}
RGSM_specify=${SAMPLE_ID}
gatk AddOrReplaceReadGroups --INPUT bam/${SAMPLE_ID}_L00${LANE_NUM}_aln-pe.s.u.bam \
--OUTPUT bam/${SAMPLE_ID}_L00${LANE_NUM}_aln-pe_addRG.bam \
--SORT_ORDER coordinate \
--RGID ${RGID_specify} \
--RGLB ${RGLB_specify} \
--RGPL ${RGPL_specify} \
--RGPU ${RGPU_specify} \
--RGSM ${RGSM_specify} \
--CREATE_INDEX True \
--VALIDATION_STRINGENCY LENIENT
done
We can check that our command worked by comparing two BAM files for S144.
Before:
samtools view -H bam/${SAMPLE_ID}_L00${LANE_NUM}_aln-pe.s.u.bam
After:
samtools view -H bam/${SAMPLE_ID}_L00${LANE_NUM}_aln-pe_addRG.bam
Task: What is the difference between these two files?
This is the first pass of the base quality score recalibration. The BaseRecalibrator
generates a recalibration table based on various covariates. The default covariates are read group, reported quality score, machine cycle, and nucleotide context. BQSR is most accurate at identifying technical artifacts when it is run with as many of your sequenced individuals as possible.
gatk BaseRecalibrator \
-I bam/S137_L00${LANE_NUM}_aln-pe_addRG.bam \
-I bam/S144_L00${LANE_NUM}_aln-pe_addRG.bam \
-R reference/NW_006501067.1.fa \
--known-sites reference/pman_n10_known_sites_NW_006501067.1.recode.vcf \
-O bqsr/pman_recal1_data.table
This generates a recalibration text file with information on things such as the quality scores of each read group for mismatches, insertions and deltions, or empirical qualities for each covariate used in the dataset (e.g. context and cycle). Again, you can read more about the format of this table on the gatk website: https://gatk.broadinstitute.org/hc/en-us/articles/360035890531, and the tool documentation here.
Next, we apply this recalibration info to our original BAM file to make a new BAM with recalibrated quality scores.
gatk ApplyBQSR \
-R reference/NW_006501067.1.fa \
-I bam/S137_L00${LANE_NUM}_aln-pe_addRG.bam \
-I bam/S144_L00${LANE_NUM}_aln-pe_addRG.bam \
--bqsr-recal-file bqsr/pman_recal1_data.table \
-O bqsr/pman_recal1_output.bam
We'll do the above steps one more time to generate a count of covariates on our recalibrated data, then we will compare our before and after BQSR results.
gatk BaseRecalibrator \
-I bqsr/pman_recal1_output.bam \
-R reference/NW_006501067.1.fa \
--known-sites reference/pman_n10_known_sites_NW_006501067.1.recode.vcf \
-O bqsr/pman_recal2_data.table
gatk ApplyBQSR \
-R reference/NW_006501067.1.fa \
-I bam/S137_L00${LANE_NUM}_aln-pe_addRG.bam \
-I bam/S144_L00${LANE_NUM}_aln-pe_addRG.bam \
--bqsr-recal-file bqsr/pman_recal2_data.table \
-O bqsr/pman_recal2_output.bam
We can use a gatk tool called AnalyzeCovariates
to see how the first round of BQSR affected our data. We can look at some example plots on the gatk website to get an idea for what our before and after plots should look like: https://gatk.broadinstitute.org/hc/en-us/articles/360035890531
gatk AnalyzeCovariates \
-before bqsr/pman_recal1_data.table \
-after bqsr/pman_recal2_data.table \
-plots bqsr/AnalyzeCovariates_round1.pdf
We may choose to do additional rounds of BQSR until we are satisfied with the patterns we observe in the covariate 'before' and 'after' plots. At the end of Step 5 we should have analysis-ready BAM files.
From the GATK website: "The HaplotypeCaller calls SNPs and indels simultaneously via local de-novo assembly of haplotypes in an active region. HaplotypeCaller runs per-sample to generate an intermediate GVCF (not to be used in final analysis), which can then be used in GenotypeGVCFs for joint genotyping of multiple samples in a very efficient way."
for SAMPLE_ID in S137 S144
do echo "running ${SAMPLE_ID}"; gatk HaplotypeCaller \
-I bqsr/pman_recal2_output.bam \
--sample-name ${SAMPLE_ID} \
-R reference/NW_006501067.1.fa \
--emit-ref-confidence GVCF \
-O raw_genotypes_HC/${SAMPLE_ID}.raw.snps.indels.g.vcf
done
Don't be alarmed if you get some warnings output to the screen. There was a known bug with this version of GATK that has since then been fixed! This will take a few miutes to run.
We should now have a GVCF file for each individual. We can take a look at the files, but don't be alarmed if they say there are no genotypes yet! We'll do that soon. The GVCF mode and GVCF files can be read about in more detail on the documentation page for HaplotypeCaller.
This step consists of consolidating the contents of GVCF files across all of your samples in order to improve scalability and speed the next step, joint genotyping. We will use the GenomicsDBImport
tool and will need to make a new temporary directory.
mkdir called_genotypes_HC/temp
gatk GenomicsDBImport \
--variant raw_genotypes_HC/S137.raw.snps.indels.g.vcf \
--variant raw_genotypes_HC/S144.raw.snps.indels.g.vcf \
--genomicsdb-workspace-path called_genotypes_HC/pman_database \
-L NW_006501067.1
We're finally ready to call and output joint genotypes for our samples using GATK GenotypeGVCFs. I like to use the '--include-non-variant-sites' to output all sites with called data. This is an important step to consider in your own workflow as it will generate a larger data file but may provide useful information for downstream analyses.
gatk GenotypeGVCFs \
-R reference/NW_006501067.1.fa \
-V gendb://called_genotypes_HC/pman_database \
--include-non-variant-sites \
-O raw_genotypes_HC/pman_HC_genotypes_unfiltered.vcf.gz
At this point in our analysis, we should have a set of all called positions on the NW_006501067.1 scaffold of the deer mouse genome. If we look at the file more closely, though, we see some positions that were only called in one of the samples, or others that seem to have a pretty low depth of coverage. In the next step, let's do some "hard" filtering of positions. A shortcoming of working with downsampled data is that there weren't enough variable sites in our output from GenotypeGVCFs
, so we'll be working with a different vcf file for the filtering.
Filtering a set of SNPs is a delicate balance between too stringent and too lenient; you should decide what works best for your type of data and your analytical goals. For a discussion of a few aspects of filtering that you could consider, see the section "Navigating the Perils of Data Filtering" in a previous ConGen review (Schweizer et al. 2021), or stay tuned for discussions later this week led by Will Hemstrom and others (e.g., Hemstrom, Grummer, et al. 2024 Nature Reviews Genetics).
For this part of the pipeline, we're going to use a different vcf file that has additional individuals and more variable site. At the beginning of this session, we moved this file and its index, to our raw_genotypes_HC directory.
We will do some fairly simple filtering, but enough so that you have an idea of the tools and metrics you can use to filter your own data.
First, let's take a look at our vcf file and notice a few things.
zcat raw_genotypes_HC/pman_n10_congenSubset.vcf.gz | grep -v "#" | less
- The first thing to notice is that this file has both SNPs and indels. A benefit of the GATK HaplotypeCaller algorithm is that it can determine SNPs and indels simultaneously as it does local realignment of the sequences. However, for our purposes, we do not need indels, so we'll filter them out in the next step.
- The HaplotypeCaller also has a number of position-specific and individual-specific default annotations that it performs, such as the strand bias (uneven amplifications of DNA strands), the allele depth (number of reads mapping to each allele), depth of coverage, genotype quality, etc. You can find out what these annotations are by looking in the header or on the GATK website: https://gatk.broadinstitute.org/hc/en-us/articles/360047216111--Tool-Documentation-Index#VariantAnnotations
There are a few options for filtering vcf files. gatk has two tools called VariantFiltration
which filters variant calls based on INFO and/or FORMAT annotations and SelectVariants
which selects a subset of variants from a VCF file. We can also use vcftools, which we learned about during the bioinformatics session.
Now, we'll select only the SNPs from the file.
gatk SelectVariants \
-R reference/NW_006501067.1.fa \
-O filtered_genotypes_HC/pman_n10_congenSubset_raw_SNPs.vcf \
-V raw_genotypes_HC/pman_n10_congenSubset.vcf.gz \
--select-type-to-include SNP
Using vcftools, we can assess aspects of our data such as the mean depth of coverage per individual or per site, or the number of missing sites per individual or number of individuals missing from each site. Once we have an idea of what our data look like, we can decide what are appropriate filters.
for flag in missing-indv missing-site depth site-depth
do vcftools --vcf filtered_genotypes_HC/pman_n10_congenSubset_raw_SNPs.vcf \
--${flag} \
--out quality_metrics/pman_n10_congenSubset_raw_SNPs
done
The above for loop generates some summary statistics on missingness and depth of coverage across our samples and our SNPs.
For example, we might choose to filter out sites below a certain genotype quality (--minGQ
), below a minimum depth of coverage (minDP
), retain sites that are bi-allelic, and those sites that are called in 75% of our individuals (max-missing
).
vcftools --vcf filtered_genotypes_HC/pman_n10_congenSubset_raw_SNPs.vcf \
--minDP 10 \
--minGQ 20 \
--max-missing 0.75 \
--min-alleles 2 \
--max-alleles 2 \
--recode \
--out filtered_genotypes_HC/pman_n10_congenSubset_minDP10_minGQ20_biAllelic_maxMissing75
You might check your filters by examining genotype concordance (if you have a set of known variants, e.g. from a SNP array) with vcftools --diff-discordance-matrix
or by measuring the transition/transversion ratio with vcftools --TsTv
The Broad's GATK website is an excellent resource, with very helpful tutorials, detailed manuals, and a forum for asking questions. https://gatk.broadinstitute.org/hc/en-us
Hendricks S, Anderson EC, Antão T, Bernatchez L, Forester BR, Garner B, Hand BK, Hohenlohe PA, Kardos M, Koop B, et al. 2018. Recent advances in conservation and population genomics data analysis. Evolutionary Applications 11:1197–1211.
Li H, Durbin R. 2010. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26:589.
Schweizer RM, Saarman N, Ramstad KM, Forester BR, Kelley JL, Hand BK, Malison RL, Ackiss AS, Watsa M, Nelson TC, et al. 2021. Big Data in Conservation Genomics: Boosting Skills, Hedging Bets, and Staying Current in the Field. Journal of Heredity 112:313–327.
Shafer ABA, Peart CR, Tusso S, Maayan I, Brelsford A, Wheat CW, Wolf JBW. 2016. Bioinformatic processing of RAD-seq data dramatically impacts downstream population genetic inference.Gilbert M, editor. Methods in Ecology and Evolution 8:907–917.