A pipeline to automate DNA sequence aligning and quality control workflows via list-based batch submission and parallel processing
For the latest updates and to chat with our team, please join our Slack workspace: sequencehandling.slack.com.
For greater detail, usage information, and troubleshooting please see the
sequence_handling
wiki.
sequence_handling
is a series of scripts, called handlers, that automate and speed up DNA sequence alignment and quality control. Currently, sequence_handling
is designed to work with Illumina paired-end whole-genome or exome capture sequences. Some parts of sequence_handling
can accept GBS sequences.
The workflow is intended to be 100% reproducible provided that you have the Config file and version of sequence_handling
that was used. It is also intended to be easy for beginner UNIX users to configure and run independently, given that they read the wiki.
NOTE: This workflow is designed to use the Portable Batch System and run on the Minnesota Supercomputing Institute. Heavy modifications will need to be made if not using these systems.
Due to the pseudo-modularity of this workflow, dependencies for each individual handler are listed below. Some general dependencies for the workflow as a whole are also listed here:
- GNU Parallel
- A quality control mechanism, such as FastQC
- An adapter trimmer, such as Scythe
- A read mapper, such as The Burrows-Wheeler Aligner (BWA)
- SAM file processing utilities, such as SAMTools and/or Picard
- Tools for plotting coverage, such as R
- Tools for variant calling, such as the Genome Analysis Toolkit
- Tools for filtering and manipulating VCF files, such as VCFtools and vcflib
Please note that this is not a complete list of dependencies. Check the dependencies wiki page for dependencies for each handler.
A guide on how to install, set up, and run sequence_handling
for the first time can be found on the Getting Started page of the wiki.
A brief usage message can be viewed by passing no arguments to sequence_handling
:
./sequence_handling
To run sequence_handling
, use the following command, assuming you are in the sequence_handling
directory:
./sequence_handling <handler> Config
Where <handler>
is one of the handlers listed below and Config
is the full file path to the configuration file.
For any handler that utilizes PBS job arrays, there is an optional flag, -t custom_array_indices
that you can use to re-run specific job array indices that errored out or were aborted. The custom_array_indices
is a range of arrays and/or comma separated list of specific arrays to run WITHOUT spaces in between. Currently, the -t
flag must be provided as the 3rd argument on the command line (see example below). If left blank (you do not use the -t
flag), the DEFAULT runs all samples in your sample list. This is helpful if only some of your jobs arrays fail and you need to re-run only those.
Here is an example using the -t flag and SAM_Processing handler:
./sequence_handling SAM_Processing /path/to/config -t 1-5,10,12
To start, run Quality_Assessment on your raw FastQ files. The Quality_Assessment handler runs FastQC on a series of samples and outputs metrics used for quality control. It accepts FASTQ, SAM, and BAM files as input and outputs a summary table and individual HTML files for visualization. The Quality_Assessment handler depends on FastQC and GNU Parallel.
The Adapter_Trimming handler uses Scythe to trim specific adapter sequences from FastQ files. This handler differentiates between forward, reverse, and single-end FastQ files automatically. The Adapter_Trimming handler depends on Scythe and GNU Parallel.
After Adapter_Trimming, it is recommended to run Quality_Assessment again on the trimmed FastQ files to ensure that all adapter contamination was properly removed.
3. Read_Mapping
The Read_Mapping handler maps sequence reads to a reference genome using BWA-MEM. This handler uses Torque Task Arrays, part of the Portable Batch System. The Read_Mapping handler depends on the Burrows-Wheeler Aligner.
The SAM_Processing handler converts the SAM files from read mapping with BWA to the BAM format using SAMTools. In the conversion process, it will sort and deduplicate the data for the finished BAM file, also using SAMTools. Alignment statistics will also be generated for both raw and finished BAM files. The SAM_Processing handler depends on SAMTools and GNU Parallel.
The Coverage_Mapping handler generates coverage histograms and summary statistics from BAM files using BEDTools. Plots of coverage are generated using R based on coverage maps. The Coverage_Mapping handler depends on BEDTools, R, and GNU Parallel.
To begin the variant discovery process from your finished BAM files, the Haplotype_Caller handler uses GATK to generate genomic VCF files for each sample.
Import GVCF files output from Haplotype_Caller into a GenomicsDB workspace. Genomics_DB_Import in GATK 4 merges GVCF files from multiple samples before joint genotyping (Note: it has the same functionality as CombineGVCFs in previous versions of GATK). Because this step pools together all of your samples into one file, it is essential that all samples are included for this step. Automatically breaking the process into chromosome parts or smaller regions for each chromosome allows the job to be "parallelized across regions" by running as a task array and speeds up computing time.
The Genotype_GVCFs hander converts the GVCF files for the entire dataset into VCF files broken up by chromosome or chromosome part using GATK. Breaking the output into chromosome parts allows the process to be split into a task array and greatly speeds up processing time.
The Create_HC_Subset handler creates a single VCF file that contains only the high-confidence sites for your samples. This filtering is performed in multiple steps using several different user-defined parameters and before-and-after percentile tables are generated. Create_HC_Subset depends on VCFtools and vcflib for manipulating the VCF file.
The Variant_Recalibrator handler uses the GATK and user-provided prior sets of "truth" variants to create a model that attempts to separate true variants from false positives. An unfiltered VCF file the the FILTER field annotated is generated.
The Variant_Filtering handler creates a single variant call format (VCF) file that contains only high-quality sites and genotypes for your samples. This filtering is performed in multiple steps using several different user-defined parameters and before-and-after percentile tables are generated. Variant_Filtering depends on VCFtools and vcflib for manipulating the VCF file.
12. Variant_Analysis
The Variant_Analysis handler uses a variety of dependencies to produce statistics about the input VCF file. Information generated by the handler includes heterozygosity summaries, missing-ness summaries, a minor allele frequency histogram, the Ts/Tv ratio, and the raw count of SNPs. Additional information is output for barley samples. Variant_Analysis depends on VCFtools, vcflib, molpopgen, Python3, GNU Parallel, BCFtools, R, TeX Live, and the Enthought Python Distribution.
Please check the Common Problems and Errors page on the wiki for help and visit the wiki page for the handler(s) you're using. If you are still having difficulties, please submit an issue through the Git issues page for this repository.
The following handlers are planned for future versions of sequence_handling
.
The GBS_Demultiplexing handler will demulitplex raw GBS reads into split FastQ files using the FASTX-Toolkit. Code for GBS_Demultiplexing is already present in sequence_handling
for the purposes of collaborative alpha testing, but should not be used since it's extremely buggy.
If you feel like there are tools or alternative processing techniques missing from sequence_handling
, please submit a feature request through the Git issues page.
sequence_handling
can be cited like:
Version 3:
Chaochih Liu, Paul J. Hoffman, Skylar R. Wyant, Emily Dittmar, Naoki Takebayashi, Samuel Hamann, Li Lei, Peter L. Morrell. MorrellLAB/sequence_handling: Release v3.0: SNP calling with GATK 4.1 and Slurm compatibility. Zenodo. https://doi.org/10.5281/zenodo.6608661
Version 2:
Paul J. Hoffman, Skylar R. Wyant, Thomas J.Y. Kono, & Peter L. Morrell. (2018, June 1). MorrellLAB/sequence_handling: Release v2.0: SNP calling with GATK 3.8 (Version v2.0). Zenodo. http://doi.org/10.5281/zenodo.1257692
Version 1:
Hoffman PJ, Wyant SR, Kono TJY, Morrell PL. (2018). sequence_handling: A pipeline to automate sequence workflows. Zenodo. https://doi.org/10.5281/zenodo.1257692