/sequence_handling

A series of scripts to automate sequence workflows

Primary LanguageJupyter Notebook

sequence_handling

DOI

A pipeline to automate DNA sequence aligning and quality control workflows via list-based batch submission and parallel processing

For the latest updates and to chat with our team, please join our Slack workspace: sequencehandling.slack.com.


For greater detail, usage information, and troubleshooting please see the sequence_handling wiki.

What is sequence_handling for?

sequence_handling is a series of scripts, called handlers, that automate and speed up DNA sequence alignment and quality control. Currently, sequence_handling is designed to work with Illumina paired-end whole-genome or exome capture sequences. Some parts of sequence_handling can accept GBS sequences.

The workflow is intended to be 100% reproducible provided that you have the Config file and version of sequence_handling that was used. It is also intended to be easy for beginner UNIX users to configure and run independently, given that they read the wiki.

NOTE: This workflow is designed to use the Portable Batch System and run on the Minnesota Supercomputing Institute. Heavy modifications will need to be made if not using these systems.

Dependencies

Due to the pseudo-modularity of this workflow, dependencies for each individual handler are listed below. Some general dependencies for the workflow as a whole are also listed here:

Please note that this is not a complete list of dependencies. Check the dependencies wiki page for dependencies for each handler.

Basic Usage

A guide on how to install, set up, and run sequence_handling for the first time can be found on the Getting Started page of the wiki.

A brief usage message can be viewed by passing no arguments to sequence_handling:

./sequence_handling

To run sequence_handling, use the following command, assuming you are in the sequence_handling directory:

./sequence_handling <handler> Config

Where <handler> is one of the handlers listed below and Config is the full file path to the configuration file.

For any handler that utilizes PBS job arrays, there is an optional flag, -t custom_array_indices that you can use to re-run specific job array indices that errored out or were aborted. The custom_array_indices is a range of arrays and/or comma separated list of specific arrays to run WITHOUT spaces in between. Currently, the -t flag must be provided as the 3rd argument on the command line (see example below). If left blank (you do not use the -t flag), the DEFAULT runs all samples in your sample list. This is helpful if only some of your jobs arrays fail and you need to re-run only those.

Here is an example using the -t flag and SAM_Processing handler:

./sequence_handling SAM_Processing /path/to/config -t 1-5,10,12

Recommended Workflow

Workflow

To start, run Quality_Assessment on your raw FastQ files. The Quality_Assessment handler runs FastQC on a series of samples and outputs metrics used for quality control. It accepts FASTQ, SAM, and BAM files as input and outputs a summary table and individual HTML files for visualization. The Quality_Assessment handler depends on FastQC and GNU Parallel.

The Adapter_Trimming handler uses Scythe to trim specific adapter sequences from FastQ files. This handler differentiates between forward, reverse, and single-end FastQ files automatically. The Adapter_Trimming handler depends on Scythe and GNU Parallel.

After Adapter_Trimming, it is recommended to run Quality_Assessment again on the trimmed FastQ files to ensure that all adapter contamination was properly removed.

The Read_Mapping handler maps sequence reads to a reference genome using BWA-MEM. This handler uses Torque Task Arrays, part of the Portable Batch System. The Read_Mapping handler depends on the Burrows-Wheeler Aligner.

The SAM_Processing handler converts the SAM files from read mapping with BWA to the BAM format using SAMTools. In the conversion process, it will sort and deduplicate the data for the finished BAM file, also using SAMTools. Alignment statistics will also be generated for both raw and finished BAM files. The SAM_Processing handler depends on SAMTools and GNU Parallel.

The Coverage_Mapping handler generates coverage histograms and summary statistics from BAM files using BEDTools. Plots of coverage are generated using R based on coverage maps. The Coverage_Mapping handler depends on BEDTools, R, and GNU Parallel.

To begin the variant discovery process from your finished BAM files, the Haplotype_Caller handler uses GATK to generate genomic VCF files for each sample.

Import GVCF files output from Haplotype_Caller into a GenomicsDB workspace. Genomics_DB_Import in GATK 4 merges GVCF files from multiple samples before joint genotyping (Note: it has the same functionality as CombineGVCFs in previous versions of GATK). Because this step pools together all of your samples into one file, it is essential that all samples are included for this step. Automatically breaking the process into chromosome parts or smaller regions for each chromosome allows the job to be "parallelized across regions" by running as a task array and speeds up computing time.

The Genotype_GVCFs hander converts the GVCF files for the entire dataset into VCF files broken up by chromosome or chromosome part using GATK. Breaking the output into chromosome parts allows the process to be split into a task array and greatly speeds up processing time.

The Create_HC_Subset handler creates a single VCF file that contains only the high-confidence sites for your samples. This filtering is performed in multiple steps using several different user-defined parameters and before-and-after percentile tables are generated. Create_HC_Subset depends on VCFtools and vcflib for manipulating the VCF file.

The Variant_Recalibrator handler uses the GATK and user-provided prior sets of "truth" variants to create a model that attempts to separate true variants from false positives. An unfiltered VCF file the the FILTER field annotated is generated.

The Variant_Filtering handler creates a single variant call format (VCF) file that contains only high-quality sites and genotypes for your samples. This filtering is performed in multiple steps using several different user-defined parameters and before-and-after percentile tables are generated. Variant_Filtering depends on VCFtools and vcflib for manipulating the VCF file.

The Variant_Analysis handler uses a variety of dependencies to produce statistics about the input VCF file. Information generated by the handler includes heterozygosity summaries, missing-ness summaries, a minor allele frequency histogram, the Ts/Tv ratio, and the raw count of SNPs. Additional information is output for barley samples. Variant_Analysis depends on VCFtools, vcflib, molpopgen, Python3, GNU Parallel, BCFtools, R, TeX Live, and the Enthought Python Distribution.

Troubleshooting

Please check the Common Problems and Errors page on the wiki for help and visit the wiki page for the handler(s) you're using. If you are still having difficulties, please submit an issue through the Git issues page for this repository.

Future Handlers

The following handlers are planned for future versions of sequence_handling.

GBS_Demultiplexing

The GBS_Demultiplexing handler will demulitplex raw GBS reads into split FastQ files using the FASTX-Toolkit. Code for GBS_Demultiplexing is already present in sequence_handling for the purposes of collaborative alpha testing, but should not be used since it's extremely buggy.

Coverage Mapping histograms using R

Longranger support for 10x reads

Tools for QC and mapping long nanopore reads with minimap2

Other suggestions

If you feel like there are tools or alternative processing techniques missing from sequence_handling, please submit a feature request through the Git issues page.

Citation

sequence_handling can be cited like:

Version 3:

Chaochih Liu, Paul J. Hoffman, Skylar R. Wyant, Emily Dittmar, Naoki Takebayashi, Samuel Hamann, Li Lei, Peter L. Morrell. MorrellLAB/sequence_handling: Release v3.0: SNP calling with GATK 4.1 and Slurm compatibility. Zenodo. https://doi.org/10.5281/zenodo.6608661

Version 2:

Paul J. Hoffman, Skylar R. Wyant, Thomas J.Y. Kono, & Peter L. Morrell. (2018, June 1). MorrellLAB/sequence_handling: Release v2.0: SNP calling with GATK 3.8 (Version v2.0). Zenodo. http://doi.org/10.5281/zenodo.1257692

Version 1:

Hoffman PJ, Wyant SR, Kono TJY, Morrell PL. (2018). sequence_handling: A pipeline to automate sequence workflows. Zenodo. https://doi.org/10.5281/zenodo.1257692