/ClinicalGradeDNAseq

Automated next generation DNA sequencing analysis pipeline suited for clinical tests, with >99.9% sensitivity to Sanger sequencing at read-depth>18

Primary LanguageShellGNU General Public License v3.0GPL-3.0

NB - this code / pipeline structure remains mostly untouched since it was first written by me in 2013 / 14 while working in the National Health Service (NHS) England - it would be designed differently were it written today, and likely using a proper workflow manager. The main areas where it could be developed are:

  1. improved command line parameter parsing
  2. improved error handling, for example, checking output of each code section and making a decision to proceed or not

Although it runs AOK in its current state, anyone re-using this code should make changes as you see fit.

I update this pipeline as I re-use it myself in order to keep it maintained on a low level in line with new program versions that are released.

Last update: Sunday, 31st March, 2019 @ 02:20 BST (GMT+1)

ClinicalGradeDNAseq - random read sampling to improve sensitivity

Automated next generation DNA sequencing analysis pipeline 'suited' for clinical tests, with >99.9% sensitivity to Sanger sequencing for Single Nucleotide Variants (SNVs) at read-depth>18 over target regions over interest.

This clinical-grade analysis pipeline, ClinicalGradeDNAseq, is a watered-down and modified version of the work by Blighe, Beauchamp, and colleagues at Sheffield Diagnostic Genetics Service, Sheffield Children's NHS Foundation Trust, Sheffield, UK, and their efforts to introduce a clinical-grade next generation sequencing (NGS) analysis pipeline fully validated against Sanger di-deoxy sequencing.

The pipeline is built using open source programs mixed with customised scripts. A wrapper script manages command line parameters and then executes the master analysis script. Control is then returned to the wrapper, where results files are transferred to a remote server via SSH/sFTP. A master and concise log is kept, with date- and time-stamps. Results directory structure is formed based on the run number and patient ID.

[Key point] The unique feature of the analysis pipeline that increases sensitivity to Sanger sequencing is in the variant calling step, where a final aligned BAM is split into 3 'sub-BAMs' of 75%, 50%, and 25% random reads. Variants are then called on all 4 BAMs, after which a consensus VCF is produced.

This pipeline proceeds in an 8-step process:

  1. Adaptor and read quality trimming - TrimGalore! (Krueger F), FastQC (Andrews S), cutadapt (Martin M, 2011)
  2. Alignment - bwa mem (Li & Durbin, 2009)
  3. Marking and removing PCR duplicates - Picard (Broad Institute of MIT and Harvard), SAMtools (Li et al., 2009)
  4. Remove low mapping quality reads - SAMtools (Li et al., 2009)
  5. QC - SAMtools (Li et al., 2009), BEDTools (Quinlan & Hall, 2010), custom scripts
  6. Downsampling / random read sampling - Picard (Broad Institute of MIT and Harvard)
  7. Variant calling - SAMtools/BCFtools (Li et al., 2009)
  8. Annotation - Variant Effect Predictor (McLaren et al., 2016)

Requirements

  • Paired-end FASTQ files
  • Chromosome-ordered and position-sorted BED file in hg19 / hg38 (BED files can be sorted with sort -k1,1V -k2,2n)
  • Global installations of cutadapt and unix2dos (included in dos2unix)
  • Valid credentials for returning results files to remote server (username / password) via SSH/sFTP

Execution

    Run the ‘PipelineWrapper’ wrapper script, which will check command-line parameters, execute the master script, and then return results files via SSH/sFTP to remote server. Use the following parameters:
    1. FASTQ mate-pair 1 (absolute file path)
    2. FASTQ mate-pair 2 (absolute file path)
    3. Reference genome FASTA (absolute file path)
    4. Run number (e.g. Plate6, Plate7, etc.) (alphanumeric)
    5. Patient ID (alphanumeric)
    6. BED file (absolute file path)
    7. Minimum quality for bases at read ends, below which bases will be cut (integer)
    8. Minimum allowed read length (integer)
    9. Adaptor for trimming off read ends ('illumina' / 'nextera' / 'small_rna')
    10. Minimum read depth for calling a variant (integer)
    11. Minimum allowed mapping quality (integer)
    12. Stringency for calling variants ('relaxed' / 'normal') (relaxed uses --pval-threshold 1.0 with BCFtools call)
    13. Directory where results will be output (absolute file path)
    14. User initials (alphanumeric)

    Output

    Results files are output locally to [results root]/[run number]/[sample ID]/, with a copy being also sent via SSH/sFTP to a remote server.
    • *_AnalysisLog.txt - analysis log (short)
    • Master.log - analysis log (comprehensive)
    • *_R1_001.fastq.gz_trimming_report.txt - details on base and read trimming for mate-pair 1
    • *_R1_001_val_1_fastqc.html - FastQC report for mate-pair 1 (after trimming)
    • *_R2_001.fastq.gz_trimming_report.txt - details on base and read trimming for mate-pair 2
    • *_R2_001_val_2_fastqc.html - FastQC report for mate-pair 2 (after trimming)
    • *_Alignment.txt - alignment metrics
    • *_ReadsOffTarget.txt - number of reads falling outside regions specified in BED file
    • *_PCRDuplicates.txt - details on identified PCR duplicates
    • *_CoverageTotal.bedgraph - coverage for all mapped locations (contiguous bases at same read depth are merged into regions)
    • *_MeanCoverageBED.bedgraph - mean read depth for each region specified in supplied BED file
    • *_PerBaseDepthBED.bedgraph - per base read depth for each base in each region specified in supplied BED file
    • *_PercentGenomeCovered.txt - percentage of reference genome covered by reads.
    • *_Aligned_Sorted_PCRDuped_FiltMAPQ.bam - aligned BAM file with sorted reads, PCR duplicates removed, and reads below mapping quality threshold removed
    • *_Aligned_Sorted_PCRDuped_FiltMAPQ.bam.bai - index for above BAM file
    • *_Final.vcf - final VCF file
    • *_AnnotationVEP.html - HTML report of variant annotation, with consequences for all known transcript isoforms
    • *_AnnotationVEP.tsv - as above but in tab-separated values (TSV) format

    Hard-coded sections of code

    • PipelineWrapper.sh, line 120: /home/ubuntu/pipeline/AnalysisMasterVersion1.sh "${Read1}" "${Read2}" ... - absolute path filename for AnalysisMasterVersion1.sh
    • PipelineWrapper.sh, line 138: remoteDir="/remote/SAMBA/share/" - Remote server directory to which results files will be transferred via SSH/sFTP
    • PipelineWrapper.sh, line 150, 161: sshpass -e sftp $username@XXX.XXX.XXX.XXX << ! - Remote server IP address or host name to which results files will be transferred via SSH/sFTP
    • AnalysisMasterVersion1.sh, lines 25-35 - root directories (absolute paths) of required programs
    • AnalysisMasterVersion1.sh, lines 286/304 - minimum base quality (--min-BQ) set to 30 for BCFtools mpileup
    • AnalysisMasterVersion1.sh, lines 374 - extra filters for filtering variants via vcfutils.pl (bundled with BCFtools in 'misc' directory)
    • AnalysisMasterVersion1.sh, line 396/7 - species and assembly for VEP set to Homo sapiens and GRCh38

    References

    • Andrews S, FastQC, https://www.bioinformatics.babraham.ac.uk/projects/fastqc/, last accessed 28th August 2017.
    • Broad Institute of MIT and Harvard, Picard, http://broadinstitute.github.io/picard/, last accessed 28th August 2017
    • Krueger F, Trim Galore!, https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/, last accessed 28th August 2017.
    • Li H and Durbin R (2009), Fast and accurate short read alignment with Burrows-Wheeler transform, Bioinformatics 25(14): 1754–1760.
    • Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup (2009), The Sequence Alignment/Map format and SAMtools, Bioinformatics 25(16): 2078-9.
    • Martin M (2011), Cutadapt removes adapter sequences from high-throughput sequencing reads, EMBnet.journal 17(1): 10-12.
    • McLaren W, Gil L, Hunt S, Riat H, Ritchie G, Thormann A, Flicek P, Cunningham F (2016), The Ensembl Variant Effect Predictor, Genome Biology 17: 122.
    • Quinlan AR & Hall IM (2010), BEDTools: a flexible suite of utilities for comparing genomic features, Bioinformatics 26(6): 841-2.

    Credits

    • Kevin Blighe (Sheffield Children's NHS Foundation Trust)
    • Nick Beauchamp (Sheffield Children's NHS Foundation Trust)
    • Darren Grafham (Sheffield Children's NHS Foundation Trust)
    • Lucy Crookes (Sheffield Children's NHS Foundation Trust)
    • Sasirekha Palaniswamy ('Sashi') (Sheffield Children's NHS Foundation Trust)
    • Sheffield Diagnostics Genetics Service