Fork of gatk4-germline-snps-indels. Reason for fork is to add multi-lane (undefined amount) support per single sample.
Multi-lane support is added similarly as WARP pipelines - Exome Germline Single Sample Overview does it. MarkDuplicates process combines all lane files per sample.
- added extensive README.md in example folder for analysis example (with data)
- multi-lane/multilane/multiple lanes/multiple lane files support
-
input_fofn (variable in params file) format is changed
-
lane column is added, headers now are:
readgroup_name | sample_name | fastq_1 | fastq_2 | lane | library_name | platform_unit | run_date | platform_name | sequencing_center
-
-
GATK_MERGE_BAM_ALIGNMENT.out channel is edited by combining all lane and lane file information under same sample by groupTuple()
-
GATK_MARK_DUPLICATES module/mark_duplicates.nf process script is edited to add prefix (--INPUT, and suffix space), so that multiple bam files each have the form "--INPUT bam1 --INPUT bam2" etc.
-
all workflow processess prior to GATK_MARK_DUPLICATES have input value lane or lanes added as well as in output value(s)
-
tmp_dir variable is added to params.yaml as user defined path
- this is specific to GATK_SORT_AND_FIX_TAGS module/process in preprocessing_mapping.nf workflow
- when executing GATK_SORT_AND_FIX_TAGS process in local setting the server/machine /tmp runs out of space, --TMP_DIR parametre is added to gatk SortSam and SetNmMdAndUqTags commands
-
- Write local+singularity example for exome sequencing data in this README.md
- Make full instructions with IGSR/1000genomes exome data as example
- special note on creating scattered interval lists from your target interval list + inconsistancies
- give examples of channel handling with printed outputs
- string & array handling in process with groovy-lang
- Make full instructions with IGSR/1000genomes exome data as example
- Update used containers
- test
- Test with SLURM
- Write documentation
- Test with T2T v1.1 genome assembly
- Test with GRCh37 genome assembly ?
- Joint-calling ?
- Add Sentieon support ?
- multi-lane fastq support
- modify input file table, fofn format
- add lane column to fofn sample table file (one sample can have multiple lanes, one lane per input (fofn) table line)
- pass lane value in processes and workflows previous to MarkDuplicates
- modify MarkDuplicates task input to gather all previously processed lane files per sample
- test with IGSR/1000 Genomes sample(s) or equivalent
- modify input file table, fofn format
Workflow for germline short variant discovery using GATK4 written as per the Nextflow DSL2 best-practices.
This pipeline was developed with significant contributions by Diamond Age Data Science, you can read more about the infrastructure usage on the AWS blogpost.
The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with Docker containers making installation trivial and results highly reproducible.
To run this pipeline locally on your machine, install Nextflow and use the following command:
nextflow run 'https://github.com/seqeralabs/gatk4-germline-snps-indels' \
-params-file params/params.yaml \
-profile docker
-latest
Alternatively, this pipeline is readily executable with Nextflow Tower. It is availble in the Community Showcase workspace at tower.nf. While launching the workflow, please make sure (i) to specify fill the correct parameters via the params
field and (ii) to Config profile
field to docker
.
The code organization is as follows.
$ tree -L 2
.
├── README.md
├── codespaces.md
├── conf
│ └── optimized_processes.config
├── containers
│ ├── Readme.md
│ ├── build.sh
│ └── gatk4
├── main.nf
├── modules
│ ├── bwa
│ ├── gatk
│ └── picard
├── nextflow.config
├── params
│ ├── params.yaml
│ └── test_params.yaml
├── resources
├── test_data
└── workflows
├── format_conversion
├── preprocessing_mapping
├── quality_recalibration
└── variant_discovery
- The
workflows
folder containers the (sub-)workflows in its own unique folder
workflows/
└── format_conversion
├── format_conversion.nf
├── nextflow.config
└── test_params.yaml
- The
modules
are designed around the idea of a specific tool (for examplegatk
) as the core abstraction, which contains wrappers for various sub-commands forgatk
.
modules/gatk/
│
├── paired_fastq_to_unmapped_bam
│ ├── README.md
│ ├── nextflow.config
│ ├── paired_fastq_to_unmapped_bam.nf
│ ├── test_data
│ │ ├── SRR047732_1.10000_reads.filt.fastq.gz
│ │ └── SRR047732_2.10000_reads.filt.fastq.gz
│ └── test_params.yaml
- We have used default parameters for each module, making it executable in a standalone manner. In the
modules/gatk/paired_fastq_to_unmapped_bam/paired_fastq_to_unmapped_bam.nf
, the parameters have been initialized with default values.
params.gatk_path = "/gatk/gatk"
params.java_opts = ""
-
We have named the
process
within these modules following theTOOLNAME_PROCESSNAME
pattern. For example, in themodules/gatk/paired_fastq_to_unmapped_bam/paired_fastq_to_unmapped_bam.nf
, the name of the wrapped process isGATK_PAIRED_FASTQ_TO_UNMAPPED_BAM
. This is helpful when we are stitching these processes together in a workflow file. -
At the workflow level, we have used
namespaced
parameters to override the defaults for a module as shown below
// NOTE: The param override must be above the inclusion of the module.
params.GATK_PAIRED_FASTQ_TO_UNMAPPED_BAM = [
java_opts: "-Xms3000m"
include { GATK_PAIRED_FASTQ_TO_UNMAPPED_BAM } from "../../modules/gatk/paired_fastq_to_unmapped_bam/paired_fastq_to_unmapped_bam.nf" addParams (params.GATK_PAIRED_FASTQ_TO_UNMAPPED_BAM)
This allows us to namespace
the parameters for every module either from (i) the workflow file (ii) nextflow.config
(iii) or the params.yaml
file.
Here's how a yaml
file would look like, with this pattern
GATK_PAIRED_FASTQ_TO_UNMAPPED_BAM:
java_opts:
"-Xms3000m"
-
Use the
modules/utils
module to keep the ad-hoc utilities and name them with the same pattern, i.e.UTILS_PROCESSNAME
-
You can use the
containers
folder to build the individual containers.
It is our recommendation that all pipeline related parameters like input/output location, tool oriented options should be in the yaml
file and all executor/container level optimization and overrides should be in the config
file.
An example params.yaml
is shown below
fasta:
"s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta"
dbSNP_vcf:
"s3://broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf"
An example nextflow.config
file is shown below
manifest {
name = 'GATK4 Germline SNP and Indel Analysis'
description = 'Workflow for germline short variant discovery using GATK4'
version = '0.0.1'
mainScript = 'main.nf'
defaultBranch = 'master'
homePage = 'https://github.com/seqeralabs/gatk4-germline-snps-indels'
nextflowVersion = '>=21.04.1'
}
process.errorStrategy = 'retry'
process.maxRetries = 3
profiles {
dockerhub {
process {
errorStrategy = 'retry'
ext.registry = 'https://hub.docker.com/r/seqeralabs/'
}
docker {
enabled = true
fixOwnership = true
}
}
}
Module
level tests
These are conceptually similar to the traditional Unit test
and the modules
should be designed to be testable individually. For a reference please take a look at /modules/gatk/paired_fastq_to_unmapped_bam/README.md
.
Workflow
level tests
These are conceptually similar to the Integration and End-to-End tests
and the same strategy for the module-level
tests can be used for the (sub)workflow-level
tests as well.