dragmap

This workflow aligns sequence data provided as fastq files using Dragmap (an open source Dragen mapper/aligner). The alignment is completed using a hash table of a reference genome. The workflow borrows functions from the bwaMem workflow, to provide options to remove 5' umi sequences and trim off 3' sequencing adapters prior to alignment. Using these functions, this workflow can also split the input data into a requested number of chunks, align each separately then merge the separate alignments into a single BAM file to decrease workflow runtime. Read-group information must be provided.

Overview

Dependencies

Usage

Cromwell

java -jar cromwell.jar run dragmap.wdl --inputs inputs.json

Inputs

Required workflow parameters:

Parameter	Value	Description
`fastqR1`	File	Fastq file for read 1
`outputFileNamePrefix`	String	Prefix for output files
`reference`	String	The genome reference build. For example: hg19, hg38, mm10
`readGroups`	String	The read-group information to be added into the BAM file header

Optional workflow parameters:

Parameter	Value	Default	Description
`fastqR2`	File?	None	Fastq file for read 2
`numChunk`	Int	1	Number of chunks to split fastq file [1, no splitting]
`doUMIextract`	Boolean	false	If true, UMI will be extracted before alignment [false]
`doTrim`	Boolean	false	If true, adapters will be trimmed before alignment [false]
`numReads`	Int?	None	Number of reads

Optional task parameters:

Parameter	Value	Default	Description
`readGroupCheck.jobMemory`	Int	1	Memory allocated for this job
`readGroupCheck.timeout`	Int	1	Hours before task timeout
`countChunkSize.modules`	String	"python/3.7"	Required environment modules
`countChunkSize.jobMemory`	Int	16	Memory allocated for this job
`countChunkSize.timeout`	Int	48	Hours before task timeout
`slicerR1.modules`	String	"slicer/0.3.0"	Required environment modules
`slicerR1.jobMemory`	Int	16	Memory allocated for this job
`slicerR1.timeout`	Int	48	Hours before task timeout
`slicerR2.modules`	String	"slicer/0.3.0"	Required environment modules
`slicerR2.jobMemory`	Int	16	Memory allocated for this job
`slicerR2.timeout`	Int	48	Hours before task timeout
`extractUMIs.umiList`	String	"umiList"	Reference file with valid UMIs
`extractUMIs.outputPrefix`	String	"extractUMIs_output"	Specifies the start of the output files
`extractUMIs.pattern1`	String	"(?P<umi_1>^[ACGT]{3}[ACG])(?P<discard_1>T)	(?P<umi_2>^[ACGT]{3})(?P<discard_2>T)"
`extractUMIs.pattern2`	String	"(?P<umi_1>^[ACGT]{3}[ACG])(?P<discard_1>T)	(?P<umi_2>^[ACGT]{3})(?P<discard_2>T)"
`extractUMIs.modules`	String	"barcodex-rs/0.1.2 rust/1.45.1"	Required environment modules
`extractUMIs.jobMemory`	Int	24	Memory allocated for this job
`extractUMIs.timeout`	Int	12	Time in hours before task timeout
`adapterTrimming.modules`	String	"cutadapt/1.8.3"	Required environment modules
`adapterTrimming.doUMItrim`	Boolean	false	If true, do umi trimming
`adapterTrimming.umiLength`	Int	5	The number of bases to trim when doUMItrim is true. If the given length is positive, the bases are removed from the beginning of each read. If it is negative, the bases are removed from the end
`adapterTrimming.trimMinLength`	Int	1	Minimum length of reads to keep
`adapterTrimming.trimMinQuality`	Int	0	Minimum quality of read ends to keep
`adapterTrimming.adapter1`	String	"AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"	Adapter sequence to trim from read 1
`adapterTrimming.adapter2`	String	"AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"	Adapter sequence to trim from read 2
`adapterTrimming.addParam`	String?	None	Additional cutadapt parameters
`adapterTrimming.jobMemory`	Int	16	Memory allocated for this job
`adapterTrimming.timeout`	Int	48	Hours before task timeout
`runDragmap.addParam`	String?	None	Additional Dragmap parameters
`runDragmap.jobMemory`	Int	50	Memory allocated for the job
`runDragmap.timeout`	Int	96	Hours until task timeout
`bamMerge.jobMemory`	Int	32	Memory allocated for the job
`bamMerge.modules`	String	"samtools/1.14"	Required environment modules
`bamMerge.timeout`	Int	72	Hours until task timeout
`addReadGroups.modules`	String	"picard/3.1.0"	Required environment modules to run the task
`addReadGroups.jobMemory`	Int	12	Memory allocated for the job
`addReadGroups.timeout`	Int	5	Hours until task timeout
`adapterTrimmingLog.jobMemory`	Int	12	Memory allocated indexing job
`adapterTrimmingLog.timeout`	Int	48	Hours before task timeout

Outputs

Output	Type	Description	Labels
`dragmapBam`	File	Merged BAM file aligned to genome with Dragmap	vidarr_label: dragmapBam
`dragmapIndex`	File	Output index file for the merged BAM file	vidarr_label: dragmapIndex
`log`	File?	A summary log file for adapter trimming	vidarr_label: log
`cutAdaptAllLogs`	File?	A file with the logs from the adapter trimming of each fastq chunk	vidarr_label: cutAdaptAllLogs

Commands

This section lists command(s) run by dragmap workflow

Running dragmap

Ensures the read-group information is valid, and in the correct format prior to running the rest of the workflow.

         fieldNames=("ID=" "LB=" "PL=" "PU=" "SM=" "CN=" "DS=" "DT=" "FO=" "KS=" "PG=" "PI=" "PM=") 
         
         # Split the string into an array 
         IFS=, read -ra readFields <<< ~{readGroups}
 
         for field in "${readFields[@]}"; do 
             tag=${field:0:3}
             validTag=false
             for name in "${fieldNames[@]}"; do
                 if [ "$tag" == "$name" ]; then
                     validTag=true
                 fi
             done 
             if ! $validTag; then
                 # Redirect error message to stderr
                 echo "Invalid tag: '$tag'" >&2  
                 exit 1
             fi
         done

Parallelizes the alignment by splitting the fastq files into chunks. Subsequent steps will be run on the fastq chunks (Optional).

         if [ -z "~{numReads}" ]; then
             totalLines=$(zcat ~{fastqR1} | wc -l)
         else totalLines=$((~{numReads}*4))
         fi
         
         python3 -c "from math import ceil; print (int(ceil(($totalLines/4.0)/~{numChunk})*4))"
         slicer -i ~{fastqR} -l ~{chunkSize} --gzip

Trims off the UMI bases (Optional).

         barcodex-rs --umilist ~{umiList} --prefix ~{outputPrefix} --separator "__" inline \
         --pattern1 '~{pattern1}' --r1-in ~{fastq1} \
         ~{if (defined(fastq2)) then "--pattern2 '~{pattern2}' --r2-in ~{fastq2} " else ""}
 
         cat ~{outputPrefix}_UMI_counts.json > umiCounts.txt
 
         tr [,] ',\n' < umiCounts.txt | sed 's/[{}]//' > tmp.txt
         echo "{$(sort -i tmp.txt)}" > new.txt
         tr '\n' ',' < new.txt | sed 's/,$//' > ~{outputPrefix}_UMI_counts.json

Trims off adapter sequence (Optional).

         cutadapt -q ~{trimMinQuality} \
                 -m ~{trimMinLength} \
                 -a ~{adapter1} \
                 -o ~{resultFastqR1} \
                 ~{if (defined(fastqR2)) then "-A ~{adapter2} -p ~{resultFastqR2} " else ""} \
                 ~{if (doUMItrim) then "-u ~{umiLength} -U ~{umiLength} " else ""} \
                 ~{addParam} \
                 ~{fastqR1} \
                 ~{fastqR2} > ~{resultLog}

Align to reference using Dragmap.

         dragen-os \
             -r ~{dragmapHashTable} \
             -1 ~{read1s} \
             ~{if (defined(read2s)) then "-2 ~{read2s}" else ""} \
             ~{addParam} \
         | \
         samtools view -b -o ~{resultBam} -

Merge parallelized alignments (if the fastq was split) and sort the BAM file.

         mkdir -p ~{tmpDir}
 
         samtools merge -O bam - ~{sep=" " outputBams} \
         | \
         samtools sort -O bam -T ~{tmpDir} -o ~{resultMergedBam} -

Assigns read-groups and index BAM file.

         # An array containing potential read-group fields
         fieldNames=("ID=" "LB=" "PL=" "PU=" "SM=" "CN=" "DS=" "DT=" "FO=" "KS=" "PG=" "PI=" "PM=")
 
         # Declares an associative array and appends the values of the fields present in readGroups string
         declare -A fieldsArray
         
         for field in "${fieldNames[@]}"; do
             if [[ ~{readGroups} == *${field}* ]]; then 
                 fieldsArray[${field}]=$(echo ~{readGroups} | awk -F "${field}" '{print $2}' | cut -d ',' -f 1) 
             fi
         done
 
         java -jar picard.jar AddOrReplaceReadGroups \
             CREATE_INDEX=true \
             I= ~{mergedBam} \
             O= ~{resultReadGroupBam} \
             $( [[ -v fieldsArray["ID="] ]] && echo "RGID=${fieldsArray["ID="]}" || : ) \
             $( [[ -v fieldsArray["LB="] ]] && echo "RGLB=${fieldsArray["LB="]}" || : ) \
             $( [[ -v fieldsArray["PL="] ]] && echo "RGPL=${fieldsArray["PL="]}" || : ) \
             $( [[ -v fieldsArray["PU="] ]] && echo "RGPU=${fieldsArray["PU="]}" || : ) \
             $( [[ -v fieldsArray["SM="] ]] && echo "RGSM=${fieldsArray["SM="]}" || : ) \
             $( [[ -v fieldsArray["CN="] ]] && echo "RGCN=${fieldsArray["CN="]}" || : ) \
             $( [[ -v fieldsArray["DS="] ]] && echo "RGDS=${fieldsArray["DS="]}" || : ) \
             $( [[ -v fieldsArray["DT="] ]] && echo "RGDT=${fieldsArray["DT="]}" || : ) \
             $( [[ -v fieldsArray["FO="] ]] && echo "RGFO=${fieldsArray["FO="]}" || : ) \
             $( [[ -v fieldsArray["KS="] ]] && echo "RGKS=${fieldsArray["KS="]}" || : ) \
             $( [[ -v fieldsArray["PG="] ]] && echo "RGPG=${fieldsArray["PG="]}" || : ) \
             $( [[ -v fieldsArray["PI="] ]] && echo "RGPI=${fieldsArray["PI="]}" || : ) \
             $( [[ -v fieldsArray["PM="] ]] && echo "RGPM=${fieldsArray["PM="]}" || : )

Merges parallelized adapter trimming logs.

         COMMANDS NOT SHOWN, see WDL for details

Support

For support, please file an issue on the Github project or send an email to gsi@oicr.on.ca .

Generated with generate-markdown-readme (https://github.com/oicr-gsi/gsi-wdl-tools/)

oicr-gsi/dragmap