crosscheckFingerprintsCollector

crosscheckFingerprintsCollector is a workflow that generates per-library genotype fingerprints using gatk ExtractFingerprint. The outputs are VCF files that can be processed with gatk CrosscheckFingerprints.

Overview

Dependencies

Usage

Cromwell

java -jar cromwell.jar run crosscheckFingerprintsCollector.wdl --inputs inputs.json

Inputs

Required workflow parameters:

| Parameter | Value | Description |
|---|---|---|
| `inputType` | String | Type of input, either fastq or bam |
| `aligner` | String | Aligner to use for fastq input, either bwa or star |
| `markDups` | Boolean | Whether the alignment should be duplicate marked; generally yes |
| `filterBam` | Boolean | Whether to prefilter the bam file to intervals with filterBamToIntervals; generally true |
| `outputFileNamePrefix` | String | Prefix for the output file names |
| `reference` | String | The reference genome for the input sample |
| `sampleId` | String | Value that will be used as the sample identifier in the vcf fingerprint |

Optional workflow parameters:

| Parameter | Value | Default | Description |
|---|---|---|---|
| `fastqR1` | File? | None | fastq file for read 1 |
| `fastqR2` | File? | None | fastq file for read 2 |
| `bam` | File? | None | bam file |
| `bamIndex` | File? | None | bam index file |
| `maxReads` | Int | 0 | The maximum number of reads to process; if set, this will sample the requested number of reads |
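
The workflow-level parameters above can be gathered into the `inputs.json` file passed to Cromwell. The following is a minimal sketch for bam input. It assumes the usual Cromwell convention of prefixing each key with the workflow name, and every value (paths, sample identifier, the `hg38` reference label) is a hypothetical placeholder, not a value confirmed by this workflow. `aligner` is supplied because it is listed as required, even though it is described as applying to fastq input.

```shell
set -eu

# Write a minimal inputs.json for bam input.
# All paths and values below are hypothetical placeholders.
cat > inputs.json <<'EOF'
{
  "crosscheckFingerprintsCollector.inputType": "bam",
  "crosscheckFingerprintsCollector.aligner": "bwa",
  "crosscheckFingerprintsCollector.markDups": true,
  "crosscheckFingerprintsCollector.filterBam": true,
  "crosscheckFingerprintsCollector.outputFileNamePrefix": "SAMPLE_0001",
  "crosscheckFingerprintsCollector.reference": "hg38",
  "crosscheckFingerprintsCollector.sampleId": "SAMPLE_0001",
  "crosscheckFingerprintsCollector.bam": "/path/to/SAMPLE_0001.bam",
  "crosscheckFingerprintsCollector.bamIndex": "/path/to/SAMPLE_0001.bai"
}
EOF

# Count the workflow-level keys that were written (one per line).
key_count=$(grep -c '"crosscheckFingerprintsCollector\.' inputs.json)
echo "wrote inputs.json with ${key_count} keys"
```

For fastq input, the `bam` and `bamIndex` entries would presumably be replaced by `fastqR1` and `fastqR2`, with `inputType` set to `fastq`.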

Optional task parameters:

| Parameter | Value | Default | Description |
|---|---|---|---|
| `downsample.jobMemory` | Int | 8 | Memory allocated for job |
| `downsample.timeout` | Int | 24 | Timeout in hours, needed to override imposed limits |
| `bwaMem.adapterTrimmingLog_timeout` | Int | 48 | Hours before task timeout |
| `bwaMem.adapterTrimmingLog_jobMemory` | Int | 12 | Memory allocated for indexing job |
| `bwaMem.indexBam_timeout` | Int | 48 | Hours before task timeout |
| `bwaMem.indexBam_modules` | String | "samtools/1.9" | Modules for running indexing job |
| `bwaMem.indexBam_jobMemory` | Int | 12 | Memory allocated for indexing job |
| `bwaMem.bamMerge_timeout` | Int | 72 | Hours before task timeout |
| `bwaMem.bamMerge_modules` | String | "samtools/1.9" | Required environment modules |
| `bwaMem.bamMerge_jobMemory` | Int | 32 | Memory allocated for indexing job |
| `bwaMem.runBwaMem_timeout` | Int | 96 | Hours before task timeout |
| `bwaMem.runBwaMem_jobMemory` | Int | 32 | Memory allocated for this job |
| `bwaMem.runBwaMem_threads` | Int | 8 | Requested CPU threads |
| `bwaMem.runBwaMem_addParam` | String? | None | Additional BWA parameters |
| `bwaMem.adapterTrimming_timeout` | Int | 48 | Hours before task timeout |
| `bwaMem.adapterTrimming_jobMemory` | Int | 16 | Memory allocated for this job |
| `bwaMem.adapterTrimming_addParam` | String? | None | Additional cutadapt parameters |
| `bwaMem.adapterTrimming_modules` | String | "cutadapt/1.8.3" | Required environment modules |
| `bwaMem.slicerR2_timeout` | Int | 48 | Hours before task timeout |
| `bwaMem.slicerR2_jobMemory` | Int | 16 | Memory allocated for this job |
| `bwaMem.slicerR2_modules` | String | "slicer/0.3.0" | Required environment modules |
| `bwaMem.slicerR1_timeout` | Int | 48 | Hours before task timeout |
| `bwaMem.slicerR1_jobMemory` | Int | 16 | Memory allocated for this job |
| `bwaMem.slicerR1_modules` | String | "slicer/0.3.0" | Required environment modules |
| `bwaMem.countChunkSize_timeout` | Int | 48 | Hours before task timeout |
| `bwaMem.countChunkSize_jobMemory` | Int | 16 | Memory allocated for this job |
| `bwaMem.numChunk` | Int | 1 | Number of chunks to split fastq file [1, no splitting] |
| `bwaMem.trimMinLength` | Int | 1 | Minimum length of reads to keep [1] |
| `bwaMem.trimMinQuality` | Int | 0 | Minimum quality of read ends to keep [0] |
| `bwaMem.adapter1` | String | "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC" | Adapter sequence to trim from read 1 [AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC] |
| `bwaMem.adapter2` | String | "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT" | Adapter sequence to trim from read 2 [AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT] |
| `star.indexBam_timeout` | Int | 48 | Hours before task timeout |
| `star.indexBam_modules` | String | "picard/2.19.2" | Modules for running indexing job |
| `star.indexBam_jobMemory` | Int | 12 | Memory allocated for indexing job |
| `star.runStar_timeout` | Int | 72 | Hours before task timeout |
| `star.runStar_jobMemory` | Int | 64 | Memory allocated for this job |
| `star.runStar_threads` | Int | 6 | Requested CPU threads |
| `star.runStar_peOvMMp` | Float | 0.1 | Maximum proportion of mismatched bases in the overlap area |
| `star.runStar_chimSegmentReadGapMax` | Int | 3 | Maximum gap in the read sequence between chimeric segments |
| `star.runStar_peOvNbasesMin` | Int | 10 | Minimum number of overlap bases to trigger mates merging and realignment |
| `star.runStar_chimOutJunForm` | Int? | None | Flag to add metadata to chimeric junction output for functionality with starFusion: 1 for metadata, 0 for no metadata |
| `star.runStar_chimNonchimScoDMin` | Int | 10 | To trigger chimeric detection, the drop in the best non-chimeric alignment score with respect to the read length has to be greater than this value |
| `star.runStar_chimMulmapNmax` | Int | 50 | Maximum number of chimeric multi-alignments |
| `star.runStar_chimScoreSeparation` | Int | 1 | Minimum difference (separation) between the best chimeric score and the next one |
| `star.runStar_chimScoJunNonGTAG` | Int | 0 | Penalty for a non-GTAG chimeric junction |
| `star.runStar_chimMulmapScoRan` | Int | 3 | The score range for multi-mapping chimeras below the best chimeric score |
| `star.runStar_alignIntMax` | Int | 100000 | Maximum intron size |
| `star.runStar_alignMatGapMax` | Int | 100000 | Maximum gap between two mates |
| `star.runStar_alignSJDBOvMin` | Int | 10 | Minimum overhang for annotated spliced alignments |
| `star.runStar_chimJunOvMin` | Int | 10 | Minimum overhang for a chimeric junction |
| `star.runStar_chimSegmin` | Int | 10 | Minimum length of a chimeric segment |
| `star.runStar_multiMax` | Int | -1 | multiMax parameter for STAR |
| `star.runStar_saSparsed` | Int | 2 | saSparsed parameter for STAR |
| `star.runStar_uniqMAPQ` | Int | 255 | Score for unique mappers |
| `star.runStar_chimScoreDropMax` | Int | 30 | Maximum drop (difference) of chimeric score (the sum of scores of all chimeric segments) from the read length |
| `star.runStar_outFilterMultimapNmax` | Int | 50 | Maximum number of multiple alignments allowed for a read: if exceeded, the read is considered unmapped |
| `star.runStar_addParam` | String? | None | Additional STAR parameters |
| `star.runStar_genereadSuffix` | String | "ReadsPerGene.out" | ReadsPerGene file suffix |
| `star.runStar_chimericjunctionSuffix` | String | "Chimeric.out" | Suffix for chimeric junction file |
| `star.runStar_transcriptomeSuffix` | String | "Aligned.toTranscriptome.out" | Suffix for transcriptome-aligned file |
| `star.runStar_starSuffix` | String | "Aligned.sortedByCoord.out" | Suffix for sorted file |
| `filterBamToIntervals.jobMemory` | Int | 16 | Memory allocated for job |
| `filterBamToIntervals.overhead` | Int | 6 | Memory allocated to overhead of the job other than used in markDuplicates command |
| `filterBamToIntervals.timeout` | Int | 24 | Timeout in hours, needed to override imposed limits |
| `splitStringToArray.lineSeparator` | String | "," | Interval group separator; these are the intervals to split by. |
| `splitStringToArray.recordSeparator` | String | "+" | Separator between intervals within a group; this can be used to combine multiple intervals into one group. |
| `splitStringToArray.jobMemory` | Int | 1 | Memory allocated to job (in GB). |
| `splitStringToArray.cores` | Int | 1 | The number of cores to allocate to the job. |
| `splitStringToArray.timeout` | Int | 1 | Maximum amount of time (in hours) the task can run for. |
| `splitStringToArray.modules` | String | "" | Environment module name and version to load (space separated) before command execution. |
| `markDuplicates.jobMemory` | Int | 16 | Memory allocated for job |
| `markDuplicates.overhead` | Int | 6 | Memory allocated to overhead of the job other than used in markDuplicates command |
| `markDuplicates.timeout` | Int | 24 | Timeout in hours, needed to override imposed limits |
| `mergeBams.additionalParams` | String? | None | Additional parameters to pass to GATK MergeSamFiles. |
| `mergeBams.jobMemory` | Int | 24 | Memory allocated to job (in GB). |
| `mergeBams.overhead` | Int | 6 | Java overhead memory (in GB). jobMemory - overhead == java Xmx/heap memory. |
| `mergeBams.cores` | Int | 1 | The number of cores to allocate to the job. |
| `mergeBams.timeout` | Int | 6 | Maximum amount of time (in hours) the task can run for. |
| `mergeBams.modules` | String | "gatk/4.1.6.0" | Environment module name and version to load (space separated) before command execution. |
| `alignmentMetrics.jobMemory` | Int | 8 | Memory allocated for job |
| `alignmentMetrics.timeout` | Int | 24 | Timeout in hours, needed to override imposed limits |
| `extractFingerprint.jobMemory` | Int | 8 | Memory allocated for job |
| `extractFingerprint.timeout` | Int | 24 | Timeout in hours, needed to override imposed limits |

Outputs

| Output | Type | Description | Labels |
|---|---|---|---|
| `outputVcf` | File | The crosscheck fingerprint, gzipped vcf file | vidarr_label: outputVcf |
| `outputTbi` | File | Index for the vcf fingerprint | vidarr_label: outputTbi |
| `json` | File | Metrics in json format, currently only the mean coverage for the alignment | vidarr_label: json |
| `samstats` | File | Output from the samstats summary | vidarr_label: samstats |

Commands

This section lists the commands run by the crosscheckFingerprintsCollector workflow.

  • Running crosscheckFingerprintsCollector

Fingerprint Generation

    set -euo pipefail

    $GATK_ROOT/bin/gatk ExtractFingerprint \
                        -R ~{refFasta} \
                        -H ~{haplotypeMap} \
                        -I ~{inputBam} \
                        -O ~{outputFileNamePrefix}.vcf \
                        --SAMPLE_ALIAS ~{sampleId}

    $TABIX_ROOT/bin/bgzip -c ~{outputFileNamePrefix}.vcf > ~{outputFileNamePrefix}.vcf.gz
    $TABIX_ROOT/bin/tabix -p vcf ~{outputFileNamePrefix}.vcf.gz
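
The fingerprint VCFs produced by this step are intended as input to gatk CrosscheckFingerprints, which is run outside this workflow. As a rough sketch of that downstream step, the snippet below assembles a CrosscheckFingerprints command for two fingerprint files. The file names and haplotype map path are hypothetical, the choice of `--CROSSCHECK_BY SAMPLE` is an assumption, and the command is only printed, not executed.

```shell
set -eu

# Hypothetical inputs: fingerprint VCFs produced by this workflow for two
# libraries, plus the haplotype map that was used to generate them.
haplotype_map="/path/to/haplotype_map.txt"

# Build the argument list in the positional parameters, one -I per VCF.
set -- CrosscheckFingerprints -H "$haplotype_map" --CROSSCHECK_BY SAMPLE -O crosscheck.metrics
for vcf in libraryA.vcf.gz libraryB.vcf.gz; do
  set -- "$@" -I "$vcf"
done

# Print the command instead of running it; remove the echo to execute.
crosscheck_cmd="gatk $*"
echo "$crosscheck_cmd"
```

Comparing by sample groups the fingerprints by the `--SAMPLE_ALIAS` value written during extraction, so libraries that share a `sampleId` are treated as one group.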

Downsampling, if requested

    set -euo pipefail

    seqtk sample -s 100 ~{fastqR1} ~{maxReads} > ~{fastqR1m}
    gzip ~{fastqR1m}

    seqtk sample -s 100 ~{fastqR2} ~{maxReads} > ~{fastqR2m}
    gzip ~{fastqR2m}

Duplicate Marking, if requested

    set -euo pipefail
    $GATK_ROOT/bin/gatk MarkDuplicates \
                        -I ~{inputBam} \
                        --METRICS_FILE ~{outputFileNamePrefix}.dupmetrics \
                        --VALIDATION_STRINGENCY SILENT \
                        --CREATE_INDEX true \
                        -O ~{outputFileNamePrefix}.dupmarked.bam

Coverage Assessment

    set -euo pipefail
    $SAMTOOLS_ROOT/bin/samtools coverage ~{inputBam} > ~{outputFileNamePrefix}.coverage.txt
    cat ~{outputFileNamePrefix}.coverage.txt | grep -P "^chr\d+\t|^chrX\t|^chrY\t" | awk '{ space += ($3-$2)+1; bases += $7*($3-$2);} END { print bases/space }' | awk '{print "{\"mean coverage\":" $1 "}"}' > ~{outputFileNamePrefix}.json
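
To make the coverage arithmetic above easier to follow, here is a small worked sketch on synthetic data. The table is a hand-written stand-in for `samtools coverage` output (columns: rname, startpos, endpos, numreads, covbases, coverage, meandepth, meanbaseq, meanmapq), and the file names are hypothetical. The grep and the first awk stage are folded into a single awk call that applies the same chromosome filter and the same formula.

```shell
set -eu

# Synthetic, tab-separated stand-in for `samtools coverage` output.
printf '#rname\tstartpos\tendpos\tnumreads\tcovbases\tcoverage\tmeandepth\tmeanbaseq\tmeanmapq\n' > coverage.txt
printf 'chr1\t1\t1000\t200\t900\t90\t10\t36\t60\n' >> coverage.txt
printf 'chrX\t1\t500\t50\t400\t80\t4\t36\t60\n' >> coverage.txt
printf 'chrM\t1\t100\t999\t100\t100\t500\t36\t60\n' >> coverage.txt
printf 'chr1_KI270706v1_random\t1\t200\t5\t50\t25\t1\t36\t60\n' >> coverage.txt

# Keep only chr<number>, chrX and chrY, then accumulate the total length
# (space) and depth-weighted bases exactly as the workflow command does:
# $2 = startpos, $3 = endpos, $7 = meandepth.
mean_coverage=$(awk -F'\t' '
  $1 ~ /^chr([0-9]+|X|Y)$/ {
    space += ($3 - $2) + 1
    bases += $7 * ($3 - $2)
  }
  END { print bases / space }
' coverage.txt)

# Emit the same single-key JSON document as the workflow.
printf '{"mean coverage":%s}\n' "$mean_coverage" > coverage.json
echo "mean coverage: $mean_coverage"
```

On this input chrM and the unplaced contig are ignored. The result comes out just under 8, the length-weighted mean of depths 10 and 4 over 1000 and 500 bases, because the formula multiplies depth by `end - start` but divides by `end - start + 1`.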

Support

For support, please file an issue on the GitHub project or send an email to gsi@oicr.on.ca.

Generated with generate-markdown-readme (https://github.com/oicr-gsi/gsi-wdl-tools/)