/cfMedipsQc

cfMedipsQC workflow produces a set of metrics files for sequencing data generated in methylation profiling of circulating Free DNA.

Primary LanguageWDL

cfMedipsQc

cfMedipsQC workflow produces a set of metrics files for sequencing data generated in methylation profiling of circulating Free DNA

cfMedipsQC flowchart

Overview

Dependencies

Usage

Cromwell

java -jar cromwell.jar run cfMedipsQc.wdl --inputs inputs.json

Inputs

Required workflow parameters:

Parameter Value Description
fastq1 File Read 1 input fastq file
fastq2 File Read 2 input fastq file
reference String reference id for using with the analysis
fastqFormat String Quality encoding, default is phred33, but can be set to phred64
extractMedipsCounts.medips_script String Path to the wrapper medips.R script

Optional workflow parameters:

Parameter Value Default Description
window Int 300 window length, over which to assess

Optional task parameters:

Parameter Value Default Description
trimming.fastq1Basename String basename("~{fastq1}","_R1_001.fastq.gz") Basename for the first fastq
trimming.fastq2Basename String basename("~{fastq2}","_R2_001.fastq.gz") Basename for the second fastq
trimming.headCrop Int 5 How many bases to crop
trimming.threads Int 6 Requested CPU threads
trimming.jobMemory Int 16 Memory (GB) allocated for this job
trimming.timeout Int 6 Number of hours before task timeout
trimming.modules String "trimmomatic/0.39" Module needed to run trimmomatic extract
alignment.basename String basename("~{fastq1Paired}",".R1_paired.fastq.gz") Name to make output sam file
alignment.threads Int 8 Requested CPU threads
alignment.jobMemory Int 16 Memory (GB) allocated for this job
alignment.timeout Int 6 Number of hours before task timeout
alignment.modules String "bowtie2/2.1.0 ~{referenceModule}" Module needed to run bowtie2 alignment
preprocessing.basename String basename("~{samFile}",".sam") Name to make output files
preprocessing.threads Int 8 Requested CPU threads
preprocessing.jobMemory Int 16 Memory (GB) allocated for this job
preprocessing.timeout Int 6 Number of hours before task timeout
preprocessing.modules String "samtools/1.9 picard/2.21.2" Module needed to run preprocessing
alignmentMetrics.basename String basename("~{dedupBam}",".sorted.dedup.bam") Name to make output files
alignmentMetrics.threads Int 8 Requested CPU threads
alignmentMetrics.jobMemory Int 32 Memory (GB) allocated for this job
alignmentMetrics.timeout Int 6 Number of hours before task timeout
alignmentMetrics.modules String "samtools/1.9 picard/2.21.2 ~{referenceModule} bc/2.1.3 rstats/3.5" Modules needed to run alignment metrics
splitFaiToArray.memory Int 1 Memory allocated for this job
splitFaiToArray.timeout Int 1 Hours before task timeout
getChromosomeLength.memory Int 1 Memory allocated for this job
getChromosomeLength.timeout Int 1 Hours before task timeout
getChrCoefficient.memory Int 1 Memory allocated for this job
getChrCoefficient.timeout Int 1 Hours before task timeout
extractMedipsCounts.basename String basename("~{dedupBam}",".sorted.dedup.bam") basename for the sample
extractMedipsCounts.convert2bed String "$BEDOPS_ROOT/convert2bed" path to conver2bed program
extractMedipsCounts.threads Int 8 Requested CPU threads
extractMedipsCounts.jobMemory Int 32 Memory (GB) allocated for this job
extractMedipsCounts.timeout Int 6 Number of hours before task timeout
extractMedipsCounts.minCount Int 5 Minimal counts required to process a bam file
extractMedipsCounts.modules String "samtools/1.9 rstats/3.5 cfmedips/1.5.1 bedops/2.4.37" Modules needed to run alignment metrics
aggregateMetrics.jobMemory Int 1 Memory allocated for this job
aggregateMetrics.timeout Int 1 Hours before task timeout
finalMetrics.threads Int 8 Requested CPU threads
finalMetrics.jobMemory Int 16 Memory (GB) allocated for this job
finalMetrics.timeout Int 6 Number of hours before task timeout
finalMetrics.modules String "cfmedips/1.5.1" Modules needed to run alignment metrics

Outputs

Output Type Description Labels
outputAlignmentSummaryMetrics File Metric for alignments vidarr_label: outputAlignmentSummaryMetrics
outputBaseDistributionMetrics File Metrics for base distributions vidarr_label: outputBaseDistributionMetrics
outputInsertSizeMetrics File? Metrics for insert size (Optional, when enough data are available) vidarr_label: outputInsertSizeMetrics
outputQualityByCycleMetrics File Quality by cycle metrics vidarr_label: outputQualityByCycleMetrics
outputQualityDistributionMetrics File Quality distribution metrics vidarr_label: outputQualityDistributionMetrics
outputGcBiasMetrics File gc bias metrics vidarr_label: outputGcBiasMetrics
outputSummaryGcBiasMetrics File Summary of gc bias metrics
outputThaliaSummary File Summary of thalia chromosomes vidarr_label: outputThaliaSummary
outputDedupBam File De-duplicated bam file vidarr_label: outputDedupBam
outputqcMetrics File Final output vidarr_label: outputqcMetrics

Commands

This section lists commands run by cfMedipsQC workflow

  • Running cfMedipsQC

cfMedipsQC is designed to produce several QC metrics using samtools and picard. Runs it's own alignment with bowtie2

Calculate scaling coefficient for dynamic allocation of RAM

   grep -w ^~{chromosome} REF_FAI | cut -f 2 | awk '{print int(($1/~{LARGEST_CHR_BASES} + 0.1) * 10)'

Run read trimming with trimmomatic

    set -euo pipefail
    trimmomatic PE  FASTQ_R1 FASTQ_R2 \
                    "-phred33 \  (can be set to phred64)
                    "FASTQ_R1_BASENAME.R1_paired.fastq.gz" "FASTQ_R1_BASENAME.R1_unpaired.fastq.gz" "FASTQ_R2_BASENAME.R2_paired.fastq.gz" "FASTQ_R2_BASENAME.R2_unpaired.fastq.gz" \
                    HEADCROP:5 (how many bases to crop, configurable)

Align with Bowtie2

    set -euo pipefail
    bowtie2 -p 8 -x REFEFENCE_FASTA \
            -1 FASTQ_R1 \
            -2 FASTQ_R2 \
            -S "BASENAME.sam"

Preprocessing: sort and mark duplicates without removing them

    set -euo pipefail
    samtools view -bS SAM_FILE | samtools sort -o "BASENAME.sorted.bam"
    java -jar picard.jar MarkDuplicates \
      I="BASENAME.sorted.bam" \
      O="BASENAME.sorted.dedup.bam" \
      M="BASENAME.sorted.dedup.metrics" \
      ASSUME_SORTED=true \
      CREATE_INDEX=true \
      VALIDATION_STRINGENCY=SILENT \
      REMOVE_DUPLICATES=false

Collecting metrics using picard tools

    set -euo pipefail
    java -jar picard.jar CollectMultipleMetrics \
      R=REFERENCE_GENOME\
      I=DEDUPLICATED_BAM \
      O=BASENAME \
      VALIDATION_STRINGENCY=SILENT
    java -jar picard.jar CollectGcBiasMetrics \
      R=REFERENCE_GENOME \
      I=DEDUPLICATED_BAM \
      O="BASENAME.gc_bias_metrics.txt" \
      S="BASENAME.summary_gc_bias_metrics.txt" \
      CHART="BASENAME.gc_bias_metrics.pdf"
    samtools view DEDUPLICATED_BAM | cut -f 3 | sort | uniq -c | sort -nr | sed -e 's/^ *//;s/ /\t/' | awk 'OFS="\t" {print $2,$1}' | sort -n -k1,1 > thalia.counts
    total=$(samtools view DEDUPLICATED_BAM | wc -l)
    if [[ $(cat thalia.counts | grep "^\*" | cut -f2) == "" ]]; then unmap=0; else unmap=$(cat thalia.counts | grep "^\*" | cut -f2); fi
    if [[ $(cat thalia.counts | grep F19K16 | cut -f2) == "" ]]; then methyl=0; else methyl=$(cat thalia.counts | grep F19K16 | cut -f2); fi
    if [[ $(cat thalia.counts | grep F24B22 | cut -f2) == "" ]]; then unmeth=0; else unmeth=$(cat thalia.counts | grep F24B22 | cut -f2); fi
    if [[ $total == "" ]]; then pct_thalia=0; else pct_thalia=$(echo "scale=3; ($methyl + $unmeth)/$total * 100" | bc -l); fi
    if [[ -z $pct_thalia ]]; then pct_thalia="0"; fi
    if [[ $methyl == 0 && $unmeth == 0 ]]; then bet_thalia=0; else bet_thalia=$(echo "scale=3; $methyl/($methyl + $unmeth)" | bc -l); fi 
    if [[ -z $bet_thalia ]]; then bet_thalia="0"; fi
    echo -e "total\tunmap\tmethyl\tunmeth\tPCT_THALIANA\tTHALIANA_BETA" > thalia_summary.txt
    echo -e "$total\t$unmap\t$methyl\t$unmeth\t$pct_thalia\t$bet_thalia" >> thalia_summary.txt
 

Collect chromosomal ids

    cut -f 1 ~{refFai} | uniq | grep -v _ | grep -v M | grep ^chr

Collect chromosomal sizes

    grep -w ^~{chromosome} ~{refFai} | cut -f 2

Aggregate metrics collected by extractMedipsCount task

    set -euo pipefail   
     cat ~{sep=" " coverageWindowsFiles} > "coverage_windows.tmp"
     head -n 1 coverage_windows.tmp > coverage_windows.txt
     grep -v ^sample coverage_windows.tmp | awk '{sum1+=$1;sum2+=$2;sum3+=$3;sum4+=$4;sum5+=$5} END{OFS="\t";print sum1,sum2,sum3,sum4,sum5;}' >> coverage_windows.txt
 
     cat ~{sep=" " enrichmentDataFiles} > "enrichment_data.tmp"
     head -n 1 enrichment_data.tmp > enrichment_data.txt
     grep -v regions enrichment_data.tmp | awk '{sum1+=$2;sum2+=$3;sum3+=$4;sum4+=$5;sum5+=$6;sum6+=$7;sum7+=$8;sum8+=$9;sum9+=$10;sum10+=$11;sum11+=$12;sum12+=$13;sum13+=$14;count+=1}END{OFS="\t";print $1, $2, sum2, sum3, sum4, sum5/count, sum6/count, sum7/count, sum8/count, sum9/count, sum10/count, sum11/count, sum12/count, sum13/count}' >> enrichment_data.txt
 
     cat ~{sep=" " coverageCountsFiles} >  "coverage_counts.tmp"
     head -n 1 coverage_counts.tmp > coverage_counts.txt
     tail +1 coverage_counts.tmp | awk -F'\t' '{sum1+=$2;sum2+=$3} END{OFS="\t";print $1,sum1,sum2;}' >> coverage_counts.txt
 
     cat ~{sep=" " saturationMetricsFiles} > "saturation_metrics.tmp"
     head -n 1 saturation_metrics.tmp > saturation_metrics.txt
     grep -v maxEstCorReads saturation_metrics.tmp | awk '{sum1+=$2;sum2+=$3;sum3+=$4;sum4+=$5;count+=1}END{OFS="\t";print $1, sum1, sum2/count, sum3, sum4/count}' >> saturation_metrics.txt
 
     cat ~{sep=" " nameFiles} > "name.tmp"
     head -n 1 name.tmp > name.txt
     grep -v samples name.tmp | sort -u >> name.txt 

Extract Medips Counts

     reference=~{reference}
  
     if [[ $reference == hg19 ]]; then
       bsGenome=BSgenome.Hsapiens.UCSC.hg19
     elif [[ $reference == hg38 ]]; then
       bsGenome=BSgenome.Hsapiens.UCSC.hg38
     else
       echo "Unsupported Reference $reference"
       exit
     fi
  
     set -euo pipefail
       
     samtools view -h ~{dedupBam} ~{chromosome}:1-~{chromosomeLength} -b > "~{chromosome}.dedup.bam"
     READ_COUNT=$(samtools view ~{chromosome}.dedup.bam | wc -l)
 
     if (( $READ_COUNT < MIN_COUNT )); then
       touch coverage_windows.txt
       touch name.txt
       touch coverage_counts.txt
       touch enrichment_data.txt
       touch genome_count.txt
       touch MEDIPS_window_per_chr.csv
       touch saturation_metrics.txt
       touch medips.bed
     else
       Rscript MEDIPS_SCRIPT.R \
         --basedir . \
         --bamfile CHROM##.dedup.bam
         --samplename BASENAME \
         --BSgenome bsGenome \
         --chromosome CHROM## \
         --ws  WINDOW\
         --outdir .
       NAME=""
       count0=$(awk '$1 == 0' genome_count.txt | wc -l)
       count1=$(awk '$1 >= 1' genome_count.txt | wc -l)
       count10=$(awk '$1 >= 10' genome_count.txt | wc -l)
       count50=$(awk '$1 >= 50' genome_count.txt | wc -l)
       count100=$(awk '$1 >= 100' genome_count.txt | wc -l)
       echo -e "sample\tcount0\tcount1\tcount10\tcount50\tcount100" > coverage_windows.txt
       echo -e "$NAME\t$count0\t$count1\t$count10\t$count50\t$count100" >> coverage_windows.txt
       echo -e "samples\n~{basename}" > name.txt
       CONVERT2BED_EXECUTABLE -d --input wig < medips.wig > medips.bed
     fi
 

Prepare json report (Final Metrics):

     set -euo pipefail
     txt-to-json.py -n FILE_NAME \
                    -e ENRICHMENT_DATA \
                    -c COVERAGE_DATA \
                    -w COVERAGE_WINDOW \
                    -s SATURATION_DATA \
                    -d DEDUPLICATION_DATA \
                    -u GCBIAS_DATA \
                    -a ALIGNMENT_SUMMARY_DATA \
                    -t TALIA_SUMMARY

Support

For support, please file an issue on the Github project or send an email to gsi@oicr.on.ca .

Generated with generate-markdown-readme (https://github.com/oicr-gsi/gsi-wdl-tools/)