/sexcmdWorkflow

wdl wrapper for SEXCMD tool. Identification of donor's sex using fastq inputs

Primary LanguageWDLMIT LicenseMIT

sexcmd

SEXCMD workflow relies on sex-specific markers to determine the sex of the analyzed sample. Forked from SEXCMD repo by Seongmun Jeong The workflow will produce a report which provides a prediction of sex of the input sample. Accuracy estimate is provided unless sex is UNDEFINED. SEXCMD will not predict sex if there is not enough information (reads aligned to sex-specific loci).

Dependencies

Usage

Cromwell

java -jar cromwell.jar run sexcmd.wdl --inputs inputs.json

Inputs

Required workflow parameters:

Parameter Value Description
inputFastq File input fastq file
inputBam File Input alignments in .bam format for the same fastq file
outputFileNamePrefix String Output file prefix
sexcmdReport.sequencingType Int 1:Whole Exome Seq, 2:RNA-Seq, 3:Whole Genome Seq

Optional task parameters:

Parameter Value Default Description
makeStats.jobMemory Int 6 Memory (GB) allocated for this job
makeStats.timeout Int 4 Number of hours before task timeout
makeStats.modules String "samtools/1.14" Modules needed to run SEXCMD
makeStats.outputPrefix String "STATREPORT" Custamizable output file name prefix
sexcmdReport.referenceFasta String "$HG38_SEXCMD_RES_ROOT/sex_marker_filtered.hg38.final.fasta" Marker reference file, comes from a resource module
sexcmdReport.outputFilePrefix String "SEXCMD" Prefix of the report file to be provisioned
sexcmdReport.sexModelType String "XY" Sex determination system type : XY or ZW
sexcmdReport.jobMemory Int 10 Memory (GB) allocated for this job
sexcmdReport.timeout Int 4 Number of hours before task timeout
sexcmdReport.modules String "hg38-sexcmd-res/1.0 sexcmd/1.0" Modules needed to run SEXCMD
estimateAccuracy.minLength Int 80 Hard threshold for minimum length, if average read length is less we have low accuracy (<60%)
estimateAccuracy.jobMemory Int 4 Memory (GB) allocated for this job
estimateAccuracy.timeout Int 4 Number of hours before task timeout

Outputs

Output Type Description
reportFile File Report generated by the tool

Commands

This section lists command(s) run by SEXCMD workflow

  • Running SEXCMD workflow

The workflow accepts a single fastq file which may be any of R1 or R2 files for paired reads sequencing. In a case when the workflow fails to find enough information to deduce the sex of the donor it will still generate a report which would say that sex is UNDEFINED

Generate report


   set -euo pipefail
   samtools stats INPUT_BAM | grep ^SN > OUTPUT.stats
   

Generate SEXCMD report

 
   touch PREFIX.OUTPUT
   Rscript  SEXCMD.R 
           -m REFERENCE_FASTA
           -t SEQUENCING_TYPE
           -s SEX_MODEL
           -f INPUT_FASTQ
           -p OUTPUT_PREFIX
 
   python3 snippet:
 
   outputFile = "PREFIX.OUTPUT"
   reportString = "Sex_Determination\tRatio of X and Y counts is NA\tSex of this sample BASENAME_OF_FASTQ is UNDEFINED\n"
   numberOfLines = 0
 
   with open(outputFile, "r") as inp:
       numberOfLines = len(inp.readlines())
   inp.close()
 
   if numberOfLines == 0:
       """Write down the string"""
       with open(outputFile, "w") as m:
           m.writelines(reportString)
       m.close()
 

Generate final report with Accuracy appended

 
   touch ~{outputFileNamePrefix}.report
 
   python3 snippet:
 
     coefficients = {"unmapped ratio": {"intercept": 0.9346, "slope": -0.3658},
                     "insert size standard deviation:": {"intercept": 0.9068, "slope": -0.0001}}
 
     def convert_value(value: str):
         try:
             return int(value)
         except ValueError:
             return float(value)
         print(f'Unable to convert [{value}], recording as-is')
         return value
 
     metrics = {}
     with open("~{bamStatFile}", "r") as samstats:
             data_lines = [line.strip() for line in samstats if line.startswith('SN')]
             for line in data_lines:
                 fields = line.split("\t")
                 if len(fields) >= 3:
                     metrics[fields[1]] = convert_value(fields[2])
 
     if "reads unmapped:" in metrics.keys() and "sequences:" in metrics.keys() and metrics["sequences:"] > 0:
         metrics["unmapped ratio"] = round(metrics["reads unmapped:"]/metrics["sequences:"], 3)
 
     accuracy = 1.0
     if "average length:" in metrics.keys() and metrics["average length:"] <= ~{minLength}:
         accuracy = 0
     else:
         for c in coefficients.keys():
             if c in metrics.keys():
                 estimated = coefficients[c]["intercept"] + metrics[c]*coefficients[c]["slope"]
                 if estimated < accuracy:
                     accuracy = round(estimated, 3)
 
     with open("~{sexcmdReport}", "r") as sexinfo:
         infolines = [line for line in sexinfo]
 
     if accuracy > 0:
         infolines.append(f'Estimated Accuracy is ~ {accuracy*100}%\n')
     else:
         infolines.append("Estimated Accuracy is < 60%\n")
 
     with open("~{outputFileNamePrefix}.report", "w") as out:
         out.writelines(infolines)
 

Support

For support, please file an issue on the Github project or send an email to gsi@oicr.on.ca .

Generated with generate-markdown-readme (https://github.com/oicr-gsi/gsi-wdl-tools/)