/analysis-wdls

Early stages of converting genome/analysis-workflows from CWL to WDL

Next Major Steps

Our main concern so far has been converting the CWL workflows to WDL. The next steps are optimizing the workflows and cleaning up the repository.

In the future we may rework the structure of this repository to a format that Dockstore supports and leverage that tool.

Common Errors

Out of Space

These errors indicate that disk storage has filled up. That part is straightforward. The pitfall is that the disk size you need to change depends on where the failed write was going.

If the failed write happened under the /cromwell_root path, then disks: "local-disk ..." needs to be increased. However, if the failed write happened while writing to stdin or stdout, or any of the other standard Linux-y places, then you'll need to increase the value of bootDiskSizeGb. Cromwell in GCP mounts at least two disks: the boot disk and a local-disk. The boot disk holds all the operating system files, while local-disk is where almost all of your "work" is going to happen, besides piping between commands.
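
As a rough sketch of where these two settings live, the task below sets both in its runtime block. The task name, Docker image, and sizes are hypothetical placeholders, not values taken from this repository; only disks and bootDiskSizeGb come from the description above.

```wdl
version 1.0

# Hypothetical task for illustration; name, image, and sizes are assumptions.
task example_sort {
  input {
    File bam
  }

  # Scale the work disk with the input. size() returns a Float, so round up.
  Int space_needed_gb = 10 + ceil(3 * size(bam, "GB"))

  command <<<
    # Output lands in the working directory under /cromwell_root,
    # so it counts against local-disk.
    samtools sort -o sorted.bam ~{bam}
  >>>

  runtime {
    docker: "example/samtools:latest"            # placeholder image
    disks: "local-disk ~{space_needed_gb} HDD"   # raise for failed writes under /cromwell_root
    bootDiskSizeGb: 20                           # raise for failed writes anywhere else
  }

  output {
    File sorted_bam = "sorted.bam"
  }
}
```

Deriving the local-disk size from the input size is one possible design choice; a fixed number works just as well when input sizes are predictable.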

File missing

This applies more to newly converted files than hardened ones, but many runs have failed because a file wasn't included in the instance. Generally, this happens because the CWL did not specify a secondaryFile that it assumed would exist next to the passed-in file. This works on the cluster, because the tools just look for the file and it already sits where it's expected. It does not work on the cloud, because that file is never sent to the instance. The solution is to add the file as an explicit input to the WDL and pass it through, top down.
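
A minimal sketch of that fix, assuming a BAM whose index was previously an implicit secondaryFile. The task, workflow, input names, and image are hypothetical and only illustrate the pattern of declaring the extra file at every level.

```wdl
version 1.0

# Hypothetical task; names and image are illustrative assumptions.
task example_idxstats {
  input {
    File bam
    File bam_bai  # declared explicitly so it is sent to the instance too
  }

  command <<<
    # Localized inputs are not guaranteed to share a directory,
    # so link the index next to the BAM where the tool looks for it.
    ln -s ~{bam} input.bam
    ln -s ~{bam_bai} input.bam.bai
    samtools idxstats input.bam > idxstats.txt
  >>>

  runtime {
    docker: "example/samtools:latest"  # placeholder image
  }

  output {
    File stats = "idxstats.txt"
  }
}

# The workflow accepts the index as well and passes it down.
workflow example_qc {
  input {
    File bam
    File bam_bai
  }

  call example_idxstats {
    input:
      bam = bam,
      bam_bai = bam_bai
  }

  output {
    File stats = example_idxstats.stats
  }
}
```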

CommandException: No URLs matched

This has one of two causes. Either (A) the input URL is malformed or otherwise incorrect, or (B) the specified file was not uploaded to the bucket. Both are instances of the general version of the error, "No file has been uploaded to the specified URL".

Differences from CWL

The last confirmed mirror of the analysis-workflows CWL repo was at commit 788bdc99c1d5b6ee7c431c3c011eb30d385c1370, PR #1063, Apr 6, 2022. Commits after that point may deviate unless they have been compared. Update these values whenever a comparison is done.

Directory types must be a zip file, or Array[File]

There is not yet a supported Directory type in WDL. Directories with a nested structure, like Directory vep_cache_dir, are replaced with a zip file, File vep_cache_dir_zip. Directories that are just a flat collection of files, like Directory hla_call_files, are replaced with Array[File] hla_call_files.
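
The sketch below shows how a task might consume both replacements. The two input names come from the text above; the task name, image, and shell commands are assumptions, and the image is assumed to provide unzip.

```wdl
version 1.0

# Hypothetical task; only the two input names come from the text above.
task example_use_directories {
  input {
    File vep_cache_dir_zip      # nested directory, shipped as a zip
    Array[File] hla_call_files  # flat directory, shipped as a list of files
  }

  command <<<
    # Restore the nested directory structure from the zip.
    mkdir vep_cache
    unzip -q ~{vep_cache_dir_zip} -d vep_cache

    # Gather the flat files back into a single directory.
    # Assumes the array is non-empty; cp fails with no sources.
    mkdir hla_calls
    cp ~{sep=" " hla_call_files} hla_calls/

    ls -R vep_cache hla_calls > listing.txt
  >>>

  runtime {
    docker: "example/unzip-tools:latest"  # placeholder image
  }

  output {
    File listing = "listing.txt"
  }
}
```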

Input files must prefix arguments with the name of the workflow

Input files must prefix each argument with the name of the workflow it is meant for, because a WDL file can contain multiple workflows, and because inputs that aren't propagated through in the definition can still be set on a call one layer down. For example, to call workflow somaticExome with input foo, the YAML key must be somaticExome.foo.
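
To make the naming concrete, here is a deliberately minimal workflow reusing the somaticExome and foo names from the example above. The body is a hypothetical stand-in, not the real workflow. The trailing comment shows the key a matching inputs file would use.

```wdl
version 1.0

# Hypothetical minimal workflow; only the names matter here.
workflow somaticExome {
  input {
    String foo
  }

  output {
    String foo_out = foo
  }
}

# A matching inputs file addresses the input by its fully qualified name:
#   somaticExome.foo: "some value"
```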

If the WDLs are run via the cloud-workflows/scripts/cloudize-workflow.py helper script, the generated input file already handles this.

Conversions

Pipelines

  • alignment_exome
  • alignment_exome_nonhuman
  • alignment_umi_duplex # this depends on a thing with non-trivial embedded javascript
  • alignment_umi_molecular # this depends on a thing with non-trivial embedded javascript
  • alignment_wgs
  • alignment_wgs_nonhuman
  • aml_trio_cle
  • aml_trio_cle_gathered # This doesn't make sense in cloud
  • bisulfite
  • chipseq # This depends on homer-tag-directory, doesn't make sense in cloud
  • chipseq_alignment_nonhuman # This depends on homer-tag-directory, doesn't make sense in cloud
  • detect_variants
  • detect_variants_nonhuman
  • detect_variants_wgs
  • downsample_and_recall
  • gathered_downsample_and_recall # This doesn't make sense in cloud
  • germline_exome
  • germline_exome_gvcf
  • germline_exome_hla_typing
  • germline_wgs
  • germline_wgs_gvcf
  • immuno
  • rnaseq
  • rnaseq_star_fusion
  • rnaseq_star_fusion_with_xenosplit
  • somatic_exome
  • somatic_exome_cle
  • somatic_exome_cle_gathered # This doesn't make sense in cloud
  • somatic_exome_gathered # This doesn't make sense in cloud
  • somatic_exome_nonhuman
  • somatic_wgs
  • tumor_only_detect_variants
  • tumor_only_exome
  • tumor_only_wgs

Subworkflows

  • align
  • align_sort_markdup
  • bam_readcount
  • bam_to_trimmed_fastq_and_hisat_alignments
  • bgzip_and_index
  • bisulfite_qc
  • cellranger_mkfastq_and_count
  • cnvkit_single_sample
  • cram_to_bam_and_index
  • cram_to_cnvkit
  • docm_cle
  • docm_germline
  • duplex_alignment
  • filter_vcf
  • filter_vcf_nonhuman
  • fp_filter
  • gatk_haplotypecaller_iterator
  • germline_detect_variants
  • germline_filter_vcf
  • hs_metrics
  • joint_genotype
  • merge_svs
  • molecular_alignment
  • molecular_qc
  • mutect
  • phase_vcf
  • pindel
  • pindel_cat
  • pindel_region
  • pvacseq
  • qc_exome
  • qc_exome_no_verify_bam
  • qc_wgs
  • qc_wgs_nonhuman
  • sequence_align_and_tag_adapter
  • sequence_to_bqsr
  • sequence_to_bqsr_nonhuman
  • sequence_to_trimmed_fastq
  • sequence_to_trimmed_fastq_and_biscuit_alignments
  • single_cell_rnaseq
  • single_sample_sv_callers
  • strelka_and_post_processing
  • strelka_process_vcf
  • sv_depth_caller_filter
  • sv_paired_read_caller_filter
  • umi_alignment
  • varscan
  • varscan_germline
  • varscan_pre_and_post_processing
  • vcf_eval_cle_gold
  • vcf_eval_concordance
  • vcf_readcount_annotator

Tools

  • add_strelka_gt
  • add_string_at_line
  • add_string_at_line_bgzipped
  • add_vep_fields_to_table
  • agfusion
  • align_and_tag
  • annotsv
  • annotsv_filter
  • apply_bqsr
  • bam_readcount
  • bam_to_bigwig
  • bam_to_cram
  • bam_to_fastq
  • bam_to_sam
  • bcftools_merge
  • bedgraph_to_bigwig
  • bedtools_intersect
  • bgzip
  • biscuit_align
  • biscuit_markdup
  • biscuit_pileup
  • bisulfite_qc_conversion
  • bisulfite_qc_coverage_stats
  • bisulfite_qc_cpg_retention_distribution
  • bisulfite_qc_mapping_summary
  • bisulfite_vcf2bed
  • bqsr
  • call_duplex_consensus
  • call_molecular_consensus
  • cat_all
  • cat_out
  • cellmatch_lineage
  • cellranger_atac_count
  • cellranger_count
  • cellranger_feature_barcoding
  • cellranger_mkfastq
  • cellranger_vdj
  • cle_aml_trio_report_alignment_stat
  • cle_aml_trio_report_coverage_stat
  • cle_aml_trio_report_full_variants
  • clip_overlap
  • cnvkit_batch
  • cnvkit_vcf_export
  • cnvnator
  • collect_alignment_summary_metrics
  • collect_gc_bias_metrics
  • collect_hs_metrics
  • collect_insert_size_metrics
  • collect_wgs_metrics
  • combine_gvcfs
  • combine_variants
  • combine_variants_concordance
  • combine_variants_wgs
  • concordance
  • cram_to_bam
  • docm_add_variants
  • docm_gatk_haplotype_caller
  • downsample
  • duphold
  • duplex_seq_metrics
  • eval_cle_gold
  • eval_vaf_report
  • extract_hla_alleles
  • extract_umis
  • fastq_to_bam
  • filter_consensus
  • filter_known_variants
  • filter_sv_vcf_blocklist_bedpe
  • filter_sv_vcf_depth
  • filter_sv_vcf_read_support
  • filter_sv_vcf_size
  • filter_vcf_cle
  • filter_vcf_coding_variant
  • filter_vcf_custom_allele_freq
  • filter_vcf_depth
  • filter_vcf_docm
  • filter_vcf_mapq0
  • filter_vcf_somatic_llr
  • fix_vcf_header
  • fp_filter
  • gather_to_sub_directory
  • gatherer
  • gatk_genotypegvcfs
  • gatk_haplotype_caller
  • generate_qc_metrics
  • germline_combine_variants
  • grolar
  • group_reads
  • hisat2_align
  • hla_consensus
  • homer_tag_directory # This doesn't make sense in cloud
  • index_bam
  • index_cram
  • index_vcf
  • intersect_known_variants
  • interval_list_expand
  • intervals_to_bed
  • kallisto
  • kmer_size_from_index
  • manta_somatic
  • mark_duplicates_and_sort
  • mark_illumina_adapters
  • merge_bams
  • merge_bams_samtools
  • merge_vcf
  • mutect
  • name_sort
  • normalize_variants
  • optitype_dna
  • picard_merge_vcfs
  • pindel
  • pindel2vcf
  • pindel_somatic_filter
  • pizzly
  • pvacbind
  • pvacfuse
  • pvacseq
  • pvacseq_combine_variants
  • pvacvector
  • read_backed_phasing
  • realign
  • remove_end_tags
  • rename
  • replace_vcf_sample_name
  • samtools_flagstat
  • samtools_sort
  • select_variants
  • sequence_align_and_tag
  • sequence_to_bam # this uses non-trivial embedded javascript
  • sequence_to_fastq
  • set_filter_status
  • single_sample_docm_filter
  • smoove
  • somatic_concordance_graph
  • sompy
  • sort_vcf
  • split_interval_list
  • split_interval_list_to_bed
  • staged_rename
  • star_align_fusion
  • star_fusion_detect
  • strandedness_check
  • strelka
  • stringtie
  • survivor
  • transcript_to_gene
  • trim_fastq
  • umi_align
  • variants_to_table
  • varscan_germline
  • varscan_process_somatic
  • varscan_somatic
  • vcf_expression_annotator
  • vcf_readcount_annotator
  • vcf_sanitize
  • vep
  • verify_bam_id
  • vt_decompose
  • xenosplit