common-workflow-lab/wdl-cwl-translator

Some WDL Maps are not handled

dinithiW opened this issue · 8 comments

Currently, the WDL input compound type Map[] is not handled by the translator. (https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#mapp-y)

A workflow using Maps is given in this workflow (https://github.com/biowdl/tasks/blob/bc1bacf11498d2d30b85591cfccdcf71ef0966a5/star.wdl#L144)

edit by @mr-c , we now correctly translate non-input variables of type Map as demonstrated by #189 but I can't find any examples of real-world WDL workflows with inputs of type Map ; currently https://github.com/broadinstitute/warp/blob/b09880a71e3d3e42fa4b544d03aea23c0a246efc/pipelines/broad/dna_seq/germline/joint_genotyping/by_chromosome/JointGenotypingByChromosomePartOne.wdl#L46 does not translate, but that is probably more to due with the use of the Set type

I am looking at this, and I am not exactly sure what I should do. Is it just to create a translation from WDL to CWL like here or am I missing something obvious?

Oh I should create a class for map, is that it?

mr-c commented

@RaOneG Pardon my delay

In this instance,

     Map[String, String] samOutputNames = {"BAM SortedByCoordinate": "sortedByCoord.out.bam"}

is not in the WDLinputs section, it is a private declaration.

So to convert this example we would need to figure out how to represent the only other place samOutputNames is used. Here it is in the outputs section, to calculate the name of the output file:

   File bamFile = outFileNamePrefix + "Aligned." +  samOutputNames[outSAMtype]

So the strategy for Map used outside the inputs section (and not used as the type of one of the outputs) is to

  1. Convert that private declaration to Javascript. In this example that would be
var samOutputNames = Map([ ["BAM SortedByCoordinate", "sortedByCoord.out.bam"] ];
  1. Then continue as normal.
    For our example, that would be producing
${
var samOutputNames = new Map([ ["BAM SortedByCoordinate", "sortedByCoord.out.bam"] ]);
return inputs.outFileNamePrefix + "Aligned." + samOutputNames.get(inputs.outSAMtype);
}

for the output_glob at

outputBinding=cwl.CommandOutputBinding(glob=output_glob),

However, the Javascript Map type is not part of ECMAScript 5.1, which is the version of Javascript that CWL v1.x Expressions use

While in this instance we can rely on the Map like qualities of object in Javascript because all the keys in samOutputNames are strings, there may be regular uses of WDL's Map type that use other types of keys, like numbers.

So for now we can use a Javascript Object instead:

${
var samOutputNames = {"BAM SortedByCoordinate": "sortedByCoord.out.bam"};`
return inputs.outFileNamePrefix + "Aligned." + samOutputNames[inputs.outSAMtype];
}

But eventually we will have to find another solution like adapting this polyfill that brings the ECMAScript 6 Map to ECMAscript 5: https://github.com/zloirock/core-js/blob/v2.6.12/modules/es6.map.js

Hi @mr-c,
I am sorry but this is a little difficult for me to understand in addition to I have no prior knowledge of Javascript. Is there something I can learn/read to get some more in-depth knowledge about what you mentioned above apart from the Javascript part?

mr-c commented

Hello @RaOneG
A very basic level of JavaScript understanding will be necessary. I personally know very little Javascript, instead I constantly do web searches to remind myself of the correct syntax. I literally Google searched for "if then else javascript" the other week!

As for my comments on this issue, you can ignore everything I wrote about "polyfill"s and ECMAScript Map.

mr-c commented

I asked in the OpenWDL slack for a real-world example of a Map input https://openwdl.slack.com/archives/CT0QEK1V0/p1646937620784419 ; pending a real-world usage or user request, I think this is now a low-priority issue

mr-c commented

@dpark01 writes

so from a high level, the reason I find myself using read_json or Map structures in general in WDL is when I'm

  • scattering on X
  • need to track/coordinate metadata on a per-X basis, either post-gather or even inside the scatter block

https://github.com/broadinstitute/viral-pipelines/blob/b4010419ac2710b73e6c70be491e43778b1625f8/pipes/WDL/tasks/tasks_ncbi_tools.wdl#L433-L519

this particular example task is an interesting one where I take a bunch of same-length Array inputs and basically zip them together and then deduplicate them based on one of the strings: that is, it's assumed that the biosamples input array may contain non-unique values, so the output of this task produces an output Array of unique biosamples and then a bunch of Maps that let the workflow code query metadata inside a scatter block for each biosample.

https://github.com/broadinstitute/viral-pipelines/blob/bd0d777617229e61dd9d659054f99e568820295b/pipes/WDL/workflows/sarscov2_illumina_full.wdl#L105-L111

in an unrelated example, here's a workflow that calls other tasks that emit Map outputs and then looks up values in that Map in the middle of a scatter block in order to know what to do differently with each sample: