nfdi4plants/ARC-specification

Discussion: run.cwl in run folders

caroott opened this issue · 9 comments

The ARC specification states under Workflow description that tools and workflows used during computational analysis must be described in the workflows folder as .cwl files. The run description states that each run needs a corresponding run.cwl that describes how that exact run result is composed.

Due to the nature of CWL, this run.cwl may be unnecessary overhead or could be simplified. All necessary information about the run execution can be derived from the combination of the executed .cwl file and the run.yml. The run.yml is already located in the corresponding runs folder, so the only missing information is which CWL file was executed. If one were to create the run.cwl as stated in the specification, I have two possibilities in mind:

1. Wrap the executed tool or workflow in another workflow:

The disadvantage is the large overhead: all required inputs must be specified again in the wrapping workflow and mapped to the inputs of the tool/workflow, and the outputs must then be collected as usual in a workflow.

In the worst case, the run.cwl file is almost like a copy of the referenced workflow.cwl.

2. Create a tool CWL that executes the CWL runner with the given cwl and yml files:

Example:

cwlVersion: v1.2
class: CommandLineTool
baseCommand: [cwltool, ../../workflows/MyWorkflow/workflow.cwl, run.yml]
# a CommandLineTool must declare inputs, even if there are none
inputs: []
outputs:
  myOutput:
    type: Directory
    outputBinding:
      # this returns the whole working directory
      glob: $(runtime.outdir)

This way, it's just the executing command wrapped in a command line tool CWL. It returns the entire output directory, so as long as the executed workflow is well described, it should return everything as intended. It could only become difficult if expression tools are used at the end of a workflow to sort files. This adds only a small overhead and contains all required information.
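
For comparison, the first option (a wrapping workflow) might look like the sketch below. The input names `inputFile` and `threads` and the output name `results` are hypothetical; a real run.cwl would have to mirror whatever workflows/MyWorkflow/workflow.cwl actually declares, which is where the overhead comes from.

```yaml
cwlVersion: v1.2
class: Workflow

# Needed when the wrapped file is itself a workflow.
requirements:
  SubworkflowFeatureRequirement: {}

# Every input of the wrapped workflow is declared again here
# (names and types are hypothetical).
inputs:
  inputFile: File
  threads: int

steps:
  analysis:
    run: ../../workflows/MyWorkflow/workflow.cwl
    in:
      inputFile: inputFile
      threads: threads
    out: [results]

# Outputs are collected from the step, as in any workflow.
outputs:
  results:
    type: Directory
    outputSource: analysis/results
```

Every additional input or output of the wrapped workflow adds matching lines in `inputs`, `steps` and `outputs`, so the wrapper grows with the workflow it wraps.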

Since the only information we require is which workflow/tool is executed, can we maybe find a better way to represent that information? Or do we want to stick with the run.cwl and recommend the example I posted? Or do we want to recommend wrapping everything in one workflow again?


Edit: links, format, small adjustments

Reading through this again: this question is specific to the case where a run executes a workflow, right? If the run is self-contained, the run.cwl is too?

That depends. I interpreted the ARC specification to mean that every computational step should be described as either a tool or a workflow description and saved in the workflows folder. That wouldn't allow for self-contained runs, unless the run requires no computational steps.

This was simply due to a mistake: run.cwl is meant to be run.yml. The idea is that under workflows you find the more reusable part, and a run is defined by its specific run parameters, especially the concrete inputs and outputs!
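
As an illustration of that split, a run.yml job file would carry only the concrete values for one run. The parameter names and the file path below are hypothetical and would have to match the inputs declared by the workflow in the workflows folder.

```yaml
# runs/MyRun/run.yml (hypothetical job file)
# Keys must match the input names declared in the executed workflow.cwl.
inputFile:
  class: File
  path: ../../assays/MyAssay/dataset/measurements.csv
threads: 4
```

The job file says nothing about which workflow it belongs to; that link has to come from somewhere else.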

To add to this issue, after a discussion we had:
We have no way of telling how a run is intended to be executed unless it is actually executed and a run report is generated in some way. So we need a way to declare the intention: which combination of cwl and yml file should be executed for the specific run.

Originally, there was the arc.cwl in the root, which should execute the whole ARC upon running. This was dropped for ease of use and to not overcomplicate things as I understood it. This would be one possibility to get the connection of workflow/tool file and jobfile for a run. The other possibility would be the example I posted above:

cwlVersion: v1.2
class: CommandLineTool
baseCommand: [cwltool, ../../workflows/MyWorkflow/workflow.cwl, run.yml]
# a CommandLineTool must declare inputs, even if there are none
inputs: []
outputs:
  myOutput:
    type: Directory
    outputBinding:
      # this returns the run's directory inside the output directory (myDir is a placeholder)
      glob: $(runtime.outdir)/myDir

One of those two possibilities, or a third one that handles it, should be implemented to get that connection info. It would also be useful to get input from other people working with ARCs on what they prefer for ease of use. What do you think about this issue, @Brilator and @floWetzels?

Do I understand the question correctly: how do we document which "run.yml" + "workflow.cwl" combination yields which output?
The way I currently do it is similar to the above: having a README in the respective runs folder with something like
cwltool ../../workflows/MyWorkflow/workflow.cwl run.yml

Plus, I was planning to collect the overall ARC analysis / workflows with one arc.cwl in the root (currently more for visualization of the inputs and outputs).
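
A root-level arc.cwl of that kind could be a workflow whose steps point at the individual workflows. The sketch below is an assumption about how it might look, not something taken from the specification; the step names, the second workflow folder `MyOtherWorkflow`, and all input and output names are hypothetical.

```yaml
cwlVersion: v1.2
class: Workflow

# Needed when the referenced files are themselves workflows.
requirements:
  SubworkflowFeatureRequirement: {}

inputs:
  rawData: File

steps:
  # Each step references one description from the workflows folder.
  preprocess:
    run: workflows/MyWorkflow/workflow.cwl
    in:
      inputFile: rawData
    out: [results]
  summarize:
    run: workflows/MyOtherWorkflow/workflow.cwl
    in:
      resultsDir: preprocess/results
    out: [report]

outputs:
  report:
    type: File
    outputSource: summarize/report
```

For the visualization use case, `cwltool --print-dot arc.cwl` prints a Graphviz description of such a workflow graph.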

Yes, that's the question here. The arc.cwl in the root you mention would be the first case, with an arc.cwl that executes the whole run. A README in the runs folder also solves the question, at least for the user reading the ARC. The problem there would be how we ensure that it follows a specific format and is also machine-readable, so we can include it in the ARC data model.

Yes, I meant to confirm that my non-machine-readable solution was aiming in the same direction.

Not sure about your outputs being bound to a directory. Or is this just one example that one would have to adapt for other workflows?

This output would vary between runs. Each run.cwl would have the directory where that run is stored written there.
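
One way to keep the per-run difference in a single place, sketched here as an assumption rather than as part of the proposal, is to make the directory name an input with a default. The input name `runDir` is hypothetical, and `myDir` remains the placeholder from the example above.

```yaml
cwlVersion: v1.2
class: CommandLineTool
baseCommand: [cwltool, ../../workflows/MyWorkflow/workflow.cwl, run.yml]

inputs:
  # Hypothetical input: only this default would differ between runs.
  # It has no inputBinding, so it is not added to the command line.
  runDir:
    type: string
    default: myDir

outputs:
  myOutput:
    type: Directory
    outputBinding:
      # glob patterns are resolved relative to the output directory
      glob: $(inputs.runDir)
```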

For now, I would add version 2 (the command line tool run.cwl) to the ARC specification. This gives us a way to accurately identify both the intention of a run execution and the run execution itself. If a better solution comes up in the future, this can be changed again.