fairagro/M4.4_UC6_ARC

CWLMake


  • Survey prior art, e.g. check https://github.com/tom-tan/zatsu-cwl-generator.
  • A minimal templating system to make CWL easier to write.
  • Possibly extensible to equip CWL with provenance information (Linked Data).
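
To make the templating idea more concrete, here is a minimal sketch in Python using only the standard library. The template text, the function names (`render_inputs`, `render_tool`) and the choice to pass the script as a File input at position 1 are assumptions for illustration, not an agreed design.

```python
from string import Template

# Hypothetical template for a CommandLineTool that runs one script.
# The script is passed as a File input at position 1 so it gets staged.
TOOL_TEMPLATE = Template("""\
cwlVersion: v1.2
class: CommandLineTool

inputs:
  script:
    type: File
    default:
      class: File
      path: $script
    inputBinding:
      position: 1
$inputs
outputs:
  output_file:
    type: File
    outputBinding:
      glob: $output_glob

baseCommand: $interpreter
""")


def render_inputs(inputs):
    """Render (name, cwl_type) pairs as positional inputs, starting at 2."""
    lines = []
    for position, (name, cwl_type) in enumerate(inputs, start=2):
        lines += [
            f"  {name}:",
            f"    type: {cwl_type}",
            "    inputBinding:",
            f"      position: {position}",
        ]
    return "\n".join(lines)


def render_tool(interpreter, script, inputs, output_glob):
    """Fill the template and return the CWL document as text."""
    return TOOL_TEMPLATE.substitute(
        interpreter=interpreter,
        script=script,
        inputs=render_inputs(inputs),
        output_glob=output_glob,
    )


if __name__ == "__main__":
    print(render_tool("python3", "./print.py",
                      [("message", "string")], "helloworld.txt"))
```

A real implementation would probably build the document with a YAML library instead of string substitution; the sketch only shows how little the user would have to specify.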

Prior art
Besides https://github.com/tom-tan/zatsu-cwl-generator (last commit 2 years ago) there are also:

Rabix Composer, a tool for visual editing of CWL files (last commit 3 years ago), which uses CWL-SVG (last commit 1 year ago). Composer is now developed as a closed-source tool for sbgenomics.

Various libraries for Python, JS/TS, C++, C#/F#, D, Java and R, most of them auto-generated from the specs.

Comparison between workflow languages, each executing a simple Python script that takes a string parameter and writes an output file:

CWL:
cwlVersion: v1.2
class: CommandLineTool

requirements:
  InitialWorkDirRequirement:
    listing:
    - entryname: print.py
      entry:
        $include: ./print.py

inputs:
  message:
    type: string
    default: Hello World
    inputBinding:
      position: 1

outputs:
  output_file:
    type: File
    outputBinding:
      glob: helloworld.txt

baseCommand: [python3, ./print.py]

CWL (alternative):
cwlVersion: v1.2
class: CommandLineTool

inputs:
  file:
    type: File
    default:
      class: File
      path: ./print.py
    inputBinding:
      position: 1
  message:
    type: string
    default: Hello World
    inputBinding:
      position: 2

outputs:
  output_file:
    type: File
    outputBinding:
      glob: helloworld.txt

baseCommand: python3

Nextflow:
process helloworld{
  input:
  val greeting

  output:
  file 'helloworld.txt'

  script: 
  """ python3 ${projectDir}/print.py  \"${greeting}\" """
}

params.greeting = "Hello World"

workflow {
  helloworld(params.greeting)
}

Snakemake:
rule HelloWorld:
    input:
        thefile="input.txt"

    output: 
        "helloworld.txt"

    shell:
        "greeting=\"$(cat {input.thefile})\" && "
        "python3 ./print.py \"$greeting\""

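The script `print.py` itself is not shown in this thread. Judging from the listings above (one string argument, output file `helloworld.txt`), it could look roughly like the following sketch; the function name `write_greeting` and the exact file contents are assumptions.

```python
import sys
from pathlib import Path


def write_greeting(message, out_path="helloworld.txt"):
    """Write the message to out_path (file name taken from the listings)."""
    Path(out_path).write_text(message + "\n", encoding="utf-8")
    return out_path


if __name__ == "__main__":
    # The workflow engines pass the greeting as the first argument.
    write_greeting(sys.argv[1] if len(sys.argv) > 1 else "Hello World")
```
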
Currently the CommandLineTools are generated by a set of Python scripts that apply regular expressions to the R files, combined with some hints placed in the comments of those files, which was the interim solution to complete
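
As an illustration of the regex-plus-comment-hints approach, here is a small Python sketch. The hint syntax (`# cwl:input name: type`) is made up for this example and is not the format used by the actual scripts.

```python
import re

# Hypothetical hint syntax inside the R file, e.g. "# cwl:input threshold: float".
HINT = re.compile(
    r"^#[ \t]*cwl:input[ \t]+(\w+)[ \t]*:[ \t]*(\w+)[ \t]*$",
    re.MULTILINE,
)


def extract_inputs(r_source):
    """Return {name: CWL input definition} for every hint comment found."""
    inputs = {}
    for position, (name, cwl_type) in enumerate(HINT.findall(r_source), start=1):
        inputs[name] = {"type": cwl_type, "inputBinding": {"position": position}}
    return inputs


# Made-up R script used only to exercise the parser.
R_SOURCE = """\
# cwl:input infile: File
# cwl:input threshold: float
args <- commandArgs(trailingOnly = TRUE)
"""

if __name__ == "__main__":
    for name, definition in extract_inputs(R_SOURCE).items():
        print(name, definition)
```

The resulting dictionary has the shape of the `inputs` section of a CommandLineTool and could be fed into a template.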

One could also use libraries such as BaklavaJS as a node graph frontend to create workflows, similar to the deprecated Composer app ...

I did a quick (local) prototype yesterday. It loads CWL CommandLineTools from disk (using the FileSystemHandle API) and adds them as nodes. One could export the graph as a Workflow, I think; this is not implemented though.
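
The export step that is not implemented yet could work roughly as follows. This is a Python sketch under assumptions, not code from the prototype: the node and edge structures and the name `graph_to_workflow` are hypothetical, and unconnected input ports are simply exposed as workflow inputs of type `Any`.

```python
import json


def graph_to_workflow(nodes, edges):
    """Turn a node graph into a CWL Workflow document (as a dict).

    nodes: {step_id: {"run": tool_path, "in": [port names], "out": [port names]}}
    edges: [(src_step, src_port, dst_step, dst_port), ...]
    """
    wired = {(dst, dst_port): f"{src}/{src_port}"
             for src, src_port, dst, dst_port in edges}
    workflow_inputs = {}
    steps = {}
    for step_id, node in nodes.items():
        step_in = {}
        for port in node["in"]:
            source = wired.get((step_id, port))
            if source is None:
                # Unconnected port: expose it as a workflow-level input.
                source = f"{step_id}_{port}"
                workflow_inputs[source] = "Any"
            step_in[port] = source
        steps[step_id] = {"run": node["run"], "in": step_in,
                          "out": list(node["out"])}
    return {"cwlVersion": "v1.2", "class": "Workflow",
            "inputs": workflow_inputs, "outputs": {}, "steps": steps}


if __name__ == "__main__":
    nodes = {
        "a": {"run": "a.cwl", "in": ["message"], "out": ["result"]},
        "b": {"run": "b.cwl", "in": ["infile"], "out": ["report"]},
    }
    edges = [("a", "result", "b", "infile")]
    # JSON is also valid YAML, so the dump can be saved as a .cwl file.
    print(json.dumps(graph_to_workflow(nodes, edges), indent=2))
```
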

There is also CWL-SVG, which is used by the Composer app, but issues from 2018 are still open and unanswered, and the standalone sample no longer works because a package registry it relies on no longer exists.

See

  • #30
  • fairagro/m4.4_concept#25

There is also WDL (Workflow Description Language), for which converters to CWL already seem to exist (https://github.com/common-workflow-lab/wdl-cwl-translator, last commit yesterday), and a huge number of tools is available on Dockstore. Both CWL and WDL are supported by the Toil runner: https://toil.readthedocs.io/en/latest/
OpenWDL released version 1.2 of their spec earlier this year. The syntax looks as if CWL and Nextflow had children ^^
🤔

version 1.2

task hello_task {
  input {
    File infile
    String pattern
  }

  command <<<
    grep -E '~{pattern}' '~{infile}'
  >>>

  requirements {
    container: "ubuntu:latest"
  }

  output {
    Array[String] matches = read_lines(stdout())
  }
}

workflow hello {
  input {
    File infile
    String pattern
  }

  call hello_task {
    infile, pattern
  }

  output {
    Array[String] matches = hello_task.matches
  }
}

WDL or Nextflow?

Jens will draw up a pro and contra list. Decision also before the next meeting.

CWL vs WDL vs Nextflow

tl;dr: CWL is [suboptimal/verbose/not used] but seems to be the best tool for our use case

Numbers

| | CWL | WDL | Nextflow |
|---|---|---|---|
| GitHub | | | |
| # of GitHub repos | 1k | 1k | 5k |
| # of GitHub users | 163 | 249 | 1k |
| # of GitHub stars | 1.4k | 759 | 2.6k |
| # of contributors | 65 | 51 | 170 |
| Last commit to main spec repo | last year (2 weeks ago to the spec 1.2 repo) | 3 months ago | 2 days ago |
| License | Apache 2.0 | BSD 3-Clause | Apache 2.0 |
| Entries on... | | | |
| ... WorkflowHub | 81 | 12 | 129 |
| ... Dockstore | 226 | 3245 | 129 |
| ... nf-core | 0 | 0 | 97 |

CWL has a collection of common bioinformatics tools at https://github.com/common-workflow-library/bio-cwl-tools

Who?

CWL: Community-driven with a governance committee (members from Arvados, Seven Bridges Genomics, University of Manchester, ... + 1 Galaxy and 1 WDL member)
WDL: Community-driven with a governance committee (members from Chan Zuckerberg Initiative, Microsoft, Amazon, Broad Institute, DNAstack, ...)
Nextflow: Seqera Labs, Centre for Genomic Regulation (funding: Chan Zuckerberg Initiative, Seqera)

Hello World Workflow (Syntax comparison)

CWL:
cwlVersion: v1.2

class: CommandLineTool
baseCommand: echo

inputs:
  message:
    type: string
    default: "Hello World"
    inputBinding:
      position: 1
outputs: []

WDL:
version 1.0

workflow HelloWorld {
  call WriteGreeting
}

task WriteGreeting {
  command {
     echo "Hello World"
  }
  output {
     File output_greeting = stdout()
  }
}

Nextflow:
params.str = 'Hello World'

process greeting {
    input:
    val greeting

    output:
    stdout

    """
    echo ${greeting}
    """
}

workflow {
    greeting(params.str)
}

Calling a Script (Syntax comparison)

CWL:
cwlVersion: v1.2
class: CommandLineTool

requirements:
  InitialWorkDirRequirement:
    listing:
    - entryname: print.py
      entry:
        $include: ./print.py

inputs:
  message:
    type: string
    default: Hello World
    inputBinding:
      position: 1

outputs: 
  messages:
    type: stdout

baseCommand: [python3, ./print.py]

WDL:
version 1.1

task greeting {
    input {
      String the_input
      File the_file
    }
    command {
        python ~{the_file} ~{the_input}
    }
    output {
        File result = stdout()
    }
    runtime {
        container: "python:latest"
    }
}

workflow HelloWF {
    input {
        String the_input
        File the_file = "print.py"
    }
    call greeting { 
        input: 
            the_input = the_input,
            the_file = the_file
    }
    output { }
}

Nextflow:
process helloworld{
  input:
  val greeting

  output:
  file 'helloworld.txt'

  script: 
  """ python3 ${projectDir}/print.py  \"${greeting}\" """
}

params.greeting = "Hello World"

workflow {
  helloworld(params.greeting)
}

The local runner "miniWDL" does not support WDL 1.2 as of now! ⛔
As far as I can see, you cannot use a local file without passing it as a parameter, unless it is part of the container. This is true for both WDL and Nextflow and could be a dealbreaker! ⛔
To make it work with the file parameter set to its default value, one has to add a config file (otherwise miniWDL throws an error): ⛔

[file_io] 
allow_any_input = true

The WDL2CWL translator works and produces the CWL file below, which looks suboptimal but runs. However, this is a very simple WDL script! The translator repo contains test cases that are more complicated...

WDL:
version 1.1

task greeting {
    input {
      String the_input
      File the_file
    }
    command {
        python ~{the_file} ~{the_input}
    }
    output {
        File result = stdout()
    }
    runtime {
        container: "python:latest"
    }
}

workflow HelloWF {
    input {
        String the_input
        File the_file = "print.py"
    }
    call greeting { 
        input: 
            the_input = the_input,
            the_file = the_file
    }
    output { }
}

CWL:
cwlVersion: v1.2
id: HelloWF
class: Workflow
requirements:
  - class: InlineJavascriptRequirement
inputs:
  - id: the_input
    type: string
  - id: the_file
    default:
        class: File
        path: print.py
    type: File
steps:
  - id: greeting
    in:
      - id: the_input
        source: the_input
      - id: the_file
        source: the_file
    out:
      - id: result
    run:
        class: CommandLineTool
        id: greeting
        inputs:
          - id: the_input
            type: string
          - id: the_file
            type: File
        outputs:
          - id: result
            type: stdout
        requirements:
          - class: InitialWorkDirRequirement
            listing:
              - entryname: script.bash
                entry: |4

                    python $(inputs.the_file.path) $(inputs.the_input)
          - class: InlineJavascriptRequirement
          - class: NetworkAccess
            networkAccess: true
        hints:
          - class: ResourceRequirement
            outdirMin: 1024
        cwlVersion: v1.2
        baseCommand:
          - bash
          - script.bash
outputs: []

I also asked ChatGPT to wrap the code into a CWL CommandLineTool using Docker, which in this case worked surprisingly well... The only issue: the file "output.txt" is not copied back to the local directory when used like this, so I had to change a very small bit (declaring the stdout file as an output instead of leaving the outputs empty).

ChatGPT:
cwlVersion: v1.0
class: CommandLineTool

inputs:
  input_string:
    type: string
    inputBinding:
      position: 1

outputs: []
stdout: output.txt

baseCommand: python
arguments:
- -c
- |
  import sys
  print(sys.argv[1])

hints:
  DockerRequirement:
    dockerPull: python:3.9

Works as expected:
cwlVersion: v1.0
class: CommandLineTool

inputs:
  input_string:
    type: string
    inputBinding:
      position: 1

outputs: 
    output.txt:
        type: stdout

baseCommand: python
arguments:
- -c
- |
  import sys
  print(sys.argv[1])

hints:
  DockerRequirement:
    dockerPull: python:3.9

Original file from above (not using Docker):
cwlVersion: v1.2
class: CommandLineTool

requirements:
  InitialWorkDirRequirement:
    listing:
    - entryname: print.py
      entry:
        $include: ./print.py

inputs:
  message:
    type: string
    default: Hello World
    inputBinding:
      position: 1

outputs: 
  messages:
    type: stdout

baseCommand: [python3, ./print.py]

The target format should still be CWL, as this is what is accepted in the community. However, since a translator from WDL is available, one could encourage users to also write WDL. The drawback is that the translator produces a single file, so individual CWL CommandLineTools cannot be mixed and matched without manually splitting that file. A converter from Nextflow is not available, as Nextflow is far more powerful. One could implement one for a subset of features, since the typical use case seems to be script execution; there are no widespread tools here as in the bioinformatics field.
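
The manual splitting could probably be automated. The following Python sketch assumes the workflow has already been loaded into a dict (e.g. with a YAML parser) and that `steps` is a list, as in the translator output shown earlier; the function name `split_inline_tools` and the `<id>.cwl` file naming are assumptions.

```python
import copy


def split_inline_tools(workflow):
    """Replace inline CommandLineTools by file references.

    Returns (new_workflow, {filename: tool_document}); the input dict is
    left unchanged. Writing the documents to disk is up to the caller.
    """
    wf = copy.deepcopy(workflow)
    tools = {}
    for step in wf["steps"]:
        run = step["run"]
        if isinstance(run, dict) and run.get("class") == "CommandLineTool":
            # Hypothetical naming scheme: one file per tool id.
            filename = f"{run.get('id', step['id'])}.cwl"
            # A standalone tool file needs its own cwlVersion.
            run.setdefault("cwlVersion", wf.get("cwlVersion", "v1.2"))
            tools[filename] = run
            step["run"] = filename
    return wf, tools
```

Applied to the translated HelloWF workflow, this would yield a workflow whose `greeting` step runs `greeting.cwl`, plus the tool document for that file, so the tool becomes reusable in other workflows.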

Looking at the r/bioinformatics subreddit, it looks like Nextflow is the only one of these three languages that is adopted widely enough to have recent threads about it.

One could still consider GUI tools like Rabix Composer for CWL (which is deprecated), as this is what HELIPORT seems to use, judging from screenshots, or use templates for these special use cases.

Other Consortia

NFDI4Biodiv, however, plans to use Nextflow according to their latest proposal. Nextflow is what CloWM (developed by NFDI4Microbiota) supports, whereas DataPLANT uses/wants CWL for ARCs.

Note: There is a requirements document (Requirements on workflow tools) from NFDI4Ing available: https://nfdi4ingscientificworkflowrequirements.readthedocs.io/en/latest/docs/requirements.html#evaluation

Opinionated Pro/Contra List

CWL WDL Nextflow
General:
Verbosity CWL is mega verbose
Documentation 🔘 🔘 overall ok
Script Execution in Docker works best in CWL "requirements" can be cool
Working with containers works with all, is default in WDL
Output into filesystem CWL outputs req. files, others spam logs into fs
Speed 🔘 🔘 🔘 all about the same
Parsable 🔘 🔘 🔘 = Grammar available, CWL=YAML
Simplicity:
official GUI Galaxy has one
Ease of first use 🔘 🔘
Metadata Only CWL supports annotation
Conversion:
Convertible to CWL
Convertible to WDL 🔘/❌ CWL: outdated tool
Convertible to Nf 🔘/❌ CWL: outdated tool
Community:
Size of Community 🔘/🔘 🔘/✅ ✅/🔘 by # of Repos and Tools
Forum/StackOverflow/Reddit Nextflow most active

✅ good
🔘 ok
❌ bad
◽ invalid

However simple bash scripts would win most categories 😜

With all the testing I did, CWL still seems to be the right choice for our use cases (most likely: executing an existing script inside a container). CWL might be verbose, but it gives users the highest level of control.
For bioinformatics, Nextflow and WDL are surely the best choices: all tools are already available, and they form a small set of command-line tools that can easily be installed. But both expect scripts (like the BL ones) as part of their inputs, which makes the scripts overridable. We would just have to wrap "Rscript" as a tool, which is OK, but the README graph, for example, would then say "Rscript" for each step...

Opinions?