couler-proj/couler

Support script parameters

flaviuvadan opened this issue · 3 comments

The following is a description based on my experience and my understanding of the Couler code for scheduling Argo workflows. Please feel free to correct any misunderstanding on my part. If the reported issue/enhancement proposal is already on the development radar, I hope this will at least provide further support for it or start a discussion around the idea! In addition, thanks for all the ongoing development on Couler and Argo :)

Summary

It would be amazing if Couler supported passing in/specifying parameters that are dedicated to the scripts submitted via couler.run_script(...). Currently, if args are specified, they are used as container args in addition to being added to the template arguments and the step argument specification. The following code blocks serve as a reproduction:

import couler.argo as couler
from couler.argo_submitter import ArgoSubmitter
from couler.core.constants import ImagePullPolicy

def start(arg):
    def _start():
        print('{{input.parameters}}')  # just print all of them to see what's in it

    return couler.run_script(
        'python:3.7',
        source=_start,
        args=[arg],  # main issue: this becomes a container arg
        step_name='start',
        image_pull_policy=ImagePullPolicy.IfNotPresent,
    )

start('start')
submitter = ArgoSubmitter()
couler.run(submitter=submitter)

The script above results in the following log:

python: can't open file 'start': [Errno 2] No such file or directory

As you can see, the container command `python` receives the resolved parameter value "start" as its first argument, so Python tries to open a file named `start`. The template does contain the parameters, so in principle the `_start` code should be able to access them. Here's the YAML spec:

name: couler-testing
inputs: {}
outputs: {}
metadata: {}
steps:
  - - name: start-112
      template: start
      arguments:
        parameters:
          - name: para-start-0
            value: start
---
templates:
  - name: start
    inputs:
      parameters:
        - name: para-start-0  # this should be accessible
    outputs: {}
    metadata: {}
    script:
      name: ''
      image: 'python:3.7'
      command:
        - python
      args:
        - '{{inputs.parameters.para-start-0}}'  # but this makes things blow up
      resources: {}
      imagePullPolicy: IfNotPresent
      source: |
        print('{{input.parameters}}')

One way I am currently getting around this limitation is by setting environment variables with the passed parameters, as follows:

from json import dumps

def start():
    def _start():
        from os import getenv
        from json import loads
        print(loads(getenv('parameter')))

    return couler.run_script(
        'python:3.7',
        source=_start,
        env={'parameter': dumps({f'd{i}': i for i in range(5)})},
        step_name='start',
        image_pull_policy=ImagePullPolicy.IfNotPresent,
    )

The above results in the following log:

{'d0': 0, 'd1': 1, 'd2': 2, 'd3': 3, 'd4': 4}

As you may imagine, choosing a safe environment variable name is challenging: what if users overwrite an important name? What if users lack the experience to even think of checking which environment variables are being overwritten? If something is overwritten, the user is not notified, and observability is limited. As a potentially unrelated note, the above example is part of a much bigger workflow I am currently developing; as one may infer, each function in this workflow would need its own parameters.
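To reduce the collision risk described above, the workaround can at least namespace its variables under a dedicated prefix and fail loudly on conflicts. A minimal sketch (the `COULER_PARAM_` prefix and both helper names are hypothetical, not part of Couler):

```python
import json
import os

# Hypothetical convention: every workflow parameter lives under one
# prefix instead of a bare name like 'parameter' that users may clobber.
PREFIX = "COULER_PARAM_"

def make_env(params):
    """Build an env dict with prefixed keys, refusing silent overwrites."""
    env = {}
    for name, value in params.items():
        key = PREFIX + name
        if key in env or key in os.environ:
            raise ValueError(f"environment variable {key!r} is already set")
        env[key] = json.dumps(value)  # JSON keeps non-string values intact
    return env

def read_param(name):
    """Read a namespaced parameter back inside the running script."""
    return json.loads(os.environ[PREFIX + name])
```

The resulting dict can be passed straight to `run_script(..., env=make_env(...))`; this does not remove the underlying limitation, it only makes collisions explicit.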

What change needs making?

Potentially, a way to differentiate between script parameters and container parameters. I believe input artifacts are one way to solve this, but not all users want to store everything in a bucket just so parameters can be passed around as artifacts.
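One possible direction: Argo resolves `{{inputs.parameters.*}}` tags inside a script template's `source` field itself, so Couler could render script-only parameters there instead of emitting them as container `args`. A sketch of what the generated spec could look like under that approach (this is not current Couler output):

```yaml
templates:
  - name: start
    inputs:
      parameters:
        - name: para-start-0
    script:
      image: 'python:3.7'
      command:
        - python
      # no args entry, so nothing is appended after the script file
      source: |
        print('{{inputs.parameters.para-start-0}}')
```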

I am happy to attempt to solve this problem but I would love to first have a discussion of a potential way of getting around this that would best solve community use cases.

Use Cases

Users who want to run quick experiments by submitting simple functions to Argo, taking advantage of Argo's resource orchestration capabilities. This increases development velocity by removing the need to build custom container images whose scripts, along with their particular parameters, must be specified on the command line. Python images can grow to 5Gi+, depending on how many large packages users install. Therefore, a way to quickly submit functions for processing in containers that already have the required dependencies would be very valuable and could promote wider adoption of Argo.

When would you use this?

I think I covered this in Use Cases, so please refer to that.


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

Thanks for the well-written proposal. Would you like to propose a way to distinguish the args? Also cc @rushtehrani who's previously contributed a fix on script template args.

Thanks for the reply @terrytangyuan. Sure, let me take a quick look at Argo, then here, and see what we have as options for potentially implementing this.

The change I made excluded artifacts but not arguments. I believe the behavior in run_container is the same at the moment.

I thought about using container_args/args or args/arguments but that would break backward compatibility and I didn't get a chance to dig deeper into it.
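One backward-compatible alternative to renaming is adding a new keyword while leaving `args` untouched. A hypothetical sketch, not existing Couler API (`script_args` and the returned dict shape are invented for illustration):

```python
# Hypothetical: `args` keeps its current meaning (container args),
# while a new optional `script_args` kwarg carries values intended
# only for the script body. Existing callers are unaffected.
def run_script(image, source=None, args=None, script_args=None, **kwargs):
    spec = {"image": image, "command": ["python"]}
    if args:  # existing behavior: passed through as container args
        spec["args"] = list(args)
    if script_args:  # new: exposed to the script, never appended to argv
        spec["script_parameters"] = dict(script_args)
    return spec
```

Since the new kwarg defaults to `None`, existing call sites produce exactly the spec they do today.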