aws/amazon-genomics-cli

--inputsFile is not being passed to miniwdl run

Opened this issue Β· 13 comments

nh13 commented

The parameters specified in the JSON given to --inputsFile are not being propagated to miniwdl run

I am using AGC release1.5 with WDL and miniWDL. I have a directory of WDL files, so in my sourceURL in my agc-project.yaml points to the top-level directory. In that directory, I have a MANIFEST.json with the top level workflow:

{
	"mainWorkflowURL": "./workflows/top_level_workflow.wdl"
}

I have a local JSON file that contains all the inputs needed by the WDL workflow, both parameters and S3 URIs.

I then call:

agc workflow run \
    <workflow-name> \
    --context <context-name> \
    --inputsFile /local/path/to/inputs.json;

My Batch job fails.

  | 2022-06-01T15:00:15.088-07:00 | {
  | 2022-06-01T15:00:15.088-07:00 | "error": "InputError",
  | 2022-06-01T15:00:15.088-07:00 | "message": "missing required inputs for <workflow-name>: fastq_1, fastq_2, threads, ...
  | 2022-06-01T15:00:15.088-07:00 | }

Executing miniwdl locally works fine when I change the S3 URIs to local file paths in the inputs.json.

If I add the below to the MANIFEST.json, it runs fine.

    "inputFileURLs": [
        "inputs.json"
    ]

Looking at the docs, inputFileURLs should be optional while --inputsFile says it can be used to provide inputs to the workflow. What am I missing?

mlin commented

I've observed this too

Investigating

Re-reading the above, I see an expectation that --inputsFile as a command line option should blend with or take precedence over contents of MANIFEST.json. I think that should be the case as well and need to verify if the code actually does that.

nh13 commented

If inputsFile is specified on the command line and manifest, the behavior is undefined. In this case, it’s not specified in the manifest, so I’d expect it to use the one given in the command line. Hopefully that helps!

Confirmed that this results in EXECUTOR_ERROR using the read workflow in the demo-wdl-project submitting with the command:

agc workflow run -c miniContext read --inputsFile ./workflows/read/read.inputs.json

and the following cases for MANIFEST.json

case 1

{
  "mainWorkflowURL": "main.wdl"
}

case 2:

{
  "mainWorkflowURL": "main.wdl",
  "inputFileURLs": []
}

Working on a fix for this. There's an open question about what should happen when inputsFile is declared in the MANIFEST and on command line. I see three logical options

  1. We send them all to the engine and the engine processes them according to it's rules
  2. We merge them. Any duplicates are resolved by giving precedence to the inputs file declared on the command line
  3. Any inputs file declared by the MANIFEST is totally ignored if the command line argument is used

Out of these I think:

  • 1 or 3 are the easiest to implement.
  • 1 works well with engines like Cromwell that can handle this but would not be a great experience for engines that cannot
  • 2 makes it easy to over-ride only a few inputs and have your defaults present in the file referenced by the manifest.
  • 3 is very simple but puts the burden on the user to do any merging or copy/ paste. It does make it very explicit what will happen (e.g. no surprise configuration in that other inputs file you forgot about, no failed overrides due to typos)

Which solution do people prefer?

nh13 commented

I prefer 2 as I have static genomic reference data which could be in the manifest, and per-run sample genomic data and metadata that need to be custom for each run. Having the former in the manifest, and the latter in a json on the command line makes that split easy to understand.

I have added additional logging statements to the agc workflow run logic so that --verbose output reveals what is going on. It seems that having an inputs.json defined in both the MANIFEST and on command line leads to a double handling with both sets of inputs going to S3. Inspecting the output shows that miniwdl uses the inputs defined by the inputs.json defined by the MANIFEST.

The resolution is then to merge all input sources with command line taking precedence and work from that final source to upload the consolidated artifacts. We probably have to update the uploaded MANIFEST (if any) to point to this new resolved copy of the consolidated inputs.json

agc workflow run -c miniContext words-with-vowels --inputsFile /tmp/words-with-vowels.inputs.json --verbose

2022-06-08T13:22:51-04:00 ↓  Checking AGC version...
2022-06-08T13:22:51-04:00 π’Š  Running workflow. Workflow name: 'words-with-vowels', InputsFile: '/tmp/words-with-vowels.inputs.json', OptionFile: '', Context: 'miniContext'
2022-06-08T13:22:51-04:00 ↓  reading project specification
2022-06-08T13:22:51-04:00 ↓  reading specification of 'words-with-vowels' workflow
2022-06-08T13:22:51-04:00 ↓  workflow type: 'wdl' version: '1.0', workflow source url: 'workflows/words'
2022-06-08T13:22:51-04:00 ↓  current user id: 'mrschre4GqyMA'
2022-06-08T13:22:51-04:00 ↓  obtaining spec for context 'miniContext'
2022-06-08T13:22:51-04:00 ↓  using engine 'miniwdl' from context 'miniContext'
2022-06-08T13:22:51-04:00 ↓  checking deployment status of 'miniContext' context
2022-06-08T13:22:54-04:00 ↓  using output bucket 'agc-1234567890123-us-east-1'
2022-06-08T13:22:54-04:00 ↓  parsed workflow location as 'workflows/words'
2022-06-08T13:22:54-04:00 ↓  workflow location is local? 'true', upload is required? 'true'
2022-06-08T13:22:54-04:00 ↓  workflow upload base object key is 'project/Demo/userid/mrschre4GqyMA/context/miniContext/workflow/words-with-vowels'
2022-06-08T13:22:54-04:00 ↓  workflow path is '/Users/mrschre/devel/amazon-genomics-cli/examples/demo-wdl-project/workflows/words
2022-06-08T13:22:54-04:00 ↓  workflow '' path is a directory, packing contents ...
2022-06-08T13:22:54-04:00 ↓  recursively copying content of '/Users/mrschre/devel/amazon-genomics-cli/examples/demo-wdl-project/workflows/words' to '/var/folders/9j/zdk45gtj5znf3670v17tnj2h0000gt/T/workflow_2114126158'
2022-06-08T13:22:54-04:00 ↓  updating file references and loading packed content to 'agc-1234567890123-us-east-1/project/Demo/userid/mrschre4GqyMA/context/miniContext/workflow/words-with-vowels'
2022-06-08T13:22:54-04:00 ↓  reading manifest in '/var/folders/9j/zdk45gtj5znf3670v17tnj2h0000gt/T/workflow_2114126158
2022-06-08T13:22:54-04:00 ↓  manifest declares '1' input files
2022-06-08T13:22:54-04:00 ↓  reading content of input file at '/var/folders/9j/zdk45gtj5znf3670v17tnj2h0000gt/T/workflow_2114126158/words-with-vowels.inputs.json'
2022-06-08T13:22:54-04:00 ↓  content of '/var/folders/9j/zdk45gtj5znf3670v17tnj2h0000gt/T/workflow_2114126158/words-with-vowels.inputs.json' is 
{
  "Words_With_Vowels.word_file": "./mieliestronk-words.txt"
}
2022-06-08T13:22:54-04:00 ↓  inspecting key value pair, 'Words_With_Vowels.word_file: ./mieliestronk-words.txt'
2022-06-08T13:22:54-04:00 ↓  input value './mieliestronk-words.txt' can be resolved to a file at '/Users/mrschre/devel/amazon-genomics-cli/examples/demo-wdl-project/workflows/words/./mieliestronk-words.txt'
2022-06-08T13:22:54-04:00 ↓  loading 'mieliestronk-words.txt' to 'project/Demo/userid/mrschre4GqyMA/context/miniContext/workflow/words-with-vowels/mieliestronk-words.txt'
2022-06-08T13:22:56-04:00 ↓  updated reference '0' to 's3://agc-581254785006-us-east-1/project/Demo/userid/mrschre4GqyMA/context/miniContext/workflow/words-with-vowels/mieliestronk-words.txt'
2022-06-08T13:22:56-04:00 ↓  key value pair updated to 'Words_With_Vowels.word_file: s3://agc-581254785006-us-east-1/project/Demo/userid/mrschre4GqyMA/context/miniContext/workflow/words-with-vowels/mieliestronk-words.txt'
2022-06-08T13:22:56-04:00 ↓  updloading '/var/folders/9j/zdk45gtj5znf3670v17tnj2h0000gt/T/workflow_2489931138' to 's3://agc-581254785006-us-east-1/project/Demo/userid/mrschre4GqyMA/context/miniContext/workflow/words-with-vowels/workflow.zip
2022-06-08T13:22:56-04:00 ↓  cleaning up '/var/folders/9j/zdk45gtj5znf3670v17tnj2h0000gt/T/workflow_2489931138'
2022-06-08T13:22:56-04:00 ↓  workflow artifacts at 's3://agc-581254785006-us-east-1/project/Demo/userid/mrschre4GqyMA/context/miniContext/workflow/words-with-vowels/workflow.zip' will be used to run the workflow
2022-06-08T13:22:56-04:00 ↓  Input file override URL: /tmp/words-with-vowels.inputs.json
2022-06-08T13:22:56-04:00 ↓  content is:
'{
  "Words_With_Vowels.word_file": "./words.txt"
}
'
2022-06-08T13:22:56-04:00 ↓  moving local inputs from '/tmp' to s3://agc-581254785006-us-east-1/project/Demo/userid/mrschre4GqyMA/data and replacing paths with S3 paths
2022-06-08T13:22:56-04:00 ↓  inspecting key value pair, 'Words_With_Vowels.word_file: ./words.txt'
2022-06-08T13:22:56-04:00 ↓  input value './words.txt' can be resolved to a file at '/tmp/./words.txt'
2022-06-08T13:22:56-04:00 ↓  loading 'words.txt' to 'project/Demo/userid/mrschre4GqyMA/data/words.txt'
2022-06-08T13:22:56-04:00 ↓  updated reference '0' to 's3://agc-581254785006-us-east-1/project/Demo/userid/mrschre4GqyMA/data/words.txt'
2022-06-08T13:22:56-04:00 ↓  key value pair updated to 'Words_With_Vowels.word_file: s3://agc-581254785006-us-east-1/project/Demo/userid/mrschre4GqyMA/data/words.txt'
2022-06-08T13:22:56-04:00 ↓  arguments are: '{"Words_With_Vowels.word_file":"s3://agc-581254785006-us-east-1/project/Demo/userid/mrschre4GqyMA/data/words.txt"}'
2022-06-08T13:22:56-04:00 ↓  using context infrastructure from cloudformation stack 'Agc-Context-Demo-mrschre4GqyMA-miniContext'
2022-06-08T13:22:58-04:00 ↓  workflow will be submitted to wes endpoint at 'https://w2xbju8tb0.execute-api.us-east-1.amazonaws.com/prod/'
2022-06-08T13:22:58-04:00 ↓  constructing API client for WES endpoint at 'https://w2xbju8tb0.execute-api.us-east-1.amazonaws.com/prod/'
2022-06-08T13:22:58-04:00 ↓  EstablishWesConnection(https://w2xbju8tb0.execute-api.us-east-1.amazonaws.com/prod/ga4gh/wes/v1)
2022-06-08T13:22:58-04:00 ↓  Ping WES endpoint until we get an answer:
2022-06-08T13:22:58-04:00 ↓  Attempt 1 of 3
2022-06-08T13:23:00-04:00 ↓  Connected to WES endpoint
2022-06-08T13:23:00-04:00 ↓  saved attachment for argument '{"Words_With_Vowels.word_file":"s3://agc-123456789012-us-east-1/project/Demo/userid/mrschre4GqyMA/data/words.txt"}' to '/var/folders/9j/zdk45gtj5znf3670v17tnj2h0000gt/T/words-with-vowels.inputs.json_2587074133'
2022-06-08T13:23:00-04:00 ↓  workflow parameter of 'workflowInputs' is 'words-with-vowels.inputs.json_2587074133'
2022-06-08T13:23:00-04:00 ↓  running workflow at 's3://agc-123456789012-us-east-1/project/Demo/userid/mrschre4GqyMA/context/miniContext/workflow/words-with-vowels/workflow.zip' with language 'wdl' and version '1.0' using attachments '[/var/folders/9j/zdk45gtj5znf3670v17tnj2h0000gt/T/words-with-vowels.inputs.json_2587074133]', workflow parameters 'map[workflowInputs:words-with-vowels.inputs.json_2587074133]' and engine parameters 'map[]'
2022-06-08T13:23:01-04:00 ↓  recording workflow run metadata for workflow run id 'fb6c4633-4b29-4eb8-896f-f60a998cc8c9' to DynamodDB
2022-06-08T13:23:02-04:00 ↓  cleaning up '/var/folders/9j/zdk45gtj5znf3670v17tnj2h0000gt/T/words-with-vowels.inputs.json_2587074133'

BTW, code with additional verbose logging is available at https://github.com/aws/amazon-genomics-cli/tree/markjschreiber/issue-474 branch

nh13 commented

I think this also blocks running multiple workflows concurrently in the same context, since the MANIFEST.json with a per-workflow inputFileURLs specified in the MANIFEST.json is packaged with the workflow (workflow.zip). The workflow.zip is 1:1 with the context, and not with the agc workflow run. This is certainly unexpected, as I'd expect that agc workflow run would not overwrite any previous definition of the workflow. So I think this is a bigger issue than just convenience.

Edit: see #497

Interesting observation @nh13, indeed there is only one workflow.zip per context. Do you know how this affects the concurrent execution of workflows with Cromwell?
I have launched four workflows one after the other with a delay of between 1 and 10 minutes and they all seem to have picked the right input. They all use the same MANIFEST and I then pass the sample-specific input file with the -inputsFile parameter. From what I have seen, workflow.zip in s3 is rewritten after each agc workflow run but perhaps this file is only used transiently? If so, is there a recommended time we should wait between agc workflow run commands?

I think this also blocks running multiple workflows concurrently in the same context, since the MANIFEST.json with a per-workflow inputFileURLs specified in the MANIFEST.json is packaged with the workflow (workflow.zip). The workflow.zip is 1:1 with the context, and not with the agc workflow run. This is certainly unexpected, as I'd expect that agc workflow run would not overwrite any previous definition of the workflow. So I think this is a bigger issue than just convenience.

Please escalate this to its own issue

I am also seeing this behavior with the nextflow engine as well.