My YAML file appears to not get processed properly
Chris-Schnaufer opened this issue · 29 comments
My YAML file shows that the name field is missing, although it's there. Also, trying to run it reports an error that the image configuration key is missing, even though it's there as well. My YAML file: https://github.com/Chris-Schnaufer/drone-makeflow/blob/main/plantit.yaml
Hi @Chris-Schnaufer, thanks for reporting this! It seems there is indeed a bug in the configuration file validation logic: a KeyError is raised when trying to access output.path in the configuration, even though that attribute isn't required:
"validation" : {
"errors" : [
"Traceback (most recent call last):\n File \"/code/plantit/plantit/github.py\", line 404, in list_connectable_repos_by_owner\n validation = validate_repo_config(config, token)\n File \"/code/plantit/plantit/github.py\", line 111, in validate_repo_config\n if config['output']['path'] is not None and type(config['output']['path']) is not str:\nKeyError: 'path'\n"
],
"is_valid" : false
}
I think the missing name and image are cascading consequences of this.
Hi @Chris-Schnaufer, this is addressed in v0.0.33 (out tonight). It looks like the drone-makeflow repo is publicly visible now.
I did notice that in the plantit.yaml, the jobqueue section has the old time and mem attributes. These were changed to walltime and memory a few releases back. The old attribute names should still work for now but will be deprecated in a future release. The docs have now been updated to reflect this. Apologies for the delay here.
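The renamed attributes look something like this (just a sketch; the values here are illustrative, not taken from your file):

```yaml
# Sketch: renamed jobqueue attributes (old names shown in comments)
jobqueue:
  walltime: "01:00:00"   # formerly: time
  memory: "4GB"          # formerly: mem
```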
Thanks @Chris-Schnaufer, found the first problem (the 'Selected' alert doesn't handle the case for directory inputs, only file/files). Will patch shortly. Looking into the MIAPPE project binding too.
Really appreciate your help revealing all these issues; I'm not able to reliably test the whole UI surface alone.
Hi @Chris-Schnaufer, the input selection issue should now be resolved (still working on the project binding fix). Apologies for the delay. Please let me know if you are still unable to submit jobs.
I just tested the pipeline and although the submission is successful, the job fails due to a missing plantit-workflow.sh script. Checking inside the agdrone/drone-workflow:1.2 image definition, it looks like plantit-workflow.sh does not exist in the /scif/apps/src/ directory:
$ ls /scif/apps/src/
betydb2geojson.py cyverse_canopycover.sh cyverse_plotclip.sh cyverse_soilmask_ratio.sh git_algo_rgb_plot.py merge_csv.py shp2geojson_workflow.jx
betydb2geojson_workflow.jx cyverse_find_files2json.sh cyverse_short_workflow.sh find_files2json.sh git_rgb_plot_workflow.jx merge_csv_workflow.jx soilmask_ratio_workflow.jx
canopycover_workflow.jx cyverse_greenness-indices.sh cyverse_shp2geojson.sh find_files2json_workflow.jx greenness-indices_workflow.jx plotclip_workflow.jx soilmask_workflow.jx
cyverse_betydb2geojson.sh cyverse_merge_csv.sh cyverse_soilmask.sh generate_geojson.sh jx-args.json short_workflow.jx
Hello @w-bonelli. I was able to get back to testing this and I am still having problems. I changed the docker image so that it's pointing to a test version that has the plantit-workflow.sh file in the correct location: the image is chrisatua/development:drone_makeflow.
I'm not sure why it's reporting that it can't find the docker image. I've tried uploading the docker image again and there's no change in the run result. Here are two screenshots showing the step before running the Task, and the Task result. I'm running from the main branch of the following repo: https://github.com/Chris-Schnaufer/drone-makeflow
On another note, I appear to have two projects with the same GitHub path:
Thanks @Chris-Schnaufer looking into this now
Hi @Chris-Schnaufer, I believe the root cause of the latest issue was that the docker image attribute was not parsed properly from plantit.yaml (due to the comment here). This should now be fixed with v0.1.0. Apologies again for the delay.
I was able to submit AgPipeline/drone-makeflow without errors last night; however, the container workflow did not complete successfully. Here was the error message: /usr/bin/sh: 1: /scif/apps/src/plantit-workflow.sh: not found
@w-bonelli That repository has the incorrect Docker image listed. The one at Chris-Schnaufer/drone-makeflow is the one that shows up on my system.
Ok, got it. I just did a test run with Chris-Schnaufer/drone-makeflow and received this output:
INPUT FOLDER /scratch/03203/dirt/plantit/2e10ffd5-a5ec-4a4e-9e00-218866cd81fc/input/canopycover_test_data
WORKING FOLDER /scratch/03203/dirt/plantit/2e10ffd5-a5ec-4a4e-9e00-218866cd81fc
Processing with /scratch/03203/dirt/plantit/2e10ffd5-a5ec-4a4e-9e00-218866cd81fc/input/canopycover_test_data/orthoimage_mask.tif /scratch/03203/dirt/plantit/2e10ffd5-a5ec-4a4e-9e00-218866cd81fc/input/canopycover_test_data/plots.json
Options: --metadata /scratch/03203/dirt/plantit/2e10ffd5-a5ec-4a4e-9e00-218866cd81fc/input/canopycover_test_data/experiment.yaml
/scif/apps/src/plantit-workflow.sh: line 61: /scif/apps/src/jx-args.json: Read-only file system
Running workflow steps: soilmask plotclip find_files2json canopycover merge_csv
Running app 0 'soilmask'
[soilmask] executing /bin/bash /scif/apps/soilmask/scif/runscript
makeflow: line 0: expected a workflow definition as a JSON object but got error("on line 4, SOILMASK_MASK_FILE: undefined symbol") instead
2022/02/28 11:17:22.61 makeflow[30569] fatal: makeflow: couldn't load /scif/apps/src/soilmask_workflow.jx: Invalid argument
Terminated
Running app 1 'plotclip'
[plotclip] executing /bin/bash /scif/apps/plotclip/scif/runscript
makeflow: line 0: expected a workflow definition as a JSON object but got error("on line 10, PLOTCLIP_SOURCE_FILE: undefined symbol") instead
2022/02/28 11:17:22.97 makeflow[30573] fatal: makeflow: couldn't load /scif/apps/src/plotclip_workflow.jx: Invalid argument
Terminated
Running app 2 'find_files2json'
[find_files2json] executing /bin/bash /scif/apps/find_files2json/scif/runscript
makeflow: line 0: expected a workflow definition as a JSON object but got error("on line 10, FILES2JSON_SEARCH_NAME: undefined symbol") instead
2022/02/28 11:17:23.34 makeflow[30578] fatal: makeflow: couldn't load /scif/apps/src/find_files2json_workflow.jx: Invalid argument
Terminated
Running app 3 'canopycover'
[canopycover] executing /bin/bash /scif/apps/canopycover/scif/runscript
2022/02/28 11:17:23.71 makeflow[30582] fatal: Failed to parse in JX Args File.
Terminated
Running app 4 'merge_csv'
[merge_csv] executing /bin/bash /scif/apps/merge_csv/scif/runscript
makeflow: line 0: expected a workflow definition as a JSON object but got error("on line 10, MERGECSV_SOURCE: undefined symbol") instead
2022/02/28 11:17:24.07 makeflow[30586] fatal: makeflow: couldn't load /scif/apps/src/merge_csv_workflow.jx: Invalid argument
Terminated
Workflow completed
Looks like the root issue is that Singularity makes the filesystem read-only by default. So when plantit-workflow.sh tries to write to /scif/apps/src/jx-args.json, it fails and the error cascades.
Would it be possible to alter the way the agdrone workflow accepts configuration info? E.g., allowing the location of jx-args.json to be specified at invocation time instead of expecting it to be at /scif/apps/src?
Another option might be to use the mount attribute in plantit.yaml (this configures a Singularity bind mount mapping the specified path to the working directory on the host), then modify plantit-workflow.sh to git clone the repo and write directly to jx-args.json instead of /scif/apps/src/jx-args.json. (plantit no longer supports the automatic clone option in plantit.yaml because of some headaches re: handling potential duplicate filenames, but there is no reason a workflow can't manually do it.)
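A rough sketch of what that could look like (the path shown is just the one the workflow currently writes to; treat this as illustrative rather than a tested config):

```yaml
# Sketch: each entry under mount is a container path that gets bind-mounted
# to the host working directory, making it writable from inside the container.
mount:
  - /scif/apps/src
```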
Hello @w-bonelli, I have looked at this and have some comments (I'm also not knowledgeable about Singularity):
> allowing the location of jx-args.json to be specified at invocation time
There are dependencies built into the container that would require additional writes to the file system to allow this, so I don't think it would work.
> Singularity makes the filesystem read-only by default
What are the writable folders on the system? In other words, how do generated files get saved in the container and exported from the container when it's done? Also, the apps that run expect the /scif folder to be writable as part of the app management system; can anything be done to enable writing?
Any help on this is appreciated!
Hi @Chris-Schnaufer,
> What are the writable folders on the system? In other words, how do generated files get saved in the container and exported from the container when it's done?
Singularity automatically mounts the current directory on the host into the container (as well as /home/$USER and /tmp) under the paths as they appear on the host filesystem. Those are the only writable locations by default, if I understand the docs correctly. So something like singularity exec docker://alpine touch test.txt works, and test.txt will exist in the host working directory after the container exits, but singularity exec docker://alpine touch /opt/test.txt fails with a "Read-only file system" warning.
> Also, the apps that run expect the /scif folder to be writable as part of the app management system; can anything be done to enable writing?
I think bind mounts might work as an indirect way of making /scif/apps/src writable by modifying the container's view of the filesystem. Bind mounts allow mapping paths on the host to custom paths within the container. PlantIT supports this via the mount attribute in the plantit.yaml file. If you mount /scif/apps/src (example here), that will overwrite that folder in the container and replace it with the contents of the host working directory, without changing its path as visible to the container. Is there anything at /scif/apps/src in the image definition that is not present in the GitHub repo? If not, the first step in plantit-workflow.sh could be to clone the repo into /scif/apps/src, after which I believe the container could read and write to anything in that folder.
I will try this tomorrow or Friday to check that it works as expected. I wish we could provide more straightforward support for writing to arbitrary locations, but I think it is a pretty fundamental Singularity limitation.
The system works by checking out the repo and then building the docker image. The built solution is what's run.
Trying this now.
Hi @Chris-Schnaufer my apologies again for the delay. I think I have this working now. See the diff here for the changes.
What I did:
- add bind mounts to plantit.yaml for each location the workflow needs to write to (see the sketch after this list)
- update the Dockerfile to copy plantit-workflow.sh into a different location in the container (I used opt/dev, but nearly anywhere should work), so the script isn't overwritten when the host's working directory is mounted to scif/apps/src
- update the commands entrypoint in plantit.yaml to reflect the new location of the workflow script
- update plantit-workflow.sh to pull the drone-makeflow repo into (what the container sees as) scif/apps/src (this is a bit involved since git refuses to clone a repo into an occupied directory)
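Roughly, the plantit.yaml side of those changes looks like this (a sketch based on the description above, not a copy of the actual diff; see the diff link for the exact mounts, paths, and entrypoint):

```yaml
# Sketch: point the entrypoint at the relocated script and bind-mount
# every container path the workflow writes to (one entry per location).
commands: /opt/dev/plantit-workflow.sh
mount:
  - /scif/apps/src
  - /scif/data/soilmask
  - /scif/data/plotclip
  # ...and so on for the other write locations
```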
The workflow seems to run successfully. The job log includes the following output:
Processing with /scratch/wpb36237/plantit/19675bd2-b4c6-4d5a-954a-5bada2c426e3/input/canopycover_test_data/orthomosaic.tif /scratch/wpb36237/plantit/19675bd2-b4c6-4d5a-954a-5bada2c426e3/input/canopycover_test_data/plots.json
Options: --metadata /scratch/wpb36237/plantit/19675bd2-b4c6-4d5a-954a-5bada2c426e3/input/canopycover_test_data/experiment.yaml
Running workflow steps: soilmask plotclip find_files2json canopycover merge_csv
Running app 0 'soilmask'
[soilmask] executing /bin/bash /scif/apps/soilmask/scif/runscript
parsing /scif/apps/src/soilmask_workflow.jx...
local resources: 28 cores, 257741 MB memory, 15 MB disk
max running local jobs: 28
checking /scif/apps/src/soilmask_workflow.jx for consistency...
/scif/apps/src/soilmask_workflow.jx has 1 rules.
creating new log file /scif/data/soilmask/workflow.jx.makeflowlog...
checking files for unexpected changes... (use --skip-file-check to skip this step)
starting workflow....
submitting job: ${SCIF_APPROOT}/.venv/bin/python3 ${SCIF_APPROOT}/${SCRIPT_PATH} ${DOCKER_OPTIONS} --working_space "${WORKING_FOLDER}" "${INPUT_GEOTIFF}"
submitted job 16986
{
"code": 0,
"file": [
{
"path": "/scratch/wpb36237/plantit/19675bd2-b4c6-4d5a-954a-5bada2c426e3/orthomosaicmask.tif",
"key": "stereoTop",
"metadata": {
"data": {
"name": "soilmask",
"version": "2.2",
"ratio": 0.15308405301006114
}
}
}
]
}
job 16986 completed
nothing left to do.
Running app 1 'plotclip'
[plotclip] executing /bin/bash /scif/apps/plotclip/scif/runscript
parsing /scif/apps/src/plotclip_workflow.jx...
local resources: 28 cores, 257741 MB memory, 15 MB disk
max running local jobs: 28
checking /scif/apps/src/plotclip_workflow.jx for consistency...
/scif/apps/src/plotclip_workflow.jx has 1 rules.
recovering from log file /scif/data/plotclip/workflow.jx.makeflowlog...
checking for old running or failed jobs...
checking files for unexpected changes... (use --skip-file-check to skip this step)
starting workflow....
nothing left to do.
Running app 2 'find_files2json'
[find_files2json] executing /bin/bash /scif/apps/find_files2json/scif/runscript
parsing /scif/apps/src/find_files2json_workflow.jx...
local resources: 28 cores, 257741 MB memory, 15 MB disk
max running local jobs: 28
checking /scif/apps/src/find_files2json_workflow.jx for consistency...
/scif/apps/src/find_files2json_workflow.jx has 1 rules.
recovering from log file /scif/data/find_files2json/workflow.jx.makeflowlog...
checking for old running or failed jobs...
checking files for unexpected changes... (use --skip-file-check to skip this step)
starting workflow....
nothing left to do.
Running app 3 'canopycover'
[canopycover] executing /bin/bash /scif/apps/canopycover/scif/runscript
Running app 4 'merge_csv'
[merge_csv] executing /bin/bash /scif/apps/merge_csv/scif/runscript
parsing /scif/apps/src/merge_csv_workflow.jx...
local resources: 28 cores, 257741 MB memory, 15 MB disk
max running local jobs: 28
checking /scif/apps/src/merge_csv_workflow.jx for consistency...
/scif/apps/src/merge_csv_workflow.jx has 1 rules.
recovering from log file /scif/data/merge_csv/workflow.jx.makeflowlog...
checking for old running or failed jobs...
checking files for unexpected changes... (use --skip-file-check to skip this step)
starting workflow....
nothing left to do.
Workflow completed
And the workflow.jx.makeflowlog file contents:
# NODE 0 ${SCIF_APPROOT}/.venv/bin/python3 ${SCIF_APPROOT}/${SCRIPT_PATH} ${DOCKER_OPTIONS} --working_space "${WORKING_FOLDER}" "${INPUT_GEOTIFF}"
# CATEGORY 0 default
# SYMBOL 0 default
# PARENTS 0
# SOURCES 0 /scratch/wpb36237/plantit/19675bd2-b4c6-4d5a-954a-5bada2c426e3/input/canopycover_test_data/orthomosaic.tif
# TARGETS 0 /scratch/wpb36237/plantit/19675bd2-b4c6-4d5a-954a-5bada2c426e3/orthomosaicmask.tif
# COMMAND 0 ${SCIF_APPROOT}/.venv/bin/python3 ${SCIF_APPROOT}/${SCRIPT_PATH} ${DOCKER_OPTIONS} --working_space "${WORKING_FOLDER}" "${INPUT_GEOTIFF}"
# FILE 1647819147640506 /scif/data/soilmask/workflow.jx.batchlog 1 0
# STARTED 1647819147659151
# FILE 1647819147677957 /scratch/wpb36237/plantit/19675bd2-b4c6-4d5a-954a-5bada2c426e3/orthomosaicmask.tif 1 1073741824
1647819147678183 0 1 16986 0 1 0 0 0 1
# FILE 1647819150192981 /scratch/wpb36237/plantit/19675bd2-b4c6-4d5a-954a-5bada2c426e3/orthomosaicmask.tif 2 3807181
1647819150193049 0 2 16986 0 0 1 0 0 1
# COMPLETED 1647819150193083
# FILE 1647819150301904 /scif/data/plotclip/workflow.jx.batchlog 1 0
# STARTED 1647819150328675
# COMPLETED 1647819150330269
# FILE 1647819150438240 /scif/data/find_files2json/workflow.jx.batchlog 1 0
# STARTED 1647819150446188
# COMPLETED 1647819150447752
# FILE 1647819150595902 /scif/data/merge_csv/workflow.jx.batchlog 1 0
# STARTED 1647819150602725
# COMPLETED 1647819150604079
I'm not sure what output files are expected, so I'm not sure how to validate results. You may need to specify exact names of expected output files in the output.include.names section of plantit.yaml (or use quite a few output.exclude.names entries), since the job working directory will have everything from the drone-workflow repo in it.
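For instance, something along these lines (the only file name I'm confident about from the logs is orthomosaicmask.tif; the second entry is a placeholder):

```yaml
# Sketch: explicitly include expected outputs so repo files in the working
# directory don't get swept up in the results.
output:
  include:
    names:
      - orthomosaicmask.tif
      - canopycover.csv   # placeholder, replace with the real output names
```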
Hope this helps. Please let me know if I can do anything else, and thanks again; this has been a really valuable edge case to explore and figure out how to support.
Wow! Thanks @w-bonelli! This looks great so far and I will look into it further.
Regarding the mounts, is there a reason that the folders under /scif/data/* are separate mounts versus mounting only /scif/data?
@Chris-Schnaufer No problem! The workflows refresh every 5 minutes, so you may need to reload the page to see the changes reflected. It would be nice to be able to manually refresh particular workflows though. I'll add that to the roadmap.
I also have 4; it looks like 2 branches under AgPipeline/drone-makeflow (main and develop), 1 under Chris-Schnaufer/drone-makeflow, and my own fork w-bonelli/drone-makeflow.
De-duplication is planned for branches of the same repo (tracked here) but I have not gotten to it yet.
In the meantime, we can add one of the workflows to the Featured context if you'd like, so it shows up immediately when the user navigates to the workflows view.
@w-bonelli thanks for the quick response. I'm not ready to have the workflow featured yet but I will let you know when I think it's ready.
Hello @w-bonelli, I am still seeing these issues. Any updates? Thanks
Hi @Chris-Schnaufer, apologies for the delay. Which issue are you seeing? The Stampede2 agent is no longer publicly available, but there is a Sapelo2 agent you should be able to submit to.
Hello, please see the above comment (the last image shows the Sapelo2 issue). Please ignore the Stampede2 agent since it's no longer available. Comment link: #259 (comment)
Is it the authentication failed error? Would you mind sharing a more recent task ID?
Yes, the error reads Failed to grant temporary data access.
I tried to reproduce the problem but I am unable to select the Sapelo2 option.
That's likely because the agdrone workflow requests 8 cores, while the public Sapelo2 agent allows a max of 2. If you update your plantit.yaml to request <=2 cores, I think it should allow you to submit. I will see if I can reproduce the issue on my own fork later today.
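e.g. something like this (a sketch; I'm assuming cores lives in the jobqueue section alongside walltime/memory, and the other values are illustrative):

```yaml
# Sketch: the public Sapelo2 agent caps tasks at 2 cores.
jobqueue:
  cores: 2        # was 8
  walltime: "01:00:00"
  memory: "4GB"
```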
In the longer term we're changing the way orchestration works to take advantage of GitHub Actions, so plantit won't have to manually manage agents or poll clusters for job status. This will involve changes to the plantit.yaml specification, but ultimately it will allow you to plug in your own cluster, removing limitations like this.
I'll update this thread as that work progresses.
I think this occurred because this task's output location is your home folder /iplant/home/schnaufer. The data store only allows granting guest permissions to write to collections inside the user's home directory, not the home folder itself.
Thanks for bringing this to light, I've updated the site to disallow selecting the top-level home folder as the output location.
I am now able to run my workflows. I will open a new issue for any problems found.