DaCe Orchestration issues
Describe the bug
DaCe orchestration fails when the pySHiELD microphysics module is parsed with unused variables. Specifically, the sedimentation and icloud methods were passed the unused variables d0_vap, lv00, and cracw, which resulted in a ValueError. The variables have since been removed in PR 15.
To Reproduce
Add d0_vap and lv00 back into the sedimentation method and its calls, and cracw back into the icloud method and its calls. Make sure to use a dace backend, with the FV3_DACEMODE environment variable set to BuildAndRun to enable orchestration when running the model in pace. To start the run, use:
mpirun -n X python -m pace.run <config yaml>
Expected behavior
DaCe should be able to use orchestration regardless of variable usage.
System Environment
Describe the system environment, include:
- OS: RHEL 8.9
- Backend used: dace:cpu
- Environment variables set: FV3_DACEMODE=BuildAndRun
- Compiler(s): gcc/12.3.0, python/3.11
- MPI type, and version: openmpi/5.0.0
- netCDF Version: netcdf/4.9.2
- If this bug came from a model run, which model: baroclinic_c12_orch_cpu.yaml
Another case of the same issue popped up: unused variables create a parsing issue in orch:dace:X.
After some digging: the issue lives in the gt4py/dace bridge.
In orchestration, the StencilFactory uses a lazy_stencil to defer the build to JIT time. This system relies on the DaCeLazyStencil.
1/ When DaCe takes over to move the code into the SDFG IR, the __sdfg__ function gets executed. The stencil gets packed into an unexpanded version whose signature is given as a simple list of the declared arguments.
2/ Later on, as orchestration progresses, DaCe turns the OIR (the last GT IR) into an SDFG. The bug manifests here, where the bridge attempts to build the inputs/outputs using the node.params list.
The bug is that, by then, the GT pipeline might have culled a parameter that existed in 1/ because it is unused. But 1/ already promised this parameter to the system, while 2/ follows the OIR, where the parameter no longer exists: hence the failure.
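The disagreement between 1/ and 2/ can be illustrated with a toy model. This is only a sketch: none of the functions below are real gt4py or DaCe APIs, the `qvapor` argument is a made-up name, and the error message is illustrative rather than the one the bridge actually raises.

```python
# Toy model of the mismatch; these helpers are hypothetical, not gt4py/DaCe code.

def declared_signature(stencil_args):
    """Step 1/: the signature promised up front, i.e. every declared argument."""
    return list(stencil_args)


def prune_unused(stencil_args, used_names):
    """Stand-in for the GT pass that culls parameters the stencil never uses."""
    return [name for name in stencil_args if name in used_names]


def bind_arguments(promised, surviving):
    """Step 2/: build inputs/outputs from the surviving parameters only."""
    missing = [name for name in promised if name not in surviving]
    if missing:
        raise ValueError(f"arguments promised but absent after lowering: {missing}")
    return {name: f"array<{name}>" for name in surviving}


# d0_vap and lv00 are declared but never used; qvapor is a hypothetical used field.
args = ["qvapor", "d0_vap", "lv00"]
promised = declared_signature(args)
surviving = prune_unused(args, used_names={"qvapor"})

try:
    bind_arguments(promised, surviving)
    error = None
except ValueError as exc:
    error = str(exc)

print(error)
```

Here the promised list still contains `d0_vap` and `lv00`, while the pruned list does not, so the binding step has no way to satisfy the promise.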
To fix this, we need 1/ and 2/ to agree. The issue is that 1/ doesn't know what kind of operation will happen next, and 2/ is logically looking at the "truth" coming out of the pipeline. Neither one is wrong per se, and neither has access to the right info.
One way to deal with this would be to try to deactivate the optimization pass that culls unused parameters (via its skip attribute), but this might have side effects on performance. A better fix is to see if 1/ or 2/ can be changed without breaking both the stencil and orchestration pipelines.
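As a rough idea of what changing 2/ could look like, the sketch below lets the binding step tolerate arguments that were declared but culled, instead of failing on them. It is purely hypothetical: `reconcile` is not a gt4py or DaCe function, `qvapor` is a made-up argument name, and whether the bridge can actually see both lists at the same point is exactly the open question raised above.

```python
# Hypothetical reconciliation step; not an existing gt4py/DaCe API.

def reconcile(promised, surviving):
    """Split the promised arguments into those still present after the
    pipeline ran and those that were culled and can be ignored."""
    promised_set = set(promised)
    surviving_set = set(surviving)

    # A surviving parameter that was never declared is a genuine error.
    unknown = [name for name in surviving if name not in promised_set]
    if unknown:
        raise ValueError(f"parameters not in the declared signature: {unknown}")

    bound = [name for name in promised if name in surviving_set]
    ignored = [name for name in promised if name not in surviving_set]
    return bound, ignored


bound, ignored = reconcile(["qvapor", "d0_vap", "lv00"], ["qvapor"])
print("bound:", bound)
print("ignored:", ignored)
```

The design choice here is to keep the declared signature stable for callers and only drop culled names at binding time, which avoids turning off the pruning pass.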