NOAA-GFDL/NDSL

DaCe Orchestration issues


Describe the bug
DaCe orchestration fails when the pySHiELD microphysics module is parsed with unused variables. Specifically, the sedimentation and icloud methods were passed the unused variables d0_vap, lv00, and cracw, which resulted in a ValueError. The variables have since been removed in PR 15.

To Reproduce
Re-add d0_vap and lv00 to the sedimentation method and its calls, and cracw to the icloud method and its calls. Make sure to use a dace backend, with the FV3_DACEMODE environment variable set to BuildAndRun, to enable orchestration when running the model in pace. To start the run, use:
mpirun -n X python -m pace.run <config yaml>
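
For scripted reproduction, the launch step above could be wrapped as in the sketch below. The helper name `build_launch` is hypothetical and not part of pace or NDSL. It only assembles the command and environment described in this issue, and the rank count is left to the caller.

```python
import os


def build_launch(n_ranks, config_yaml):
    """Return (command, environment) for an orchestrated pace run.

    Hypothetical helper: it mirrors the mpirun line and the
    FV3_DACEMODE setting quoted in this issue, nothing more.
    """
    env = dict(os.environ)
    env["FV3_DACEMODE"] = "BuildAndRun"  # enables orchestration, per the issue
    cmd = ["mpirun", "-n", str(n_ranks), "python", "-m", "pace.run", config_yaml]
    return cmd, env
```

The returned pair can be handed to `subprocess.run(cmd, env=env, check=True)` on a machine where mpirun and pace are installed.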

Expected behavior
DaCe should be able to use orchestration regardless of variable usage.

System Environment
Describe the system environment, include:

  • OS: RHEL 8.9
  • Backend used: dace:cpu
  • Environment variables set: FV3_DACEMODE=BuildAndRun
  • Compiler(s): gcc/12.3.0, python/3.11
  • MPI type, and version: openmpi/5.0.0
  • netCDF Version: netcdf/4.9.2
  • If this bug came from a model run, which model: baroclinic_c12_orch_cpu.yaml

Another case popped up of the same issue: unused variables create a parsing issue in orch:dace:X.

After some digging: the issue lives in the gt4py/dace bridge.

In orchestration, the StencilFactory uses a lazy_stencil (code) to defer the build to JIT time. This mechanism relies on the DaCeLazyStencil (code).

1/ When DaCe takes over to move the code into the SDFG IR, the __sdfg__ function gets executed. The stencil gets packed into an unexpanded version whose signature is given here as a simple list of the declared arguments.

2/ Later on, as orchestration progresses, DaCe turns the OIR (the last GT4Py IR) into an SDFG (here). The bug manifests at this point, where the bridge attempts to build the inputs/outputs using the node.params list.

The bug is that, by then, the GT pipeline might have culled a parameter that existed in 1/ because it is unused. But 1/ has already promised this parameter to the system, while 2/ follows the OIR, where the parameter no longer exists: bug.
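
To make the mismatch concrete, here is a toy model in plain Python. It does not use gt4py or DaCe, and every function and field name in it (`declared_signature`, `prune_unused`, `build_connectors`, `qvapor`, `qrain`) is hypothetical. The real error message and the place where it is raised will likely differ.

```python
# Toy model of the mismatch; none of these names are real gt4py or DaCe APIs.


def declared_signature(stencil_args):
    """Step 1/: promise every declared argument, used or not."""
    return list(stencil_args)


def prune_unused(stencil_args, used_names):
    """Stand-in for the pipeline pass that culls unused parameters."""
    return [name for name in stencil_args if name in used_names]


def build_connectors(promised, oir_params):
    """Step 2/: build inputs/outputs from the parameters left in the OIR."""
    missing = [name for name in promised if name not in oir_params]
    if missing:
        raise ValueError(f"promised but absent from OIR params: {missing}")
    return {name: f"__in_{name}" for name in oir_params}


# d0_vap and lv00 are declared but never read by the stencil body.
args = ["qvapor", "qrain", "d0_vap", "lv00"]
promised = declared_signature(args)
oir_params = prune_unused(args, used_names={"qvapor", "qrain"})

error_message = None
try:
    build_connectors(promised, oir_params)
except ValueError as err:
    # The two steps disagree about which parameters exist.
    error_message = str(err)
```

Step 1/ records four arguments and step 2/ only sees two, so the wiring step has nothing to attach d0_vap and lv00 to.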

To fix this, 1/ and 2/ need to agree. The issue is that 1/ doesn't know what kind of operations will happen next, and 2/ is logically looking at the "truth" coming out of the pipeline. Neither is wrong per se, but neither has access to all the information it needs.

One way to deal with this would be to deactivate the optimization pass that culls unused parameters (skip attribute here), but this might have side effects on performance. A better fix is to see whether 1/ or 2/ can be changed without breaking either the stencil or the orchestration pipeline.
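
The two directions can be compared in a toy sketch. It is plain Python with hypothetical helpers (`run_pipeline`, `reconcile`) and hypothetical field names, not gt4py or DaCe code. Option B is only one conceivable shape for "changing 1/ or 2/". Whether the bridge can actually do this without breaking the stencil pipeline is the open question.

```python
# Toy sketch only; these helpers are hypothetical, not gt4py or DaCe code.


def run_pipeline(declared, used_names, skip_pruning=False):
    """Return the parameters that survive the (toy) optimization pipeline."""
    if skip_pruning:
        return list(declared)
    return [name for name in declared if name in used_names]


def reconcile(declared, surviving):
    """Split declared arguments into those to wire up and those to ignore."""
    kept = [name for name in declared if name in surviving]
    dropped = [name for name in declared if name not in surviving]
    return kept, dropped


declared = ["qvapor", "qrain", "cracw"]  # cracw is declared but never read
used = {"qvapor", "qrain"}

# Option A: turn the pruning pass off so both steps see the same list.
params_unpruned = run_pipeline(declared, used, skip_pruning=True)

# Option B: keep pruning, but drop the culled names from the promised list.
params_pruned = run_pipeline(declared, used)
kept, dropped = reconcile(declared, params_pruned)
```

Option A keeps the signature stable at the possible cost of carrying dead parameters through the build. Option B keeps the optimization but needs the promised signature to be revised after the pipeline has run.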