LLNL/maestrowf

jobs are "INITIALIZED" but not starting

BenWibking opened this issue · 3 comments

I have a test study running on my laptop that appears stuck in this state:

===================================================================================================================================================================================
Step Name             Job ID    Workspace             State        Run Time        Elapsed Time    Start Time           Submit Time          End Time               Number Restarts
--------------------  --------  --------------------  -----------  --------------  --------------  -------------------  -------------------  -------------------  -----------------
generate-profile_0.1  26102     generate-profile/0.1  FINISHED     0d:00h:00m:02s  0d:00h:00m:02s  2024-01-29 13:54:55  2024-01-29 13:54:55  2024-01-29 13:54:57                  0
generate-profile_0.3  26108     generate-profile/0.3  FINISHED     0d:00h:00m:02s  0d:00h:00m:02s  2024-01-29 13:54:57  2024-01-29 13:54:57  2024-01-29 13:54:59                  0
generate-profile_1.0  26111     generate-profile/1.0  FINISHED     0d:00h:00m:02s  0d:00h:00m:02s  2024-01-29 13:54:59  2024-01-29 13:54:59  2024-01-29 13:55:01                  0
generate-infile_0.1   26303     generate-infile/0.1   FINISHED     0d:00h:00m:02s  0d:00h:00m:02s  2024-01-29 13:56:01  2024-01-29 13:56:01  2024-01-29 13:56:03                  0
generate-infile_0.3   26322     generate-infile/0.3   FINISHED     0d:00h:00m:01s  0d:00h:00m:01s  2024-01-29 13:56:03  2024-01-29 13:56:03  2024-01-29 13:56:04                  0
generate-infile_1.0   26340     generate-infile/1.0   FINISHED     0d:00h:00m:02s  0d:00h:00m:02s  2024-01-29 13:56:04  2024-01-29 13:56:04  2024-01-29 13:56:06                  0
run-sim_0.1           --        run-sim/0.1           INITIALIZED  --:--:--        --:--:--        --                   --                   --                                   0
run-sim_0.3           --        run-sim/0.3           INITIALIZED  --:--:--        --:--:--        --                   --                   --                                   0
run-sim_1.0           --        run-sim/1.0           INITIALIZED  --:--:--        --:--:--        --                   --                   --                                   0
===================================================================================================================================================================================

The run-sim_0.3 and run-sim_1.0 subdirectories are empty. The run-sim_0.1 subdirectory contains only the bash script generated from the workflow.

Is there any way to figure out what it's doing and why it appears to be stuck?

top shows that the simulation corresponding to run-sim_0.1 is running.

Is there some output buffering that would explain why I don't see any log files?

Unfortunately, I think that's currently expected for the local adapter. It runs in a blocking manner: it waits for the subprocess (the step's bash script) to finish before writing out the .out/.err log files. We do plan to unblock that with an executor backend so that it behaves like the HPC adapters, but that currently exists only in a dev branch.
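To illustrate the difference between the two behaviors, here is a minimal Python sketch. It is not maestrowf's actual adapter code. It contrasts a blocking runner, which captures output in memory and writes the logs only after the script exits, with a runner that redirects stdout/stderr to files at launch so they fill in while the step runs. The function names `run_blocking`/`run_streaming` and the `step.out`/`step.err` file names are hypothetical.

```python
import subprocess
import tempfile
from pathlib import Path


def run_blocking(script: Path, workspace: Path) -> int:
    """Hypothetical blocking runner: wait for the script to exit, then write logs.

    Nothing appears in the workspace until the subprocess finishes.
    """
    result = subprocess.run(
        ["bash", str(script)], cwd=workspace, capture_output=True, text=True
    )
    # Logs are only written here, after the step has completed.
    (workspace / "step.out").write_text(result.stdout)
    (workspace / "step.err").write_text(result.stderr)
    return result.returncode


def run_streaming(script: Path, workspace: Path) -> subprocess.Popen:
    """Hypothetical non-blocking runner: send stdout/stderr straight to files.

    The log files are created at launch and grow while the step runs;
    the caller polls or waits on the returned Popen handle.
    """
    with open(workspace / "step.out", "w") as out, open(workspace / "step.err", "w") as err:
        # The child inherits its own copies of these file descriptors, so
        # closing the parent's handles when the with-block exits is safe.
        proc = subprocess.Popen(
            ["bash", str(script)], cwd=workspace, stdout=out, stderr=err
        )
    return proc


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        ws = Path(d)
        script = ws / "step.sh"
        script.write_text("echo started\nsleep 1\necho done\n")

        proc = run_streaming(script, ws)
        # The step is normally still sleeping here, yet its log already exists.
        print("running:", proc.poll() is None)
        print("log exists:", (ws / "step.out").exists())
        proc.wait()
        print((ws / "step.out").read_text(), end="")
```

With the streaming pattern the log files exist as soon as the step starts, but whether they grow in real time still depends on the simulation itself flushing its output buffers.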

Thanks for the explanation and quick reply.