equinor/ert

job_runner classifies successful forward_model as failure

Closed this issue · 3 comments

Describe the bug
When testing Drogon in Azure, the following failure is observed:

realization-0/iter-0]$ cat ERROR
<error>
  <time>10:23:22</time>
  <job>DESIGN_KW</job>
  <reason>The target file:DESIGN_KW.OK has not been updated; this is flagged as failure. mtime:1653294197.0   stat_start_time:1653294197.0</reason>
  <stderr>
<Not written by:DESIGN_KW>
</stderr>
</error>

In this case, DESIGN_KW successfully created the file as it should, but it looks like the filesystem acts too fast for:

if stat.st_mtime > target_file_mtime:

Looking at status.json, the previous design_kw forward models seem to have completed within ~1 ms.
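
To illustrate the failure mode, here is a minimal sketch (not the actual job_runner code). `target_updated` is a hypothetical stand-in for the strict `>` check quoted above, and `os.utime` pins the mtime to the value from the ERROR file so that the "rewritten within the same timestamp" case is reproducible:

```python
import os
import tempfile


def target_updated(path, start_mtime):
    """Hypothetical stand-in for the check: strictly newer than the start mtime."""
    return os.stat(path).st_mtime > start_mtime


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "DESIGN_KW.OK")

    # Target file left over from an earlier run, with the mtime seen in ERROR.
    with open(target, "w") as fh:
        fh.write("old\n")
    os.utime(target, (1653294197.0, 1653294197.0))
    start_mtime = os.stat(target).st_mtime

    # The job rewrites the file, but the new mtime lands on the same value.
    with open(target, "w") as fh:
        fh.write("new\n")
    os.utime(target, (1653294197.0, 1653294197.0))
    same_tick_detected = target_updated(target, start_mtime)

    # One second later the same check would have passed.
    os.utime(target, (1653294198.0, 1653294198.0))
    later_tick_detected = target_updated(target, start_mtime)

print(same_tick_detected, later_tick_detected)
```

With equal timestamps the strict comparison reports "not updated" even though the content changed, which matches the mtime and stat_start_time values in the ERROR file.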

To Reproduce
Steps to reproduce the behavior:
1. Run Drogon in Azure with the Torque driver, here tested with the drogon_design.ert.

Expected behavior
Successful jobs should be classified as such.

Environment

  • OS: RHEL7
  • ERT/Komodo Release: 2.35, komodo-stable
  • Python version: 3.8
  • Remote/HPC execution involved: yes

I see two possible solutions

  1. use the nanosecond-version, i.e. stat.st_mtime_ns, see https://docs.python.org/3.6/library/os.html#os.stat_result and in particular the note about resolution

  2. use a hash of the file-content instead of the mtime (we don't use the time itself, only whether the file has changed)

The latter is more robust and my preference, but may require slightly more processing-time.
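
A rough sketch of the two options, to make the trade-off concrete. This is not the eventual patch; the helper names `mtime_ns` and `content_digest` and the file name `TARGET.OK` are made up for illustration, and SHA-256 is just one possible choice of hash:

```python
import hashlib
import os
import tempfile


def mtime_ns(path):
    """Option 1: modification time as integer nanoseconds."""
    return os.stat(path).st_mtime_ns


def content_digest(path):
    """Option 2: SHA-256 of the file content (reads the whole file)."""
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()


STAMP_NS = 1653294197 * 10**9  # timestamp from the ERROR file, in nanoseconds

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "TARGET.OK")  # hypothetical target file

    with open(target, "w") as fh:
        fh.write("run 1\n")
    os.utime(target, ns=(STAMP_NS, STAMP_NS))
    before_ns = mtime_ns(target)
    before_digest = content_digest(target)

    # Simulate a rewrite that ends up with an identical timestamp.
    with open(target, "w") as fh:
        fh.write("run 2\n")
    os.utime(target, ns=(STAMP_NS, STAMP_NS))

    ns_changed = mtime_ns(target) != before_ns
    digest_changed = content_digest(target) != before_digest

print(ns_changed, digest_changed)
```

The digest notices the change even when the timestamps are identical. The flip side is that a hash cannot detect a file that is rewritten with exactly the same bytes, so it only works if the job writes something that differs between runs.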

I'll cook up a patch but someone else needs to test it in Azure.

Still occurs after merging #3428 :

The target file:DESIGN_KW.OK has not been updated; this is flagged as failure. mtime:1654868393.0 stat_start_time:1654868393000000000
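
Note that the two values in this message appear to be in different units: mtime looks like seconds (a float), while stat_start_time looks like integer nanoseconds. Whether the check actually compares these two values directly is an assumption on my part, but if it does, it can never pass, as this small sketch using the numbers from the message shows:

```python
# Values copied from the error message above.
mtime_seconds = 1654868393.0
stat_start_time_ns = 1654868393000000000

# Seconds compared against nanoseconds: never "newer", no matter how late.
mixed_units_newer = mtime_seconds > stat_start_time_ns
a_year_later_newer = (mtime_seconds + 365 * 86400) > stat_start_time_ns

# Same units on both sides: a change of a single nanosecond is detected.
mtime_ns = stat_start_time_ns + 1  # hypothetical mtime, 1 ns after the start
same_units_newer = mtime_ns > stat_start_time_ns

print(mixed_units_newer, a_year_later_newer, same_units_newer)
```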

Apparently, this issue was solved by removing TARGET_FILE from the job configurations, since the TARGET_FILE check was what failed for small jobs verifying that they had succeeded. For reference: equinor/semeio#431
Closing this one then.