Matgenix/jobflow-remote

Successful Response not being written to jfremote_out.json


Hi devs,

Issue Description

I have recently started encountering an issue where my jobflow jobs run to completion and the output TaskDocs are created correctly, but jobflow-remote doesn't write the response to jfremote_out.json.

This problem initially occurred in the user-workstation setup. I received this error when inspecting failing jobs:
[screenshot of the error message from the failing job]

I checked the directories on the remote machine and found very sparsely populated jfremote_out.json files.

I tried to run the job the way I believe jobflow-remote runs it, directly on an interactive compute node, with the following code:

from jobflow_remote.jobs.run import run_remote_job
from atomate2 import SETTINGS

# command for running JDFTx and sourcing the environment variables
SETTINGS.JDFTX_CMD = (
    "source /global/homes/c/cote3804/bin/jobflow/jdftx_env.sh && "
    "srun -n 1 /path/to/jdftx_gpu"
)

run_dir = "/path/to"  # directory containing jfremote_in.json
run_remote_job(run_dir=run_dir)

This code runs successfully, but it still writes the same sparse jfremote_out.json.

I've attached the jfremote_in.json file if anyone wants to run the code themselves. It will require using our forks of pymatgen and atomate2.

I've gone through debugging and it seems like the issue occurs on line 63 of jobflow_remote/jobs/run.py, in which the output of the response is set to None before the file is written. I have not changed my jobflow-remote build in a while, so I'm not sure why this issue would pop up now.

Build Details

  • running on NERSC Perlmutter
  • jobflow 0.1.18
  • jobflow-remote 0.1.4
  • pymatgen and atomate2 installed in editable mode from the forks above
  • python 3.12.4

Thanks!

Hi @cote3804,

thanks for reporting this issue with all the details.

Let me first clarify one point: the fact that jfremote_out.json is sparsely populated is expected. Aside from the data that you see, it could contain values for detour, addition and replace, but at first glance it seems that these are not generated by your Job. On the other hand, output will always be null in this file, because the output is already stored in another file that represents the JobStore.
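
For illustration, this is roughly what the response portion of a "sparse" jfremote_out.json can look like for a Job that triggers no dynamic actions. The field names below are taken from jobflow's Response model; the exact layout of the file is an assumption on my part, not a verbatim dump:

# sketch only: keys from jobflow's Response fields, overall layout assumed
{
    "response": {
        "output": None,       # always None here; the real output lives in the JobStore file
        "detour": None,       # would hold Jobs/Flows inserted as a detour
        "addition": None,     # would hold Jobs/Flows appended after this Job
        "replace": None,      # would hold a replacement for this Job
        "stored_data": None,
        "stop_children": False,
        "stop_jobflow": False,
    },
}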

Now, to come to the error, this looks a bit tricky. Your job was executed on the remote machine in the folder defined as run_dir in your screenshot, and, as you confirmed, the jfremote_out.json existed there. If it did not exist you would have gotten a different error during the download phase (something like "Remote error: file /run/path/jfremote_out.json for job UUID does not exist") and the job would have ended up in the REMOTE_ERROR state. So this should mean that the file was downloaded successfully, or at least that fabric did not raise any error even if the transfer was not successful.
But then the Runner goes on trying to complete the job and claims that the file does not exist in the folder where it should have been downloaded (i.e. /home/coopy/.jfremote/...). The error that you get comes from here:

if not out_path.exists():
    msg = (
        f"The output file {OUT_FILENAME} was not present in the download "
        f"folder {local_path} and it is required to complete the job"
    )

where the code only checks for the existence of the file; it has not even attempted to parse it at this point.

I can imagine a few possible cases for how you get to this point:

  1. the file was downloaded, but for some reason jobflow-remote does not find it (wrong path, the file being deleted in the meanwhile, ...)
  2. the file was not downloaded, but no error was raised from the connection.
  3. you have two Runner processes pointing to the same queue DB, but executed on different machines.

Option one seems quite unlikely, unless you or some other process altered the content of that folder. So I would rather expect it to be one of the other two.
Of course there may be other explanations, but these seem the most likely reasons.
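
To illustrate case 2: fabric can return without raising even though the transferred file did not actually make it, so an explicit post-transfer check is the kind of safeguard that would catch it. This is a sketch under assumed paths and host name ("perlmutter"), not code from jobflow-remote:

from pathlib import Path

from fabric import Connection

# "perlmutter" and both paths are placeholders for this example
with Connection("perlmutter") as conn:
    result = conn.get("/run/path/jfremote_out.json", local="/tmp/jfremote_out.json")
    # verify that the file actually landed where fabric says it did
    if not Path(result.local).exists():
        raise RuntimeError("download reported success but the file is missing")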

A few points that could help with figuring out the source:

  • Do you confirm that the perlmutter worker is of the remote type? (if the worker is a local one there could be other potential ways of ending up in that error, or at least it would rule out a connection issue)
  • Can you check the content of the /home/coopy/.jfremote/IrO2/download/... folder mentioned in the error message? Does it contain the jfremote_out.json? And any other file? (it should also contain at least a remote_job_data.json file; a quick way to inspect it is sketched below)
  • Is your home on a standard workstation? Or could it be mounted on some NFS?
  • Did you recently switch to a different machine? Created any other project? Or made any other such test that could lead to two Runners pointing to the same DB?
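
For the second point, a quick way to list what actually ended up in the download folder (using the base path from the error message; the job-specific subfolder is elided here) could be:

from pathlib import Path

# base path from the error message; append the job-specific subfolder shown there
download_dir = Path("/home/coopy/.jfremote/IrO2/download")
for p in sorted(download_dir.rglob("*")):
    print(p)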

Hi @gpetretto

Issue number three is exactly the problem: I was running Runners on two user machines, and the second one indeed has the correct file path with a jfremote_out.json. It also has a remote_job_data.json file that seems to contain the correct TaskDoc. After shutting down the second runner, the issue seems fully resolved.

Thanks!

I am glad that you managed to fix the issue.

Just as a note: starting from the next jobflow-remote version, starting a Runner will also add some information to the DB and prevent the user from starting Runners on different machines at the same time. This should help prevent this kind of issue.
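
For anyone curious about the general idea, a guard like that can be implemented as a small lock document in the queue DB recording which host is running the Runner. The sketch below assumes a pymongo collection and invented names (queue_db, running_runner); it is not jobflow-remote's actual implementation:

import socket
from datetime import datetime, timezone

from pymongo import MongoClient

# hypothetical collection used to register the active Runner
coll = MongoClient()["queue_db"]["running_runner"]

existing = coll.find_one({})
if existing and existing["hostname"] != socket.gethostname():
    # another machine already registered a Runner on this queue DB
    raise RuntimeError(
        f"A Runner is already registered on {existing['hostname']}; "
        "refusing to start a second one."
    )

# register (or refresh) this machine as the active Runner
coll.replace_one(
    {},
    {"hostname": socket.gethostname(), "start_time": datetime.now(timezone.utc)},
    upsert=True,
)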