Mismatched dumped taskdoc and database state
Opened this issue · 14 comments
Hi Devs,
I'm working on my own pymatgen and atomate2 forks to implement additional parsed fields for FHI-AIMS calculations. I added an additional field to the aims CalculationOutput.
class CalculationOutput(BaseModel):
    total_time: float = Field(
        None, description="Total wall time for calculation"
    )
I can see in the remote directory created on the worker that the value is parsed correctly and written to the remote_job_data.json file.
However, when I query the database, that particular field is not present. The pymatgen and atomate2 installations are consistent between the runner and worker machines. Is there a good way to debug this? I assume something goes wrong during the runner communication, and I'm not quite sure how to go about analyzing the problem.
Thanks!
- Do the original, unpatched versions of these tools work correctly?
- Do your patched versions of these tools work correctly when everything runs on a single host?
Hi @cote3804,
this is a somewhat unexpected error. In fact, to avoid any pointless serialization/deserialization, the content of the remote_job_data.json file is basically just read and transferred directly to the output Store.
That point in the code could be a good place to check whether the information being inserted actually contains the new field.
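For example, something along these lines (a rough sketch, assuming the output JobStore is also reachable through your local jobflow configuration) can be used to inspect what actually gets inserted:

```python
# Rough sketch: inspect the document stored for a given job and check for the new field.
# Assumes the output JobStore is reachable through the local jobflow configuration;
# otherwise build the JobStore object from the jobflow-remote project definition.
from jobflow import SETTINGS

store = SETTINGS.JOB_STORE
store.connect()

doc = store.query_one({"uuid": "<job-uuid>"}, load=True)  # replace <job-uuid> with the Job uuid
print(doc["output"])  # the task document as it was inserted in the Store
```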
As an additional test, you can also set the delete_tmp_folder runner setting to False and check that the remote_job_data.json downloaded into the runner's temporary folder corresponds to the correct one from the worker.
Hi @gpetretto,
Thanks for the explanation of where to look in the code and for the tip on keeping the downloaded documents on the runner machine. After some debugging by running locally on the worker, it turns out the atomate2 code was the issue. After loading the JSON there was some additional post-processing that wasn't storing the new field in the doc that was uploaded to the database. I was under the impression that remote_job_data.json has the exact same structure as what's written to the database, although that doesn't appear to be true.
Thanks for the help!
Hi @cote3804,
good to hear that you managed to fix your issue, and thanks for reporting the results of your investigation. Still, my intention was to avoid passing through the processes of (de)serialization as much as possible, but it seems that is not the case here. Based on your tests, can you provide more details about the point where the atomate2 code was invoked during the completion of the job? Maybe the process can be optimized compared to how it is now.
When I was testing on the worker machine I would use run_locally through jobflow. The response returned by the job was the task doc created here in the atomate2 code:
https://github.com/materialsproject/atomate2/blob/5cccaa2a12e2ffb4a4f2269c9880b25ae932a0be/src/atomate2/aims/schemas/task.py#L456
Is that what you wanted? I'm not sure I fully understand your question. Part of my confusion is that I don't understand where jobflow-remote calls the atomate2 code. I assumed it all ran on the remote worker and that, once the remote_job_data.json was moved back to the runner machine, it would be uploaded directly to the database.
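Roughly, the local test I run looks like this (a sketch; the maker import and the structure are placeholders for whatever job is being debugged):

```python
# Sketch of the local debugging run on the worker; the maker and structure are placeholders.
from jobflow import run_locally
from pymatgen.core import Lattice, Structure

from atomate2.aims.jobs.core import StaticMaker  # placeholder for the maker being tested

structure = Structure(
    Lattice.cubic(5.43), ["Si", "Si"], [[0, 0, 0], [0.25, 0.25, 0.25]]
)

flow = StaticMaker().make(structure)
responses = run_locally(flow, create_folders=True, ensure_success=True)

# Each Response.output is the task doc returned by the job; this is where I check
# that the new field (e.g. total_time) shows up where I expect it.
for job_uuid, by_index in responses.items():
    for index, response in by_index.items():
        print(job_uuid, index, type(response.output))
```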
Hi @cote3804, sorry for the delay and for not being very clear in the previous message.
Indeed in principle the atomate2 code should not need to be called on the runner machine during the insertion in the DB, but it might be that some part of the code does some automatic deserialization that I have missed.
Since in the previous message you said that
"After loading the JSON there was some additional post-processing that wasn't storing the new field in the doc that was uploaded to the database"
did you manage to locate the part of the code in jobflow-remote where this additional post-processing is invoked?
If not, can you point to the change in the atomate2 code that caused the initial problem and how you solved it? Maybe I will be able to create a minimal example and reproduce the issue.
Thanks!
Hi @gpetretto
Coming back to this after a few days, I've realized that the issue is fully resolved and there's no unexpected post-processing being done by jobflow-remote on the remote_job_data.json. On all of the functioning calculations I've run over the last few days, the serialized job data matches the TaskDoc structure I expect.
Unrelatedly, is there an open issue or discussion about batching jobs in a single sbatch call? For example, could I split two GPUs on one compute node to run two AIMS calculations in parallel within one job submission with jobflow-remote? The workflows I'm running currently bombard the scheduler with hundreds of sbatch calls, and I wonder if there's a way to bundle many jobs into one sbatch call other than by writing a custom atomate2/jobflow job.
Thanks!
Good to hear the issue is solved.
Coming to your other question: yes, there is a way to run several jobflow Jobs in a single SLURM job. In the released version it is only possible to run multiple jobflow Jobs sequentially within a single SLURM job. In the current develop branch it is also possible to run multiple jobflow Jobs in parallel at the same time (on top of the sequential execution). If you are going to use that, I would suggest using the development branch, because I also fixed a couple of bugs related to batch submission. The updated documentation is here: https://github.com/Matgenix/jobflow-remote/blob/develop/doc/source/user/advancedoptions.rst#batch-submission
I warn you that this has not been used extensively, but if you try it I would be glad to get your feedback.
In case you are using the 0.1.4 version and switch to the development branch, you will likely encounter an issue with the upgrade procedure; here is the explanation of how to deal with it: #211 (comment)
Wow you all have built quite an excellent tool! I'll happily provide feedback as I use it.
In terms of specifying resources for individual jobs running in parallel in a batch (e.g. specifying srun -N 1 for two jobs running on two nodes), do you think this should be handled by jobflow-remote, or should the user set up the resource partitioning in jobflow? Please point me to the relevant issue if this is being discussed elsewhere.
Jobflow-remote does not have direct control over how the jobs are executed. I think it could be possible to expose the number of cores/nodes available in a place accessible from inside the jobflow Job, but this would mean that the Job implementation is strictly bound to jobflow-remote as a job manager.
To remain general, it is possible to set this in jobflow directly. For example, in atomate2 it is possible to set it through environment variables, e.g. setting atomate2_VASP_CMD to "srun --nodes 1 -n 24 --exclusive vasp_std". In jobflow-remote this could be done using an exec_config. An example of an exec_config entry to add to the configuration could be:
exec_config:
  multi_conf_1n_24p:
    modules:
      - ...
    export:
      atomate2_VASP_CMD: '"srun --nodes 1 -n 24 --exclusive vasp_std"'
I suppose you could do the same for FHI-AIMS.
You've proposed a good solution. Using exec_config to set environment variables is an appropriate method. For example:
exec_config:
  multi_conf_1n_24p:
    modules:
      - ...
    export:
      ATOMATE2_VASP_CMD: '"srun --nodes 1 -n 24 --exclusive vasp_std"'
Regarding case sensitivity:
Based on the following source code snippet:
werner@x13dai-t:~/Public/repo/github.com/materialsproject/atomate2.git$ cd "/home/werner/Public/repo/github.com/materialsproject/atomate2.git" && ug -I -A5 -B5 _ENV_PREFIX

src/atomate2/settings.py
8-
9- from pydantic import Field, model_validator
10- from pydantic_settings import BaseSettings, SettingsConfigDict
11-
12- _DEFAULT_CONFIG_FILE_PATH = "~/.atomate2.yaml"
13: _ENV_PREFIX = "atomate2_"
14-
15-
16- class Atomate2Settings(BaseSettings):
17-     """
18-     Settings for atomate2.
--
213-     )
214-     ABINIT_MAX_RESTARTS: int = Field(
215-         5, description="Maximum number of restarts of a job."
216-     )
217-
218:     model_config = SettingsConfigDict(env_prefix=_ENV_PREFIX)
219-
220-     # QChem specific settings
221-
222-     QCHEM_CMD: str = Field(
223-         "qchem", description="Command to run standard version of qchem."
--
254-         This allows setting of the config file path through environment variables.
255-         """
256-         from monty.serialization import loadfn
257-
258-         config_file_path = values.get(key := "CONFIG_FILE", _DEFAULT_CONFIG_FILE_PATH)
259:         env_var_name = f"{_ENV_PREFIX.upper()}{key}"
260-         config_file_path = Path(config_file_path).expanduser()
261-
262-         new_values = {}
263-         if config_file_path.exists():
264-             if config_file_path.stat().st_size == 0:

tests/common/test_settings.py
1- from pathlib import Path
2-
3- import pytest
4- from pydantic import ValidationError
5-
6: from atomate2.settings import _DEFAULT_CONFIG_FILE_PATH, _ENV_PREFIX, Atomate2Settings
7-
8-
9- def test_empty_and_invalid_config_file(
10-     clean_dir, monkeypatch: pytest.MonkeyPatch, capsys: pytest.CaptureFixture[str]
11- ):
12-     # test no warning if config path is default and file does not exist
13:     env_var_name = f"{_ENV_PREFIX.upper()}CONFIG_FILE"
14-     monkeypatch.delenv(env_var_name, raising=False)
15-     settings = Atomate2Settings()
16-     assert settings.CONFIG_FILE == _DEFAULT_CONFIG_FILE_PATH
17-     stdout, stderr = capsys.readouterr()
18-     assert stdout == ""
We can see and summarize that:
- When setting environment variables directly, uppercase is recommended (e.g., ATOMATE2_VASP_CMD).
- In configuration files, both cases work, but for consistency it's advisable to follow the example above and use uppercase.
This approach maintains compatibility with atomate2 while allowing flexible resource configuration in jobflow-remote.
See here and here for the related documentation and discussion.
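As a quick interactive check (a minimal sketch, run in the same Python environment that executes the atomate2 jobs), the uppercase variable is indeed picked up:

```python
# Minimal sketch: verify that the uppercase environment variable is picked up by atomate2.
import os

os.environ["ATOMATE2_VASP_CMD"] = "srun --nodes 1 -n 24 --exclusive vasp_std"

from atomate2.settings import Atomate2Settings

print(Atomate2Settings().VASP_CMD)
# expected: srun --nodes 1 -n 24 --exclusive vasp_std
```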
Based on https://github.com/materialsproject/atomate2/blob/main/tests/common/test_settings.py, I put together a more comprehensive test, as shown below:
(datasci) werner@x13dai-t:~/test$ cat test_atomate2_settings.py
from pathlib import Path

import pytest
from pydantic import ValidationError

from atomate2.settings import (
    _DEFAULT_CONFIG_FILE_PATH,
    _ENV_PREFIX,
    Atomate2Settings,
)


def test_empty_and_invalid_config_file(
    clean_dir, monkeypatch: pytest.MonkeyPatch, capsys: pytest.CaptureFixture[str]
):
    # test no warning if config path is default and file does not exist
    env_var_name = f"{_ENV_PREFIX.upper()}CONFIG_FILE"
    monkeypatch.delenv(env_var_name, raising=False)
    settings = Atomate2Settings()
    assert settings.CONFIG_FILE == _DEFAULT_CONFIG_FILE_PATH
    stdout, stderr = capsys.readouterr()
    assert stdout == ""
    assert stderr == ""

    # set path to load settings from with ATOMATE2_CONFIG_FILE env variable
    config_file_path = Path.cwd() / "test-atomate2-config.yaml"
    monkeypatch.setenv(env_var_name, str(config_file_path))
    settings = Atomate2Settings()
    assert str(config_file_path) == settings.CONFIG_FILE
    assert settings.SYMPREC == 0.1
    assert settings.BANDGAP_TOL == 1e-4
    assert settings.VASP_RUN_BADER is False
    assert settings.VASP_RUN_DDEC6 is False
    assert settings.DDEC6_ATOMIC_DENSITIES_DIR is None

    # test warning if config file exists but is empty
    config_file_path.touch()
    with pytest.warns(UserWarning, match=f"Using {env_var_name} at .+ but it's empty"):
        Atomate2Settings()

    # test error if the file exists and contains invalid YAML
    with open(config_file_path, "w") as file:
        file.write("invalid yaml")
    with pytest.raises(SyntaxError, match=f"{env_var_name} at"):
        Atomate2Settings()

    # test error if the file exists and contains invalid settings
    with open(config_file_path, "w") as file:
        file.write("VASP_CMD: 42")
    with pytest.raises(
        ValidationError,
        match="1 validation error for Atomate2Settings\nVASP_CMD\n "
        "Input should be a valid string ",
    ):
        Atomate2Settings()

    # another invalid setting
    with open(config_file_path, "w") as file:
        file.write("BANDGAP_TOL: invalid")
    with pytest.raises(
        ValidationError,
        match="1 validation error for Atomate2Settings\nBANDGAP_TOL\n "
        "Input should be a valid number",
    ):
        Atomate2Settings()

    # test warning if config path is non-default and file does not exist
    config_file_path.unlink()
    with pytest.warns(UserWarning, match=f"{env_var_name} at .+ does not exist"):
        Atomate2Settings()


@pytest.fixture
def clean_dir(tmp_path):
    """Provide a clean directory for testing."""
    from pathlib import Path
    import os

    original_dir = Path.cwd()
    os.chdir(tmp_path)
    yield tmp_path
    os.chdir(original_dir)


@pytest.fixture
def config_file(tmp_path):
    config = tmp_path / "test-atomate2-config.yaml"
    config.write_text("""
VASP_CMD: vasp_from_config
SYMPREC: 0.2
BANDGAP_TOL: 0.002
VASP_RUN_BADER: false
CP2K_CMD: cp2k_from_config
""")
    return config


def test_environment_variables_override_config_file(monkeypatch, config_file):
    env_var_name = f"{_ENV_PREFIX.upper()}CONFIG_FILE"
    monkeypatch.setenv(env_var_name, str(config_file))
    monkeypatch.setenv(f"{_ENV_PREFIX.upper()}VASP_CMD", "vasp_from_env")
    monkeypatch.setenv(f"{_ENV_PREFIX.upper()}SYMPREC", "0.3")
    monkeypatch.setenv(f"{_ENV_PREFIX.upper()}BANDGAP_TOL", "0.003")
    monkeypatch.setenv(f"{_ENV_PREFIX.upper()}VASP_RUN_BADER", "true")

    settings = Atomate2Settings()

    # Check that the config file is being used
    assert settings.CONFIG_FILE == str(config_file), "Config file path should be correctly set"

    # Check priorities
    assert settings.VASP_CMD == "vasp_from_env", "Environment variable should override config file"
    assert settings.SYMPREC == 0.3, "Environment variable should override config file"
    assert settings.BANDGAP_TOL == 0.003, "Environment variable should override config file"
    assert settings.VASP_RUN_BADER is True, "Environment variable should override config file"
    assert settings.CP2K_CMD == "cp2k_from_config", "Config file value should be used when no environment variable is set"


def test_default_values_when_not_set(monkeypatch, tmp_path):
    empty_config = tmp_path / "empty_config.yaml"
    empty_config.touch()
    env_var_name = f"{_ENV_PREFIX.upper()}CONFIG_FILE"
    monkeypatch.setenv(env_var_name, str(empty_config))

    settings = Atomate2Settings()

    # Check that the empty config file is being used
    assert settings.CONFIG_FILE == str(empty_config), "Empty config file path should be correctly set"

    # Check default values
    assert settings.VASP_CMD == "vasp_std", "Should use default value when not set in config or environment"
    assert settings.SYMPREC == 0.1, "Should use default value when not set in config or environment"
    assert settings.BANDGAP_TOL == 0.0001, "Should use default value when not set in config or environment"
    assert settings.VASP_RUN_BADER is False, "Should use default value when not set in config or environment"
    assert settings.CP2K_CMD == "cp2k.psmp", "Should use default value when not set in config or environment"

(datasci) werner@x13dai-t:~/test$ pytest test_atomate2_settings.py -v
======================================================================== test session starts =========================================================================
platform linux -- Python 3.11.1, pytest-8.3.3, pluggy-1.5.0 -- /home/werner/.pyenv/versions/3.11.1/envs/datasci/bin/python3
cachedir: .pytest_cache
rootdir: /home/werner/test
plugins: split-0.10.0, dash-2.16.1, nbmake-1.5.4, anyio-4.6.2.post1, cov-6.0.0, mock-3.14.0
collected 3 items

test_atomate2_settings.py::test_empty_and_invalid_config_file PASSED      [ 33%]
test_atomate2_settings.py::test_environment_variables_override_config_file PASSED      [ 66%]
test_atomate2_settings.py::test_default_values_when_not_set PASSED      [100%]

========================================================================== warnings summary ==========================================================================
test_atomate2_settings.py::test_empty_and_invalid_config_file
  /home/werner/.pyenv/versions/3.11.1/envs/datasci/lib/python3.11/site-packages/pydantic/main.py:214: UserWarning: ATOMATE2_CONFIG_FILE at /tmp/pytest-of-werner/pytest-4/test_empty_and_invalid_config_0/test-atomate2-config.yaml does not exist
    validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)

test_atomate2_settings.py::test_default_values_when_not_set
  /home/werner/.pyenv/versions/3.11.1/envs/datasci/lib/python3.11/site-packages/pydantic/main.py:214: UserWarning: Using ATOMATE2_CONFIG_FILE at /tmp/pytest-of-werner/pytest-4/test_default_values_when_not_s0/empty_config.yaml but it's empty
    validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=================================================================== 3 passed, 2 warnings in 0.29s ====================================================================
Thanks for the help @hongyi-zhao and @gpetretto. Configuring the srun arguments with exec_config environment variables is a nice solution.
I've been testing batch submission and have come across an issue: jobs in the running/ directory timed out before finishing, and now the runner is submitting batch jobs that time out with this message:
jobflow_remote.jobs.run: No jobs available for more than 60 seconds. Stopping.
The runner state of those jobs is stuck in BATCH_RUNNING even though those jobs are never actually run on the worker.
I was able to copy the files in the running/ directory over to the submitted/ directory and remove the text that follows the index in the filename. As an example,
9494a268-e62a-4228-8d08-7835d8854ce5_1_cb794721-2afe-4617-b294-0e7b1bec4907 -> 9494a268-e62a-4228-8d08-7835d8854ce5_1
The job then ran successfully and the runner machine shows the state updated correctly.
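Concretely, the manual fix looked something like this (a rough sketch; the handle directory path is a placeholder for the batch jobs directory configured on my worker):

```python
# Rough sketch of the manual cleanup: move stale entries from the batch running/
# directory back to submitted/ and strip the trailing id from the file name.
from pathlib import Path

handle_dir = Path("/path/to/batch/jobs_handle_dir")  # placeholder path
running_dir = handle_dir / "running"
submitted_dir = handle_dir / "submitted"

for stale in running_dir.iterdir():
    # file names look like "<uuid>_<index>_<extra-id>"; keep only "<uuid>_<index>"
    uuid, index = stale.name.split("_")[:2]
    stale.rename(submitted_dir / f"{uuid}_{index}")
```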
Is there a better way to handle file cleanup and resetting the state of a timed-out job, or are these features under development right now? This discussion is now quite far from the title of the post, so I can open a new issue if you all think that's better.
Hi @cote3804, thanks a lot for the feedback on the batch submission!
I imagine that the issue you encountered is likely due to a bug. I am not sure what you mean exactly by
"jobs in the running/ directory timed out before finishing"
Do you mean that the SLURM job reached its walltime before the jobflow Job finished?
Also, it is not clear why the runner kept submitting new jobs if there was no Job left in the submitted/ folder. This seems like a bug.
If, despite fixing the bug, situations like the one you mention keep arising, I can provide a way to move those files, but maybe this could be handled in jf job rerun (for example, I could make it so that when a job is rerun its file is also removed from the running/ folder). I suppose this would have also solved your issue.
Anyway, indeed it would be better to move this discussion to a dedicated issue. Can you open a new one on this topic?