pyiron/pyiron_atomistics

job crashes early in hdfio

Opened this issue · 10 comments

Summary

A SPHInX (restart) job fails to run due to a failure in hdfio. The error message is "ValueError: Objects can be only recovered from hdf5 if TYPE is given".

I cannot tell if this is related to restart.

pyiron Version and Platform

cmti

Expected Behavior

Job runs.

Actual Behavior

Job crashes.
Job execution fails with the following error.out:

> Traceback (most recent call last):
>   File "/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/runpy.py", line 196, in _run_module_as_main
>     return _run_code(code, main_globals, None,
>   File "/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/runpy.py", line 86, in _run_code
>     exec(code, run_globals)
>   File "/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/site-packages/pyiron_base/cli/__main__.py", line 3, in <module>
>     main()
>   File "/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/site-packages/pyiron_base/cli/control.py", line 59, in main
>     args.cli(args)
>   File "/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/site-packages/pyiron_base/cli/wrapper.py", line 37, in main
>     job_wrapper_function(
>   File "/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/site-packages/pyiron_base/jobs/job/wrapper.py", line 161, in job_wrapper_function
>     job = JobWrapper(
>   File "/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/site-packages/pyiron_base/jobs/job/wrapper.py", line 64, in __init__
>     self.job = pr.load(int(job_id))
>   File "/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/site-packages/pyiron_base/project/jobloader.py", line 104, in __call__
>     return super().__call__(job_specifier, convert_to_object=convert_to_object)
>   File "/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/site-packages/pyiron_base/project/jobloader.py", line 75, in __call__
>     return self._project.load_from_jobpath(
>   File "/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/site-packages/pyiron_base/project/generic.py", line 1001, in load_from_jobpath
>     job = job.to_object()
>   File "/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/site-packages/pyiron_base/jobs/job/core.py", line 596, in to_object
>     return self.project_hdf5.to_object(object_type, **qwargs)
>   File "/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/site-packages/pyiron_base/storage/hdfio.py", line 1142, in to_object
>     return _to_object(self, class_name, **kwargs)
>   File "/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_latest/lib/python3.10/site-packages/pyiron_base/storage/hdfio.py", line 117, in _to_object
>     raise ValueError("Objects can be only recovered from hdf5 if TYPE is given")
> ValueError: Objects can be only recovered from hdf5 if TYPE is given
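
For reference, a minimal sketch for checking whether the TYPE node that pyiron looks for is actually present in the job's HDF5 file (the file path and group name below are hypothetical placeholders, and the exact layout may differ between pyiron versions):

```python
import h5py

# Hypothetical placeholders -- substitute the actual job HDF5 file and job group.
hdf_path = "E20Vnmmtest.h5"
job_group = "E20Vnm_test"

with h5py.File(hdf_path, "r") as f:
    if job_group not in f:
        print(f"group '{job_group}' not found, top-level keys: {list(f.keys())}")
    else:
        grp = f[job_group]
        # to_object() raises the ValueError above when this node is missing.
        if "TYPE" in grp:
            print("TYPE:", grp["TYPE"][()])
        else:
            print("TYPE is missing; keys in job group:", list(grp.keys()))
```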

Steps to Reproduce

Unclear. Deleting and setting up the job again reproduces the error.

Hm, there is not a single line coming from SPHInX in the error message. Do you have a small code snippet to reproduce the error?

pmrv commented

Could it be that there's a stray entry in the database from a time when you deleted the job files manually outside of pyiron?
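
If that is the case, something like this could help to check (a sketch; `job_table()` and `remove_job()` are standard pyiron calls, while the project path and job name are placeholders):

```python
from pyiron_atomistics import Project

pr = Project("path/to/the/project")  # placeholder project path

# List everything the database still knows about this project;
# a stray row typically belongs to a job whose files were deleted by hand.
print(pr.job_table()[["id", "job", "status"]])

# Removing the job through pyiron also cleans up the database entry.
pr.remove_job(job_specifier="E20Vnm-test")  # placeholder job name
```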

Can you also maybe try to see whether a different version of pyiron helps? It might help us figure out which changes could have caused the problem.

Changing to pyiron/2024-05-20 seemed to help. I was on pyiron/latest before, which apparently is NOT latest. Is it possible that the pyiron version used on the cluster is incompatible with the pyiron/latest on the login node?

This is a VERY frustrating experience I am having here: loads of incomprehensible warnings, and error messages with zero information value. 'Objects can be only recovered from hdf5 if TYPE is given' is essentially a 'some error occurred'.

I am closing the ticket; nothing to be gained here any more.

> Changing to pyiron/2024-05-20 seemed to help. I was on pyiron/latest before, which apparently is NOT latest. Is it possible that the pyiron version used on the cluster is incompatible with the pyiron/latest on the login node?

@niklassiemer Can you comment on this?

Hmmm, to my taste the issue got closed a bit too early. If there are updates, I would appreciate it if you posted them here.

> Changing to pyiron/2024-05-20 seemed to help. I was on pyiron/latest before, which apparently is NOT latest. Is it possible that the pyiron version used on the cluster is incompatible with the pyiron/latest on the login node?
>
> @niklassiemer Can you comment on this?

pyiron/latest is indeed, after all, the hand-updated environment with Python 3.10, which is somewhat older than yesterday's docker-stack build. However, the version on the cluster and the one on the login node should not differ! The kernel chosen in the notebook should also be loaded on the compute node, since the environment is preserved. If this is not the case, I need to know and find a solution!
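
To verify this, one could run the same few lines once in the notebook and once inside a submitted job and compare the output (a sketch, assuming the usual `__version__` attributes):

```python
import sys
import pyiron_base
import pyiron_atomistics

# Run this both in the notebook on the login node and inside a job on
# the compute node; any mismatch points at diverging environments.
print("python            :", sys.version.split()[0])
print("executable        :", sys.executable)
print("pyiron_base       :", pyiron_base.__version__)
print("pyiron_atomistics :", pyiron_atomistics.__version__)
```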

I got the problem again with the new kernel, so it is not about the Python kernel.

I solved the problem again, this time by avoiding the minus sign in the job name. I may have done this last time, too.

Is it possible that the appearance of a minus sign in the job name causes issues? It seems reproducible:

- `E20Vnm-test`: fails in hdfio
- `E20Vnm_neutral`: runs

Another thought: this could be an inconsistency in the name normalization. In the HDF5 file name the '-' seems to be replaced by 'm', while in the job table the '-' is still there. In the working directory it becomes E20Vnmmtest_hdf/E20Vnm-test/, i.e. a mixture of the two.
I got confused by this at some point, which is why I switched from minus to underscore. Still, the minus is more convenient for me to type, so chances are high I will do it again.
Also, when I remove the job via pr.remove_job, the _hdf5 directory stays in place.
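
For what it's worth, a minimal sketch to make the mismatch visible (the `Sphinx` job type and the attribute names are the usual pyiron ones; whether the resulting paths are consistent is exactly the question):

```python
from pyiron_atomistics import Project

pr = Project("name_test")  # placeholder scratch project

job = pr.create.job.Sphinx("E20Vnm-test")

# Compare how the same name shows up in the job object, the HDF5 file
# and the working directory; a mismatch reproduces the observation above.
print("job_name         :", job.job_name)
print("hdf5 file        :", job.project_hdf5.file_name)
print("working directory:", job.working_directory)
```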

Thanks for coming back to this! This could indeed be a reason! I opened an issue on pyiron_base.