Files are not written for remote jobs
Leimeroth opened this issue · 20 comments
When trying to submit Lammps jobs to a remote cluster, only a .h5 file is created, but no input files or working directory. I guess the necessary call to write_input() went missing somewhere during the restructuring of the run functions.
EDIT:
For VASP it works, so the issue seems to be in the Lammps class.
Can you try to call job.validate_ready_to_run()
before submitting the job and check if that solves the issue?
job.validate_ready_to_run()
does not seem to change the behavior.
Manually doing
os.makedirs(job.working_directory)
job.write_input()
seems to do the job.
For potentials that are manually defined via a dataframe, the write_input_files_from_input_dict functionality breaks the
remote setup, because the file path of the potential does not exist on the remote cluster.
Traceback (most recent call last):
File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/site-packages/pyiron_base/cli/__main__.py", line 3, in <module>
main()
File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/site-packages/pyiron_base/cli/control.py", line 61, in main
args.cli(args)
File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/site-packages/pyiron_base/cli/wrapper.py", line 37, in main
job_wrapper_function(
File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/site-packages/pyiron_base/jobs/job/wrapper.py", line 186, in job_wrapper_function
job.run()
File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/site-packages/pyiron_base/jobs/job/wrapper.py", line 131, in run
self.job.run_static()
File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/site-packages/pyiron_base/jobs/job/generic.py", line 917, in run_static
execute_job_with_calculate_function(job=self)
File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/site-packages/pyiron_base/jobs/job/runfunction.py", line 720, in wrapper
output = func(job)
File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/site-packages/pyiron_base/jobs/job/runfunction.py", line 978, in execute_job_with_calculate_function
) = job.get_calculate_function()(**job.calculate_kwargs)
File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/site-packages/pyiron_base/jobs/job/runfunction.py", line 135, in __call__
self.write_input_funct(
File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/site-packages/pyiron_base/jobs/job/runfunction.py", line 80, in write_input_files_from_input_dict
shutil.copy(source, os.path.join(working_directory, file_name))
File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/shutil.py", line 417, in copy
copyfile(src, dst, follow_symlinks=follow_symlinks)
File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/shutil.py", line 254, in copyfile
with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: '/nfshome/leimeroth/MTP/AlCuZr/Fractions2//10/14//output.14.mtp'
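The failing copy can be reproduced in isolation. The sketch below is a minimal, self-contained approximation of the copying step (the function name mirrors the pyiron_base helper from the traceback, but the body is a simplification, not the actual implementation): every file referenced in the input dict is copied with shutil.copy, so an absolute path that only exists on the local workstation raises FileNotFoundError when the copy runs on the remote cluster.

```python
import os
import shutil
import tempfile

def copy_input_files(files_to_copy, working_directory):
    """Copy each referenced source file into the working directory.

    Simplified stand-in for write_input_files_from_input_dict:
    files_to_copy maps target file names to source paths.
    """
    os.makedirs(working_directory, exist_ok=True)
    for file_name, source in files_to_copy.items():
        shutil.copy(source, os.path.join(working_directory, file_name))

# Demonstration with a temporary directory standing in for the job folder
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "potential.mtp")
    with open(src, "w") as f:
        f.write("mtp parameters")
    workdir = os.path.join(tmp, "job_hdf5", "job")
    copy_input_files({"potential.mtp": src}, workdir)
    assert os.path.isfile(os.path.join(workdir, "potential.mtp"))

    # A source path that only exists on another machine fails exactly as
    # in the traceback above
    try:
        copy_input_files({"output.14.mtp": "/nonexistent/output.14.mtp"}, workdir)
    except FileNotFoundError:
        pass
```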
Edit: I use https://github.com/pyiron/pyiron_atomistics/tree/workaround-file-copying as a workaround on the remote hpc right now.
As far as I understand, the idea of the new workflow is to copy only the HDF5 file and write all necessary files on the remote machine, is this correct? If yes, I guess it is somehow necessary to make an exception for potentials that are not part of the default data repository. Also, I am somewhat afraid of issues arising from different pyiron versions/branches when the files are only written on the remote machine.
bump
Can you be a bit more specific about where the potential file /nfshome/leimeroth/MTP/AlCuZr/Fractions2//10/14//output.14.mtp
is located? Is this on the cluster or on the local workstation?
This is the full local path
Regarding file writing, I guess the problem is

def _check_if_input_should_be_written(self):
    if self._job_with_calculate_function:
        return False
    else:
        return not (
            self.server.run_mode.interactive
            or self.server.run_mode.interactive_non_modal
        )

always returning False for Lammps, so that
def save(self):
    """
    Save the object, by writing the content to the HDF5 file and storing an entry in the database.

    Returns:
        (int): Job ID stored in the database
    """
    self.to_hdf()
    if not state.database.database_is_disabled:
        job_id = self.project.db.add_item_dict(self.db_entry())
        self._job_id = job_id
        _write_hdf(
            hdf_filehandle=self.project_hdf5.file_name,
            data=job_id,
            h5_path=self.job_name + "/job_id",
            overwrite="update",
        )
        self.refresh_job_status()
    else:
        job_id = self.job_name
    if self._check_if_input_should_be_written():
        self.project_hdf5.create_working_directory()
        self.write_input()
    self.status.created = True
    print(
        "The job "
        + self.job_name
        + " was saved and received the ID: "
        + str(job_id)
    )
    return job_id
never calls write_input().
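To make the control flow concrete, here is a self-contained sketch using stub classes in place of pyiron's server and job objects (the stubs are hypothetical; only _check_if_input_should_be_written mirrors the quoted code): any job that carries a calculate function returns False from the check, so save() skips both create_working_directory() and write_input() regardless of the run mode.

```python
class RunMode:
    """Stub for pyiron's run-mode object with the two flags the check reads."""
    def __init__(self, interactive=False, interactive_non_modal=False):
        self.interactive = interactive
        self.interactive_non_modal = interactive_non_modal

class Server:
    """Stub server carrying only the run mode."""
    def __init__(self, run_mode):
        self.run_mode = run_mode

class Job:
    def __init__(self, job_with_calculate_function, run_mode=None):
        self._job_with_calculate_function = job_with_calculate_function
        self.server = Server(run_mode or RunMode())

    # Mirrors the method quoted above
    def _check_if_input_should_be_written(self):
        if self._job_with_calculate_function:
            return False
        return not (
            self.server.run_mode.interactive
            or self.server.run_mode.interactive_non_modal
        )

# A Lammps-style job with a calculate function: input is never written
assert Job(job_with_calculate_function=True)._check_if_input_should_be_written() is False
# A classic modal job: input is written during save()
assert Job(job_with_calculate_function=False)._check_if_input_should_be_written() is True
# Interactive mode also skips writing, by design
assert Job(False, RunMode(interactive=True))._check_if_input_should_be_written() is False
```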
Just as a workaround, can you check if it works by setting:
job._job_with_calculate_function = False
With job._job_with_calculate_function = False
the input and an additional WARNING_pyiron_modified_content file are written.
Does the remote submission work when job._job_with_calculate_function = False
is set?
Yes, the job is submitted and runs.
EDIT:
The job ran and finished on the cluster. However, retrieving it with pr.update_from_remote()
changes its status locally to initialized instead of finished.
As the issue is not part of the Lammps class itself, I am confused why it works with VASP
Ok, an alternative suggestion would be to add the write_input()
call before the remote submission. I tried that in pyiron/pyiron_base#1511 but have not tested it so far.
I do not know yet. We had another bug with how restart files are read pyiron/pyiron_base#1509 but that is still work in progress.
Works with the addition of job.project_hdf5.create_working_directory(). In this case the warning file is not created.
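For reference, the workaround can be sketched end-to-end with a stub job object (StubJob and StubProjectHDF5 are hypothetical stand-ins for the pyiron Lammps job and its HDF5 project handle, not pyiron API): the working directory is created and the input written explicitly before the job is handed to the remote queue, which is exactly what save() currently skips for jobs with a calculate function.

```python
import os
import tempfile

class StubProjectHDF5:
    """Stand-in for job.project_hdf5 with only the method used here."""
    def __init__(self, working_directory):
        self.working_directory = working_directory

    def create_working_directory(self):
        os.makedirs(self.working_directory, exist_ok=True)

class StubJob:
    """Stand-in for a pyiron Lammps job; write_input is a placeholder."""
    def __init__(self, working_directory):
        self.project_hdf5 = StubProjectHDF5(working_directory)
        self.working_directory = working_directory
        self.input_written = False

    def write_input(self):
        # In pyiron this would write the Lammps control and structure files
        with open(os.path.join(self.working_directory, "control.inp"), "w") as f:
            f.write("# lammps input\n")
        self.input_written = True

with tempfile.TemporaryDirectory() as tmp:
    job = StubJob(os.path.join(tmp, "lmp_job_hdf5", "lmp_job"))
    # The workaround: create the working directory and write the input
    # explicitly before submitting the job to the remote queue
    job.project_hdf5.create_working_directory()
    job.write_input()
    assert os.path.isfile(os.path.join(job.working_directory, "control.inp"))
```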
Great, I think that is the best solution until we have https://github.com/pyiron/pympipool ready to handle the remote submission.
Do you have an idea how to fix the issue of potentials that are not part of the resources dataframe?
I would modify the potential dataframe and maybe just attach the potential as a restart file.
@niklassiemer I am closing this issue, feel free to reopen it if the problem comes up again.
Probably wrong ping @Leimeroth