pyiron/pyiron_atomistics

Files are not written for remote jobs

Leimeroth opened this issue · 20 comments

When trying to submit Lammps jobs to a remote cluster, only a .h5 file is created, but no input files or working directory. I guess the necessary call to write_input() went missing somewhere during the restructuring of the run functions.

EDIT:
For VASP it works, so the issue seems to be specific to the Lammps class.

Can you try to call job.validate_ready_to_run() before submitting the job and check if that solves the issue?

job.validate_ready_to_run() does not seem to change the behavior.
Manually doing

os.makedirs(job.working_directory)
job.write_input()

seems to do the job.

For potentials that are manually defined via a dataframe, the write_input_files_from_input_dict functionality breaks the remote setup, because the file path of the potential does not exist on the remote cluster.
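The failure mode can be reproduced in isolation: write_input_files_from_input_dict copies each input file from its recorded source path, and on the remote machine a local-only path simply does not exist. A minimal sketch (the source path here is a hypothetical placeholder, not the actual pyiron call):

```python
import os
import shutil
import tempfile

# Sketch: copying an input file whose source path only exists on the local
# workstation fails on the remote machine with FileNotFoundError, exactly as
# in the traceback below.
working_directory = tempfile.mkdtemp()
source = "/nonexistent/local/path/output.14.mtp"  # hypothetical local-only path

try:
    shutil.copy(source, os.path.join(working_directory, "output.14.mtp"))
    error_name = None
except FileNotFoundError as err:
    error_name = type(err).__name__

print(error_name)  # FileNotFoundError
```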

Traceback (most recent call last):
  File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/site-packages/pyiron_base/cli/__main__.py", line 3, in <module>
    main()
  File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/site-packages/pyiron_base/cli/control.py", line 61, in main
    args.cli(args)
  File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/site-packages/pyiron_base/cli/wrapper.py", line 37, in main
    job_wrapper_function(
  File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/site-packages/pyiron_base/jobs/job/wrapper.py", line 186, in job_wrapper_function
    job.run()
  File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/site-packages/pyiron_base/jobs/job/wrapper.py", line 131, in run
    self.job.run_static()
  File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/site-packages/pyiron_base/jobs/job/generic.py", line 917, in run_static
    execute_job_with_calculate_function(job=self)
  File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/site-packages/pyiron_base/jobs/job/runfunction.py", line 720, in wrapper
    output = func(job)
  File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/site-packages/pyiron_base/jobs/job/runfunction.py", line 978, in execute_job_with_calculate_function
    ) = job.get_calculate_function()(**job.calculate_kwargs)
  File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/site-packages/pyiron_base/jobs/job/runfunction.py", line 135, in __call__
    self.write_input_funct(
  File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/site-packages/pyiron_base/jobs/job/runfunction.py", line 80, in write_input_files_from_input_dict
    shutil.copy(source, os.path.join(working_directory, file_name))
  File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/shutil.py", line 417, in copy
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/shutil.py", line 254, in copyfile
    with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: '/nfshome/leimeroth/MTP/AlCuZr/Fractions2//10/14//output.14.mtp'

Edit: I use https://github.com/pyiron/pyiron_atomistics/tree/workaround-file-copying as a workaround on the remote HPC right now.
As far as I understand, the idea of the new workflow is to copy only the HDF5 file and write all necessary files on the remote machine, is this correct? If yes, I guess it is necessary to somehow make an exception for potentials that are not part of the default data repository. Also, I am somewhat afraid of issues arising from different pyiron versions/branches on the local and remote machines when the files are only written remotely.

bump

Can you be a bit more specific about where the potential file /nfshome/leimeroth/MTP/AlCuZr/Fractions2//10/14//output.14.mtp is located? Is this on the cluster or on the local workstation?

This is the full local path.

Regarding file writing, I guess the problem is

    def _check_if_input_should_be_written(self):
        if self._job_with_calculate_function:
            return False
        else:
            return not (
                self.server.run_mode.interactive
                or self.server.run_mode.interactive_non_modal
            )

always returning False for Lammps, so that

    def save(self):
        """
        Save the object, by writing the content to the HDF5 file and storing an entry in the database.

        Returns:
            (int): Job ID stored in the database
        """
        self.to_hdf()
        if not state.database.database_is_disabled:
            job_id = self.project.db.add_item_dict(self.db_entry())
            self._job_id = job_id
            _write_hdf(
                hdf_filehandle=self.project_hdf5.file_name,
                data=job_id,
                h5_path=self.job_name + "/job_id",
                overwrite="update",
            )
            self.refresh_job_status()
        else:
            job_id = self.job_name
        if self._check_if_input_should_be_written():
            self.project_hdf5.create_working_directory()
            self.write_input()
        self.status.created = True
        print(
            "The job "
            + self.job_name
            + " was saved and received the ID: "
            + str(job_id)
        )
        return job_id

never calls write_input().
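The skipped call can be illustrated with a minimal pure-Python reproduction of the control flow above (class and attribute names loosely mirror pyiron_base, with the server/run_mode nesting simplified; this is a sketch, not the actual implementation):

```python
class FakeRunMode:
    # Simplified stand-in for job.server.run_mode
    interactive = False
    interactive_non_modal = False


class FakeJob:
    def __init__(self, with_calculate_function):
        self._job_with_calculate_function = with_calculate_function
        self.run_mode = FakeRunMode()
        self.input_written = False

    def _check_if_input_should_be_written(self):
        # Mirrors the logic quoted above: calculate-function jobs
        # unconditionally skip input writing at save() time.
        if self._job_with_calculate_function:
            return False
        return not (
            self.run_mode.interactive or self.run_mode.interactive_non_modal
        )

    def save(self):
        if self._check_if_input_should_be_written():
            self.input_written = True  # stands in for write_input()


job = FakeJob(with_calculate_function=True)
job.save()
print(job.input_written)  # False: write_input() is skipped

job2 = FakeJob(with_calculate_function=False)
job2.save()
print(job2.input_written)  # True: input files are written at save() time
```

This shows why flipping job._job_with_calculate_function to False, as suggested below, makes save() write the input files again.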

Just as a workaround, can you check if it works by setting:

job._job_with_calculate_function = False

With job._job_with_calculate_function = False the input and an additional WARNING_pyiron_modified_content file are written.

> With job._job_with_calculate_function = False the input and an additional WARNING_pyiron_modified_content file are written.

Does the remote submission work when job._job_with_calculate_function = False is set?

Yes, the job is submitted and runs.

EDIT:
The job runs and finishes on the cluster. However, retrieving it with pr.update_from_remote() changes its local status to initialized instead of finished.

As the issue is not part of the Lammps class itself, I am confused why it works with VASP.

> Yes, the job is submitted and runs.
>
> EDIT: The job runs and finishes on the cluster. However, retrieving it with pr.update_from_remote() changes its local status to initialized instead of finished.

Ok, an alternative suggestion would be to add the write_input() call before the remote submission. I tried it in pyiron/pyiron_base#1511 but have not tested it so far.

> As the issue is not part of the Lammps class itself, I am confused why it works with VASP.

I do not know yet. We had another bug with how restart files are read pyiron/pyiron_base#1509 but that is still work in progress.

> Yes, the job is submitted and runs.
> EDIT: The job runs and finishes on the cluster. However, retrieving it with pr.update_from_remote() changes its local status to initialized instead of finished.
>
> Ok, an alternative suggestion would be to add the write_input() call before the remote submission. I tried it in pyiron/pyiron_base#1511 but have not tested it so far.

It works with the addition of job.project_hdf5.create_working_directory(). Here the warning file is not created.

> It works with the addition of job.project_hdf5.create_working_directory(). Here the warning file is not created.

Great, I think that is the best solution, until we have https://github.com/pyiron/pympipool ready to handle the remote submission.

Do you have an idea how to fix the issue of potentials that are not part of the resources dataframe?

> Do you have an idea how to fix the issue of potentials that are not part of the resources dataframe?

I would modify the potential dataframe, and maybe just attach the potential as a restart file.
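A hedged sketch of that suggestion: define the custom potential entry yourself and attach the potential file to the job as a restart file so pyiron transfers it to the remote machine. The dict keys follow the pyiron potential-dataframe convention, but the file path, potential name, and pair_style/pair_coeff lines here are hypothetical placeholders, not a verified working configuration:

```python
# Hypothetical custom potential entry; in pyiron this dict would be wrapped
# in a one-row pandas.DataFrame and assigned to the Lammps job's potential.
custom_potential_entry = {
    "Name": "AlCuZr_custom_MTP",                   # hypothetical name
    "Filename": ["/path/to/output.14.mtp"],        # hypothetical local path
    "Model": "Custom",
    "Species": ["Al", "Cu", "Zr"],
    "Config": ["pair_style mlip mlip.ini\n", "pair_coeff * *\n"],
}

# Attaching the file as a restart file so it is copied to the remote
# working directory would look roughly like:
#   job.restart_file_list.append(custom_potential_entry["Filename"][0])
print(custom_potential_entry["Species"])
```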

@niklassiemer I close this issue, feel free to reopen it if the issue comes up again.

Probably wrong ping @Leimeroth