sphuber/aiida-shell

Bug: Default submit script is overridden by `RemoteData`

hujay2019 opened this issue · 8 comments

The error occurs when the following conditions are met:

  1. submit=True. Run a script in a non-blocking way. Then a correct script _aiidasubmit.sh will be written in the working directory.
  2. Add a RemoteData type node as input. If the RemoteData contains the file _aiidasubmit.sh generated by the previous process, this 'wrong' script will override the correct script.

This error can be avoided by defining a different submit script filename (see examples below).
I suggest adding a WARNING to the documentation or changing the default behavior of the `aiida-shell' package to avoid this error.

Error example

results, future = launch_shell_job(
    "date",
    submit=True,
    nodes={"remote_folder": remote_folder},
    metadata={
        "options": {
            "computer": code.computer,
        }
    },
)

Correct example

results, future = launch_shell_job(
    "date",
    submit=True,
    nodes={"remote_folder": remote_folder},
    metadata={
        "options": {
            "computer": code.computer,
            "submit_script_filename": "_aiidasubmit_shell.sh",
        }
    },
)

Thanks for the report and the work-around. The documentation actually does warn about this currently. See the bottom of this section: https://aiida-shell.readthedocs.io/en/latest/howto.html#running-a-shell-command-with-remote-data
If you have any suggestions on how to improve the wording, that would be much appreciated.

I am wondering if instead of recommending this workaround, aiida-shell can do something to prevent this file from being overwritten. Unfortunately, it seems like that might be more tricky than one would think. I am using the CalcInfo.remote_copy_list to have the engine copy over the files to the work directory and those are copied after the contents of the sandbox folder are copied over:

https://github.com/aiidateam/aiida-core/blob/main/aiida/engine/daemon/execmanager.py#L245-L249

The remote_copy_list syntax also does not allow any "exclude" rules, but only paths to include, so we cannot simply instruct the engine to skip _aiidasubmit.sh. Of course we could try to read the contents of the remote and manually add explicit include rules that skip the submit script. However, that requires opening a transport connection to the remote from within the Calcjob.prepare_for_submission method, which is ill-advised, since it can fail and it circumvents the engine's transport connection pooling.

There is actually already an open issue on aiida-core about this problem: aiidateam/aiida-core#6012
I will discuss it with the team whether we can change the order of copying, which would solve this issue, but it is technically backwards-incompatible so I am not sure whether we can get this released soon.

I just had the same issue. Of course one thing is to avoid this to happen altogether, but I think it's a very common use case to just copy a subfolder from a parent remote to a subfolder of the new remote, but this is not possible AFAIK at the moment. People encounter this "bug" or behaviour because it's the first thing you try when you want to try to obtain this goal. Enabling this option already would have people try directly the subdirectory copy rather than a full copy. One could also try to warn, in a new version, if you are copying a RemoteData and don't specify a subfolder? or ask to ignore any files with the "submit_script_filename" name?

In terms of functionality for the remote data, I would suggest allowing to optionally specify a subdirectory or list of files/subdirs to copy from, default being . (with care between copying the content of a folder vs. the folder itself, i.e. with examples for the two cases), and a destination subdir (default . if there is a default mechanism to avoid clashes of stdout/stderr/status/submit script, otherwise the default could be to put in a subfolder? Still, it's still good to think of a mechanism to avoid copying at least those files). There should be a simple example to copy subfolder mydir to a subfolder with the same name in the new job working directory, as I think it is a potentially common use case.

Thanks for the feedback @giovannipizzi . Support for RemoteData nodes as input was only added in the last version, but I clearly didn't think/test it all the way through. I wanted to keep it as simple as possible, which is why I chose for just copying the contents wholesale. I thought about alternatives, but I was afraid that I'd end up with a syntax similar to the remote_copy_list syntax in aiida-core which is not very intuitive.

I just had the same issue. Of course one thing is to avoid this to happen altogether, but I think it's a very common use case to just copy a subfolder from a parent remote to a subfolder of the new remote, but this is not possible AFAIK at the moment.

It might be a use-case to copy a subfolder, but there will be lots of variants: multiple folders/files, different target paths, exclude certain files, etc. I think we should either find a syntax that allows essentially everything in a straightforward way, or it will become a complicated mess of various exceptions and rules that are tacked on as we find new use-cases.

People encounter this "bug" or behaviour because it's the first thing you try when you want to try to obtain this goal. Enabling this option already would have people try directly the subdirectory copy rather than a full copy. One could also try to warn, in a new version, if you are copying a RemoteData and don't specify a subfolder?

I am not so worried about "breaking" compatibility, since I only just added support for remote data very recently and it is clearly broken. If we can find an approach that works, we can just go ahead and replace current implementation entirely.

or ask to ignore any files with the "submit_script_filename" name?

This doesn't work unfortunately because, as I mentioned in the post above, the remote_copy_list syntax of aiida-core currently doesn't allow to have "exclude" rules. You can only have positive matching rules. In addition, the submit script is not the only problem, there are also the stdout and stderr files.

Note that the real immediate problem is that the remote_copy_list is handled by the engine after the conents of the sandbox are copied. I have opened a PR on aiida-core (aiidateam/aiida-core#6285) to make that order configurable. This "fixes" the immediate problem as in that now the submit script of the remote folder no longer overwrites the script of the current job. However, when using symlinks, this will now essentially overwrite the contents of the old directory with the output from the new command. This might cause problems where a RemoteFolder needs to be reused multiple times consecutively. Not using symlinks would be a workaround, but clearly this is not ideal.

Ultimately, I am not sure what the best approach is here. I really want to avoid coming up with a similar syntax as remote_copy_list in aiida-core.

Ultimately, I am not sure what the best approach is here. I really want to avoid coming up with a similar syntax as remote_copy_list in aiida-core.

But in the end maybe it's better to have a simpler version that works in most cases (e.g. copy everything? but there are the issues described above), and still expos the full capability for those who need it? I still think anyway that copying only some files (directly on the remote) is a very important requirement for most usecases.

I still think anyway that copying only some files (directly on the remote) is a very important requirement for most usecases.

I agree.

But in the end maybe it's better to have a simpler version that works in most cases (e.g. copy everything? but there are the issues described above)

That is what is currently implemented. This PR would fix those immediate issues and make it usable in most cases, but it requires a change in aiida-core. See this PR aiidateam/aiida-core#6285 So it would be really useful if you can comment on that.

and still expos the full capability for those who need it?

You mean directly allow users to define remote_copy_list? I guess we could do that. But I think we should have a way to exclude files then as well when doing globbing, which is currently not supported. Maybe we should add this to aiida-core first?

Yes, probably putting this feature in aiida core first is the right thing to do, it's a useful use case

@hujay2019 the fix for this is now released in v0.7.2 Thanks again for reporting