core-unit-bioinformatics/cubi-tools

Error while running update_metadata.py with dry-run option

Closed this issue · 6 comments

This command:
./update_metadata.py --project-dir ~/path_to_my_directory/ -d

It errors out with:

Traceback (most recent call last):
  File "/home/mirela/Documents/Cubi_tools/cubi-tools/cubi-tools/prototypes/./update_metadata.py", line 565, in <module>
    main()
  File "/home/mirela/Documents/Cubi_tools/cubi-tools/cubi-tools/prototypes/./update_metadata.py", line 60, in main
    clone(workflow_dir, project_dir, ref_repo, source, metadata_dir, dryrun)
  File "/home/mirela/Documents/Cubi_tools/cubi-tools/cubi-tools/prototypes/./update_metadata.py", line 181, in clone
    raise NameError(
NameError: The 'template-metadata-files' repo needs to be present in the parental folder of the project directory /home/mirela/Documents/2024.04.04_P6_Cisplatin_Hoffmann/project-run-cisplatin.
In a live run the 'template-metadata-files' repo would be created at /home/mirela/Documents/2024.04.04_P6_Cisplatin_Hoffmann/template-metadata-files.

Without dry-run, the command is executed correctly.

This is actually the desired behavior.
My/our assumption was that you keep all your live project folders in one general location, together with the CUBI templates. With this approach you only have to download the (updated) templates once and can use them to update all projects.
This is not mandatory, but otherwise the folder 'template-metadata-files' will always be created/cloned parallel to the project you want to update.
In a dry run the script does not download, copy, or modify any files (hence dry run). Since it can't find the folder 'template-metadata-files' in '/home/mirela/Documents/2024.04.04_P6_Cisplatin_Hoffmann', it throws this error. However, the end of the error message tells you that "In a live run the 'template-metadata-files' repo would be created at /home/mirela/Documents/2024.04.04_P6_Cisplatin_Hoffmann/template-metadata-files." That is apparently what happened when you ran the command without the --dry-run option.
If you now run the command again with the --dry-run option, you shouldn't get an error message, since the folder 'template-metadata-files' is present. The on-screen message should be "The requested branch/version tag (default: main) is present and is getting updated via 'git pull -all'"

The situation is detectable (it's a dry run and the metadata repo is not found), so it should result in a more readable/comprehensible error message:

  1. Do you have the metadata repo checked out locally? If yes, please specify its location via the --ref-repo parameter.
  2. If not, please either clone it next to the project folder (default location for auto-detection) or run the script w/o dry run.
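
A minimal sketch of how such a check could look. This is not the code in update_metadata.py: the helper name `locate_metadata_repo`, its signature, the exception type, and the message wording are assumptions for illustration, and `ref_repo` is treated here as an optional local path.

```python
from pathlib import Path


def locate_metadata_repo(project_dir, ref_repo=None, dryrun=False):
    """Return the expected location of the local metadata repo.

    Hypothetical helper: names and message text are illustrative only.
    """
    project_dir = Path(project_dir).resolve()
    # Default location for auto-detection: next to the project folder.
    default_location = project_dir.parent / "template-metadata-files"
    candidate = Path(ref_repo).expanduser() if ref_repo else default_location

    if candidate.is_dir():
        return candidate

    if dryrun:
        # A dry run must not clone anything, so explain the options instead.
        raise FileNotFoundError(
            f"The 'template-metadata-files' repo was not found at {candidate}.\n"
            "1. If you have it checked out locally, specify its location "
            "via --ref-repo.\n"
            f"2. Otherwise, clone it next to the project folder "
            f"({default_location}) or run the script without dry run."
        )

    # Live run: the caller would clone the repo to this location.
    return candidate
```

In a live run the function only reports where the repo would be placed; cloning is left to the caller.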

In case users prefer to keep the template folder(s) in a specific location and use that for the update, I added a new option called "local-ref".

Now the behavior of the script is as follows:
If no "local-ref" path is provided (not mandatory, default=False), the 'template-metadata-files' repo will be created parallel to the working directory (as before).
If it's a dry run and the folder doesn't exist, the error message looks like this:

The location of 'template-metadata-files' repo either 
- needs to be parallel to the project directory {working_dir} or 
- provided via the option --local-ref.
In a live run the 'template-metadata-files' repo would be created at {working_dir}/template-metadata-files.

If a "local-ref" path is provided and this folder exists, the user is informed that they selected the local folder {local_ref} and is asked whether they want to update it to the selected {branch} version. If the user says no, the script stops with the message "You selected to stop the update process!". Otherwise the local folder is updated (via git pull --all, if it's a git repository) and the script continues as before.
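
The flow described above could be sketched roughly as follows. The function name `update_local_ref`, the prompt wording, and the injectable `confirm`/`run` parameters are assumptions made for illustration and testing; only the stop message and the `git pull --all` step come from the description. Switching to the selected branch is left out of this sketch.

```python
import subprocess
from pathlib import Path


def update_local_ref(local_ref, branch="main", confirm=input, run=subprocess.run):
    """Ask for confirmation, then update a local template folder.

    Hypothetical sketch; not the actual implementation in the script.
    """
    local_ref = Path(local_ref)
    if not local_ref.is_dir():
        raise FileNotFoundError(f"Local reference folder not found: {local_ref}")

    answer = confirm(
        f"You selected the local folder {local_ref}. "
        f"Update it to the '{branch}' version? [y/n] "
    )
    if answer.strip().lower() not in ("y", "yes"):
        # Stop the whole script with the message quoted in the description.
        raise SystemExit("You selected to stop the update process!")

    # Only pull if the folder is a git repository.
    if (local_ref / ".git").exists():
        run(["git", "pull", "--all"], cwd=local_ref, check=True)
    return local_ref
```

Passing `confirm` and `run` as parameters keeps the sketch testable without a terminal or a real git checkout.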

I do not understand why this change/fix required introducing yet another command line parameter. Conceptually, what is the difference between --ref-repo and --local-ref?

Conceptually, there is a difference between --ref-repo and --local-ref in the current script, but we can discuss whether it's worth having the additional command line parameter --local-ref.
--ref-repo is the web address of the remote repository used to clone/update files from. Initially we assumed that users have all their project repositories in one location and that we would have/create the template-metadata-files or template-snakemake folder parallel to these project repositories. When you run the update scripts, the template folder is either created or updated with the latest version of the desired branch/tag (main is the default), and then the target project repository is updated with those files.
--local-ref is the path to a local folder that contains the metadata files; it can be wherever the user wants. If provided, this folder is first updated (via git pull --all) with the desired branch/tag and then used to update the target project repository.
If we do it the way you suggested, by specifying the local location via the --ref-repo parameter, we have to rely on the user to actively check out the latest metadata repo version and the desired branch/tag before starting the update script, which I think is risky. The script would then not be able to update a local template-metadata folder that is not parallel to the other project repos.
The question now is which of two options we want. Either we let users keep a local template-metadata folder wherever they like, in which case we need the --local-ref parameter to ensure that the latest version is used for the update. Or we insist that the template-metadata folder is located/cloned parallel to the target project repo, in which case we can drop the --local-ref parameter.

I reverted all changes that were made in the last week, made the error message more readable/comprehensible, and added checks to allow a local repo to be used as the template via --ref-repo.
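
One way such a check could distinguish a remote address from a local repo, shown as a sketch only. The helper `classify_ref_repo`, the list of accepted URL schemes, and the error text are assumptions and do not reflect the checks actually added to the script.

```python
from pathlib import Path
from urllib.parse import urlparse


def classify_ref_repo(ref_repo):
    """Return ("remote", url) or ("local", path) for a --ref-repo value.

    Hypothetical helper for illustration.
    """
    # SSH shorthand such as git@host:org/repo.git has no URL scheme.
    if ref_repo.startswith("git@"):
        return "remote", ref_repo
    if urlparse(ref_repo).scheme in ("http", "https", "ssh", "git"):
        return "remote", ref_repo

    # Anything else is treated as a path to a local checkout.
    local_path = Path(ref_repo).expanduser()
    if not local_path.is_dir():
        raise FileNotFoundError(
            f"--ref-repo is neither a supported URL nor an existing folder: {ref_repo}"
        )
    return "local", local_path
```

With a classification like this, a single --ref-repo parameter can cover both cases without a separate --local-ref option.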