Subworkflows can only use modules present in the same repo
awgymer opened this issue · 8 comments
Description of feature
This is a current (possibly permanent) limitation of subworkflows. It means you cannot define a subworkflow in an nf-core-structured repo which uses nf-core modules directly.
Allowing this would mean greater complexity for updating and installing.
This should be clearly documented.
My team is currently adopting nf-core tools and we've noticed this limitation. I'm interested in working on adding support for 'hybrid' subworkflows. Any guidance on how to begin would be helpful.
This is quite a thorny problem and right now there is no proper solution, I'm afraid. You could mirror the GitHub modules repo and add your own subworkflows and modules to it, but that has its own wrinkles.
I hope we can find a better solution eventually, but as an open-source project, supporting split open-source/in-house work is probably not a priority issue.
I understand... I'll share our solution or workaround as soon as we find one that we are happy with. Thank you
Hi @awgymer and @mberacochea
Thanks for the hints. Here is what I have settled on for now:
- Inside the organisation (XYZ) repo, create an `nf-core-modules` directory. Do:

```shell
cd nf-core-modules
touch main.nf
touch nextflow.config
cat <<-EOF > .nf-core.yml
repository_type: pipeline
EOF
```
- The `nf-core-modules` directory will behave as a pipeline, and the nf-core modules can be installed with version control using nf-core tools.
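A concrete install step might then look like this (`gunzip` is just an example module; this assumes nf-core tools is installed and is run from inside the `nf-core-modules` directory):

```shell
cd nf-core-modules
nf-core modules install gunzip
# the module lands under ./modules/nf-core/ and is pinned in ./modules.json
```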
- Inside the organisation (XYZ) repo, create an `nf-core-hybridisation.sh` script to keep track of hybrid modules. Example:

```shell
#!/usr/bin/env bash
cp -r ./nf-core-modules/modules/nf-core/gunzip ./modules/nf-core/  # needed for hybrid testing
mkdir -p ./modules/XYZ/cat
cp -r ./nf-core-modules/modules/nf-core/cat/cat ./modules/XYZ/cat  # needed for a hybrid sub-workflow
```
This way the hybridisation can be version controlled. I am not sure it will work in every situation. Looking forward to your thoughts.
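One wrinkle with plain `cp -r` is that it fails quietly (or copies nothing useful) when a module has not been installed yet. A small helper could make the sync fail loudly instead; this is only a sketch, and `sync_module` is a hypothetical name with an assumed directory layout:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Copy an installed nf-core module into its hybrid location,
# failing loudly if the source module has not been installed yet.
sync_module() {
    local src="$1" dst="$2"
    if [ ! -d "$src" ]; then
        echo "ERROR: $src not found - install it with nf-core tools first" >&2
        return 1
    fi
    mkdir -p "$(dirname "$dst")"
    cp -r "$src" "$dst"
}

# Example usage (same paths as the script above):
# sync_module ./nf-core-modules/modules/nf-core/gunzip ./modules/nf-core/gunzip
```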
If I understand this correctly you are basically using a "pipeline repo" to mirror modules into your remote and then syncing them with bash then?
This is a little like an idea that has been raised here which would see subworkflows package their modules alongside themselves.
I've only thought about it a little bit, but the idea in my head would be to create a 3rd "repository_type" of "subworkflow". This would mostly behave like a "pipeline" but with a few differences (some assumptions about pipeline repos wouldn't be quite the same).
The tooling could then be refactored to basically do a recursive pass of "subworkflows" updating/installing modules within (or perhaps they should be frozen I'm not sure).
> If I understand this correctly you are basically using a "pipeline repo" to mirror modules into your remote and then syncing them with bash then?
Yes, that's true. Essentially I am creating two copies in the same repo. Not ideal. But it is explicit and allows me to use nf-core tools to stay up to date with nf-core/modules. For me, it is really a temporary solution as I intend to eventually contribute all the local org modules and sub-workflows to nf-core/modules.
> This is a little like an idea that has been raised here which would see subworkflows package their modules alongside themselves.
> I've only thought about it a little bit, but the idea in my head would be to create a 3rd "repository_type" of "subworkflow". This would mostly behave like a "pipeline" but with a few differences (some assumptions about pipeline repos wouldn't be quite the same).
> The tooling could then be refactored to basically do a recursive pass of "subworkflows" updating/installing modules within (or perhaps they should be frozen I'm not sure).
Yes, I like the idea of freezing modules inside sub-workflows. When a sub-workflow is downloaded by a pipeline developer, nf-core tools could generate a warning saying that the sub-workflow's modules are outdated. The developer can then choose to keep using the outdated modules, or create a sub-workflow update pull request which goes through the nf-test GitHub Actions along with community review. Does this also prevent the sub-workflow malfunctioning due to breaking module updates? Or is that already taken care of by some other mechanism?
We could also have the ability to provide multiple `--git-remote` options on the CLI and have some sort of fallback mechanism as to where the appropriate components are sourced? I don't know how the dependencies between modules and subworkflows are currently tracked in tools, because this would need to be mirrored in `modules.json` somehow.
For example, `--git-remote <MYGITHUB_REPO> --git-remote <NF_CORE_MODULES_REPO>`. The tricky thing will be deciding which one takes precedence if you have the same modules in both of these repos, especially if you have more than two `--git-remote` options.
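On the `modules.json` point: if I recall correctly, recent versions already key entries by remote URL, so multiple remotes could in principle be recorded side by side. A rough sketch of what that might look like (the exact field names beyond `branch`/`git_sha` are assumptions, and `XYZ/mytool` is a made-up module):

```json
{
  "name": "XYZ/pipeline",
  "repos": {
    "https://github.com/nf-core/modules.git": {
      "modules": { "nf-core": { "gunzip": { "branch": "master", "git_sha": "<sha>" } } }
    },
    "https://github.com/XYZ/modules.git": {
      "modules": { "XYZ": { "mytool": { "branch": "main", "git_sha": "<sha>" } } }
    }
  }
}
```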
Blasting some ideas out there. What do you think @mashehu @mirpedrol ?
Thank you @drpatelh .
To give some perspective on my case: I developed a subworkflow that uses internal (our nf-core/modules-like repo) and external (public nf-core/modules) modules. When I try to install this subworkflow with `nf-core install --git-remote <internal nf-core modules URL> ...`, nf-core tools can't find the modules.
What I would suggest is something like pip's `--extra-index-url` (https://pip.pypa.io/en/stable/cli/pip_install/#cmdoption-extra-index-url). I would add an `--extra-git-url` option or something like it, where the extra remote supplies whatever is not found in the `--git-remote`. This way, the `--git-remote` would take precedence over the `--extra-git-url`.
This way, we can still use public modules and subworkflows, keep up to date with new releases with occasional local patches, and avoid internalising modules that we have no intention of modifying heavily.
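To make the suggestion concrete, the invocation could look something like the following. Note this is entirely hypothetical: neither `--extra-git-url` nor this fallback behaviour exists in nf-core tools today, and the repo URLs are placeholders.

```shell
# Hypothetical: resolve components from the internal remote first,
# then fall back to public nf-core/modules for anything missing.
nf-core subworkflows install mysubworkflow \
    --git-remote https://github.com/XYZ/internal-modules \
    --extra-git-url https://github.com/nf-core/modules
```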