Galaxy Support?
jmchilton opened this issue · 6 comments
Is there a way to do this automatically from a Galaxy tool - I'm not seeing that. I thought that was the goal? Is that something I can help with - I'd like to use galaxy-lib for parsing and finding tool XML files - the abstractions will mean other tool formats such as CWL will then work automatically as well.
What do you mean by "do this"?
mulled is a demonstration project. Our aim is to do this in a completely automated way - and with the https://github.com/mulled/auto-mulled project we are trying a few things out. If you want to help, there are ideas at https://github.com/mulled/auto-mulled/issues
Another idea would be to integrate it directly into bioconda or conda-forge. But I'm hesitant because we would also like to put tarballs on depot - sharing credentials ... and who knows which other conda channels want mulled support.
Regarding Galaxy integration I'm not sure what you mean. My plan is - and I would love to get help here - to use the quay.io API to automatically retrieve the correct mulled container from a given (name, version) tuple. This way no extra `container` attribute is needed inside the `requirement` tag.
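A minimal sketch of what that lookup could look like, assuming the quay.io v1 repository endpoint and a `namespace/package:version` naming convention (both of which are assumptions here, not settled decisions):

```python
# Hedged sketch: resolve a mulled container on quay.io for a (name, version)
# requirement tuple. The API endpoint and naming convention are assumptions.
import requests

QUAY_API = "https://quay.io/api/v1/repository"

def find_mulled_container(name, version, namespace="mulled"):
    """Return e.g. 'quay.io/mulled/samtools:1.3', or None if nothing matches."""
    resp = requests.get("%s/%s/%s" % (QUAY_API, namespace, name), timeout=10)
    if resp.status_code != 200:
        return None  # no repository for this package name
    tags = resp.json().get("tags", {})
    if version in tags:
        return "quay.io/%s/%s:%s" % (namespace, name, version)
    return None

# e.g. for <requirement type="package" version="1.3">samtools</requirement>
print(find_mulled_container("samtools", "1.3"))
```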
Does this answer your question?
Firstly - I definitely want to have dynamically resolved `container` tags also.
So my idea would be (and I guess I thought mulled did this), roughly as sketched in code below:
- For every tool X in repository Y:
  - Check the dependencies of X; if a container has been pushed to quay.io (or wherever) for these dependencies, continue.
  - Else, build and push a container for these dependencies.
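To make that loop concrete, here is a rough Python sketch; `parse_requirements`, `container_exists`, and `build_and_push` are hypothetical placeholders rather than existing galaxy-lib or mulled APIs:

```python
# Hypothetical sketch of the per-tool loop described above; the helper
# functions are placeholders, not real galaxy-lib or mulled APIs.
def mull_repository(tool_sources):
    # For every tool X in repository Y ...
    for tool_source in tool_sources:
        requirements = parse_requirements(tool_source)
        # ... if a container has already been pushed for these dependencies, skip it ...
        if container_exists(requirements):
            continue
        # ... else build and push a container for these dependencies.
        build_and_push(requirements)
```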
You have so far described dependency-centric containers. I'd like to go one step further and derive these from tool descriptions. I'm all about artifact-centric development with planemo and such.
Are you on board for this in a general way? I'll take a look at auto-mulled but if you have any advice on how to implement this specifically that would be great.
In terms of less important implementation details, I'd like to use the abstractions in galaxy-lib to find "tools" - so that we are using abstract tool definitions that could work for CWL tools or future changes to Galaxy's definitions of tools.
I would ideally also like to do this in planemo if possible, to keep developer tooling in one place - so `planemo mull <tool_source>` to build a container locally and `planemo mull --publish <target=quay.io> <tool_source>` to publish. Then testing could be done with `planemo test --mulled_dependency_resolution <tool_source>`, which could build and use the local mulled container for testing during development.
Once this is done, I'd also like to provide some CWL goodies to try to get the GA4GH to adopt our abstractions for dependency resolution and tooling - thus increasing our "surface for collaboration":
- For every tool X in repository Y:
  - Check the dependencies of X; if a container has been pushed to quay.io (or wherever) for these dependencies, continue.
  - Else, build and push a container for these dependencies.
  - If tool.type == "cwl", publish to Dockstore.
Okay so @bgruening and I discussed this in depth out of channel - there are a couple of fundamental differences we are still working through. Here is a summary of my current thinking based on our conversation:
The basic confusion that prompted me to create this issue is that I think of "fat" tools - tools with multiple dependencies - as the default Galaxy use case, and @bgruening was hoping these are more of a special case (the truth, I think, is somewhere in the middle). This is why I thought tool-centric support was needed and he thought dependency-centric support was sufficient.
@bgruening proposed meta-packages in an IUC channel to reduce these to one dependency. I countered that the `requirement` tags need to be abstract and should not so heavily depend on conda-specific details (an opinion that I am certain @jxtx would share).
The conversation then moved on to discuss hashing the requirements consistently to generate fixed Docker image names for a set of requirements. I thought this could be done uniformly, but @bgruening countered that if `n==1` (where n is the number of packages) the human readability of the currently generated mulled names is great. I think we both agreed that, though ugly, the compromise of hashing names if `n>1` and using the current scheme for `n==1` is workable and the way to go.
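As a sketch only - the exact normalisation and hash function are still open - the compromise could look roughly like this:

```python
# Hedged sketch of the naming compromise: keep a readable "name:version" image
# name when a tool has a single requirement, fall back to a hash of the
# normalised requirement list when it has more than one. Details are assumptions.
import hashlib

def mulled_image_name(requirements):
    """requirements: list of (name, version) tuples parsed from requirement tags."""
    normalised = sorted("%s=%s" % (name, version) for name, version in requirements)
    if len(normalised) == 1:  # n == 1: keep the human-readable scheme
        name, version = requirements[0]
        return "%s:%s" % (name, version)
    digest = hashlib.sha1("\n".join(normalised).encode("utf-8")).hexdigest()
    return "mulled-hash-%s" % digest  # n > 1: hashed name

print(mulled_image_name([("samtools", "1.3")]))                     # samtools:1.3
print(mulled_image_name([("samtools", "1.3"), ("bwa", "0.7.15")]))  # mulled-hash-...
```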
The conversation then moved to having Travis do everything through the mulled project vs. being able to build and test containers locally (which in turn got into issues of namespaces and such). I think we both agree that "strongly encouraging" the use of community projects and free resources like mulled, quay.io, and Travis is important - but I think I alone think it is fundamental that these resources need to be locally buildable and testable before publication/push/PR/etc.... This is the planemo philosophy - everything must be buildable and testable in some way locally before publication - and this is what separates it from the old build and test framework of the Galaxy Tool Shed.
We can keep hashing this out - but this might be an example of where it is good to have a separation between the tool framework developer (me) and the IUC. I can provide the tooling I need to feel comfortable that I'm not locking people into free services that may disappear and that I'm providing the kind of reproducibility I feel is important to developers - and he can establish best practices that encourage certain namespaces, certain projects, etc.
I do think that @bgruening and I are in agreement, though, that if Galaxy has Docker enabled and there is no explicit container defined for a tool, Galaxy should be able to search for mulled containers at a list of prefixes: `docker_namespace_paths=quay.io/mulled/,quay.io/biodocker/`.
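Building on the hypothetical `find_mulled_container` helper sketched earlier, such a namespace search could look roughly like this (the `docker_namespace_paths` option itself is a proposal, not an existing Galaxy setting):

```python
# Hypothetical sketch: walk a configured list of namespace prefixes and return
# the first matching mulled container; reuses the find_mulled_container()
# helper sketched earlier in this thread.
DEFAULT_NAMESPACE_PATHS = ("quay.io/mulled/", "quay.io/biodocker/")

def resolve_container(name, version, docker_namespace_paths=DEFAULT_NAMESPACE_PATHS):
    for prefix in docker_namespace_paths:
        namespace = prefix.rstrip("/").split("/")[-1]  # e.g. "mulled", "biodocker"
        container = find_mulled_container(name, version, namespace=namespace)
        if container is not None:
            return container
    return None  # fall back to other dependency resolvers
```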
So I see the current TODO list as:
- Update the mulled backend to allow building locally.
- Update the mulled backend to allow hashing multiple packages if present.
- Update galaxy-lib to do Docker namespace searches based on published mulled containers.
- Implement `docker_namespace_paths` in Galaxy using galaxy-lib.
- Implement `planemo mull <tool_sources>` and `planemo mull --publish <target> <tool_sources>` in planemo.
- Implement `planemo test --mulled_dependency_resolution <tool_sources>` in planemo.
- Add an option to use mulled containers via galaxy-lib in cwltool to resolve `SoftwareRequirement`s - added to the 1.0 CWL spec.
Just a few corrections and notes from my side:
> @bgruening proposed meta-packages in an IUC channel to reduce these to one dependency. I countered that the `requirement` tags need to be abstract and should not so heavily depend on conda-specific details (an opinion that I am certain @jxtx would share).
I do think this works very well for conda, but as I said I'm not a fan of this solution because a tool should be annotated (with dependencies) in as much detail as possible, and conda meta-packages hide this complexity. So for Galaxy we should have all requirements annotated on their own if possible.
> but I think I alone think it is fundamental that these resources need to be locally buildable and testable before publication/push/PR/etc....
I do think we should be able to build containers locally and test everything locally, but I would discourage private Docker namespaces or private conda channels as much as possible. If this means no planemo support for pushing to arbitrary conda/Docker repos, then I'm in favour of this.
To be clearer here, GitHub and Travis are just a convenient solution to the authentication problems with these shared community repositories.
Whatever mulled is doing can be done locally, today. But in the end you have the problem of deploying the container to quay.io/Docker Hub into one (or preferably a small number of) community namespaces. This is where GitHub/Travis comes into play.
> The conversation then moved on to discuss hashing the requirements consistently to generate fixed Docker image names for a set of requirements. I thought this could be done uniformly, but @bgruening countered that if `n==1` (where n is the number of packages) the human readability of the currently generated mulled names is great. I think we both agreed that, though ugly, the compromise of hashing names if `n>1` and using the current scheme for `n==1` is workable and the way to go.
I like to think that mulled containers can actually be used by humans, typing the container name from scratch. So yes, I would like to keep the names as they are if only one (conda) package is wrapped. This also makes it very convenient for the conda community to exchange conda and Docker dependencies, for example for snakemake.
With hashing we meant something like normalising a `requirements.txt` file and hashing the content to get a unique container name. This needs to be done on the mulled side (creating the container) but also on the Galaxy side (resolving the dependencies out of the requirement tags).
If the `requirements.txt` contains strict versioning, every container will only have one revision, because a change in the version of any one requirement will change the container name. This needs to be discussed - also with the foresight that I'm not sure we can host millions of repositories on quay.io/Docker Hub.
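For example, reusing the hypothetical `mulled_image_name` sketch from earlier in the thread, bumping a single pinned version yields a different hashed name:

```python
# Worked example: with strict version pins, changing any one requirement
# changes the hashed image name, so each name maps to exactly one build.
print(mulled_image_name([("samtools", "1.3"), ("bwa", "0.7.15")]))
print(mulled_image_name([("samtools", "1.3"), ("bwa", "0.7.16")]))  # different hash
```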
I would like to add one additional note about the philosophy of mulled, or the entire layer-donning concept. One reason we have chosen this concept is that it shifts the level of trust back to the package manager (conda, brew, apk ...). Once the mulled code is trusted, we simply take what conda is offering and we trust the conda community. This comes with the additional benefit that a Docker container has the same content as the conda package, because it is built from the same recipe. Dockerfiles, on the other hand, are not the same as the conda recipes, and you need to trust the author of the Dockerfile. This means the Galaxy Docker backend and conda backend can yield different results.
If we now allow people to push images to arbitrary repositories, or the other way around allow Galaxy to query arbitrary namespaces on quay.io, it is not guaranteed that these images were actually built with mulled, hence we need to trust the authors of these images. This might sound like the same problem as with any other package manager, but in the case of Docker it's worse. There are a lot of people who keep saying that Docker containers are a black box and it is nearly impossible to guarantee that the content of a container is what it should be. Mulled solves this problem, I think, but only if the build chain can be trusted, i.e. the final build should happen in a common build environment that can be trusted, not locally.
I'm not saying that testing should not happen locally - it should! I want to make the point that Docker images are kind of unique, and the build process is as important as the code and should be as transparent as the code.
Does this make sense, John?
I will close this now, as we have great support in Galaxy nowadays with various deployments.