New item: Use of Docker images
jgriss opened this issue · 8 comments
Category: Workflow Software (or new section?)
Name: If containers are used in the analysis, they should be referenced through stable version numbers
Category: "bronze"
Description: If containers, such as Docker or Singularity containers, are used in the analysis
they should be referenced through stable version numbers. This explicitly relates to not
using the ":latest" tag for Docker images as these are bound to change upon new releases
of the software.
Fields: "all"
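To make the bronze item concrete, here is a minimal sketch of how an image reference could be checked for a stable version. The function name `is_pinned` and the example image names are hypothetical and only for illustration; the check assumes the usual `registry/namespace/name:tag` or `name@digest` layout of Docker image references.

```python
def is_pinned(image: str) -> bool:
    """Return True if the reference has a digest or an explicit tag other than 'latest'."""
    # A digest reference (name@sha256:...) always points to one fixed image.
    if "@" in image:
        return True
    # Only the last path component can carry a tag; an earlier colon
    # may belong to a registry port such as localhost:5000.
    last = image.rsplit("/", 1)[-1]
    if ":" not in last:
        return False  # no tag given, so Docker falls back to ":latest"
    tag = last.split(":", 1)[1]
    return tag not in ("", "latest")


# Hypothetical image names, for illustration only.
for ref in ["mylab/mytool", "mylab/mytool:latest", "mylab/mytool:1.2.3"]:
    print(ref, is_pinned(ref))
```

A check like this only confirms that a version is stated; it cannot tell whether the tag was later overwritten in the registry.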
Name: If containers are used in the analysis, these should be available in a public repository
Category: "silver"
Description: If containers, such as Docker or Singularity containers, are used in the analysis
they should be available through a public repository, such as Docker Hub.
Fields: "all"
Reason
The use of containers is becoming increasingly common and is increasingly supported by workflow systems, such as Nextflow.
I think it is better to reference a community standard here, such as BioContainers (biocontainers.pro), for storing the containers. BioContainers defines guidelines for creating containers and also provides an architecture to deploy and find bioinformatics containers.
In the same way that we suggested the data should be in a ProteomeXchange repository, the container should be in BioContainers.
This is a very good point! But I guess we should still have an item that says "if you use containers, reference them following these guidelines"?
I agree with @jgriss that the version number should be made explicit, rather than ":latest".
Storage in BioContainers should probably be silver/gold level, whereas just having the container available somewhere should suffice as well.
Maybe something like this:
- Bronze: container file publicly available on a third-party resource, can also just be on GitHub
- Silver: container uploaded to an official image registry, e.g. Docker Hub, ...
- Gold: container available via BioContainers
> I agree with @jgriss that the version number should be made explicit, rather than ":latest".

The BioContainers guidelines have not allowed the ":latest" tag for two years now. All containers must carry a proper version.
> Storage in BioContainers should probably be silver/gold level, whereas just having the container available somewhere should suffice as well.

Having the container only in your own namespace will create more issues: if the namespace disappears, what can you do with the version of the Dockerfile?
> Maybe something like this:
> - Bronze: container file publicly available on a third-party resource, can also just be on GitHub
> - Silver: container uploaded to an official image registry, i.e. Docker Hub, ...
> - Gold: container available via BioContainers

I think we remove a lot of complexity by saying that if a container is used, it should be deposited in BioContainers. Done.
@ypriverol I agree that we remove complexity but we also exclude quite a few use-cases.
BioContainers are great for packaging single tools. What if a group uses one container to run their whole workflow (an example that the Nextflow guides show quite a lot) and also puts their custom scripts into that container? The container would never be suitable for BioContainers.
The same is, by the way, also true for IsoProt.
Therefore, I prefer @bittremieux's suggestion.
> @ypriverol I agree that we remove complexity but we also exclude quite a few use-cases.
> Biocontainers are great for packaging single tools. What if a group uses one container to run their whole workflow (this is an example that the nextflow guides showed quite a lot) and also put their custom scripts into that container? The container would never be suitable for biocontainers.

The container is fine for BioContainers as long as it follows the guidelines: version, description, title, etc. You can put the custom scripts in the container or in a conda package. Multi-tool containers have also been supported for a year now.
> The same is btw. also true for IsoProt.

This is probably an example of why we need to put the container into BioContainers. My point is: if the container is in your namespace and you delete your namespace, the container will be gone, even if you have the version.
If we are creating guidelines for reproducible research, and the bioinformatics community already has two communities, conda and BioContainers, that provide guidelines on how to build and deploy containers, what is the point of avoiding them? You can have your own container during the development process. However, when you are preparing your publication, your containers should be released and properly annotated using the BioContainers guidelines and namespace.
This is like saying that proteomics data can be made public on a university FTP server, when we have made a lot of progress with ProteomeXchange.

> Therefore, I prefer @bittremieux suggestion
Hi guys,
Based on this discussion and an offline one with @ypriverol I updated my proposal. Since we are the first guideline to target Docker containers as well, I believe that this should be highlighted.
Category: Containers
Name: If containers are used in the analysis, they should be referenced following, for example, the BioContainers guidelines
Category: "bronze"
Description: If containers, such as Docker or Singularity containers, are used in the analysis
they should be referenced through stable version numbers. This explicitly relates to not
using the ":latest" tag for Docker images as these are bound to change upon new releases
of the software. Detailed suggestions can be found in the BioContainers documentation.
Fields: "all"
Name: If containers are used in the analysis, these should be available in a public repository using non-personal namespaces
Category: "silver"
Description: If containers, such as Docker or Singularity containers, are used in the analysis
they should be available through a public repository, such as Docker Hub. The namespace used to make this image publicly available should not be a "private", user-based namespace but some kind of institutional namespace where long-term availability is ensured. This addresses the risk that, if private namespaces are used and the person changes careers, the namespace might be deleted and the images thus lost.
Fields: "all"
Name: Containers should be available in dedicated repositories such as BioContainers
Category: "gold"
Description: Dedicated namespaces for bioinformatics tools ensure minimum standards of the containers and their long-term availability. Additionally, they have mechanisms in place to also support a wider range of platforms, such as BioConda.
Fields: "all"
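As a rough illustration of the silver and gold items, the namespace of an image reference can be extracted and compared against a list of shared namespaces. This is only a sketch: `namespace_of`, `is_shared_namespace`, and the `SHARED_NAMESPACES` set are hypothetical names, and which namespaces count as institutional is an assumption that each community would have to define. The image names are illustrative.

```python
# Assumed list of non-personal namespaces; "biocontainers" is the one
# named in this proposal, any others would have to be agreed on.
SHARED_NAMESPACES = {"biocontainers"}


def namespace_of(image: str) -> str:
    """Return the namespace part of an image reference, or '' if there is none."""
    name = image.split("@", 1)[0]  # drop a digest, if present
    parts = name.split("/")
    parts[-1] = parts[-1].split(":", 1)[0]  # drop the tag from the image name
    # A first component with a dot or a port, or 'localhost', is a registry host.
    if len(parts) > 1 and ("." in parts[0] or ":" in parts[0] or parts[0] == "localhost"):
        parts = parts[1:]
    return "/".join(parts[:-1])


def is_shared_namespace(image: str) -> bool:
    """Return True if the image lives in one of the assumed shared namespaces."""
    return namespace_of(image) in SHARED_NAMESPACES


# Illustrative references: a personal namespace and the BioContainers one.
print(is_shared_namespace("jdoe/mytool:0.1"))
print(is_shared_namespace("quay.io/biocontainers/mytool:1.0.0"))
```

Such a check cannot judge long-term availability by itself; it only flags references that sit outside an agreed list of namespaces.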
Hi guys,
I've now added the proposed new items to the document. If you agree I'll close this issue and we will continue the discussion (if needed) on the different items.