singularityhub/shpc-registry

Automatic update biocontainer versions from quay.io

Closed this issue · 14 comments

Is your feature request related to a problem? Please describe.
The versions on quay.io for biocontainers are sometimes newer than the ones in the shpc-registry.

Describe the solution you'd like
For example, if I search for fastp with shpc show -f fastp --versions, I get 2 versions, as declared here: https://github.com/singularityhub/shpc-registry/blob/main/quay.io/biocontainers/fastp/container.yaml But quay.io has many more: https://quay.io/repository/biocontainers/fastp?tab=tags

$ shpc show -f fastp --versions
quay.io/biocontainers/bioconductor-rfastp:1.4.0--r41hc247a5b_2
quay.io/biocontainers/bioconductor-rfastp:1.8.0--r42hc247a5b_0
quay.io/biocontainers/bioconductor-rfastp:1.8.0--r42hf17093f_1
quay.io/biocontainers/fastp:0.22.0--h2e03b76_0
quay.io/biocontainers/fastp:0.23.2--h5f740d0_3
quay.io/biocontainers/fastpca:0.9.1
quay.io/biocontainers/fastphylo:1.0.3--h648b6df_5
quay.io/biocontainers/fastphylo:1.0.3--h65d3618_7

Also, the registry list is not always up to date with quay.io. For example, spades version 3.15.5 is not there yet, although it was released almost a year ago.

$ shpc show -f spades --versions
ghcr.io/autamus/spades:3.15.2
ghcr.io/autamus/spades:3.15.3
ghcr.io/autamus/spades:latest
ghcr.io/autamus/spades:3.15.5
quay.io/biocontainers/spades:3.15.4--h95f258a_0

Maybe it is possible to update the registry automatically with a GitHub bot?

Describe alternatives you've considered
Nothing really, apart from using singularity/apptainer directly.

vsoch commented

Hi @mdehollander !

Maybe it is possible to update the registry automatically with a GitHub bot?

We indeed do have a GitHub bot - you can see his daily pushes here: https://github.com/singularityhub/shpc-registry/commits/main

The issue when we added over 8K containers is that running updates for all of them at once did not fit in the scope of a run, so we adopted the strategy to update them incrementally. This means they are not always up to date, and if there are a particular set you'd like to update, you can see the shpc update command to find the new versions, and please open a PR here to get them into the registry proper.

We could definitely tweak our update schedule to be run multiple times in a day (and thus more frequently) but we'd need a good algorithm for determining the set! Right now we do it based on assigning groups to the day of the month. If you have ideas you'd like to test we would welcome the contribution!
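For anyone who wants to experiment with the grouping, here is a minimal sketch of one way to assign containers to days of the month. It is not the registry's actual implementation, which is not shown in this thread: the function names `group_for` and `due_on`, the hash-based assignment, and the choice of 28 groups are all assumptions made for illustration.

```python
import hashlib
from datetime import date


def group_for(name, groups=28):
    """Map a container name to a stable group number in 1..groups."""
    # A content hash keeps the assignment stable when containers are added.
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
    return int(digest, 16) % groups + 1


def due_on(names, day, groups=28):
    """Return the names whose group matches the given day of the month."""
    # Days 29-31 wrap around to groups 1-3 so every day has work to do.
    group = (day - 1) % groups + 1
    return [name for name in names if group_for(name, groups) == group]


if __name__ == "__main__":
    containers = [
        "quay.io/biocontainers/fastp",
        "quay.io/biocontainers/spades",
        "quay.io/biocontainers/bowtie2",
    ]
    print(due_on(containers, date.today().day))
```

Running the job several times a day would amount to raising `groups` and deriving the group from the day and hour instead of the day alone.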

I did notice the bot after I posted the message. Good that it is there, and I understand the limitations.

It seems shpc update does not pick up the newer spades version:

$ shpc update quay.io/biocontainers/spades --dry-run
Looking for updated digests for quay.io/biocontainers/spades
>> quay.io/biocontainers/spades
>> Latest
3.15.4--h95f258a_0:sha256:7dfda44ae2535ba1ccc7c60c2ec265f8672cfd45885f458a964daf1b839a7ec1
>> Tags
3.15.4--h95f258a_0:sha256:7dfda44ae2535ba1ccc7c60c2ec265f8672cfd45885f458a964daf1b839a7ec1

The 3.15.5 tag is also missing from the crane website that is being queried (https://crane.ggcr.dev/ls/quay.io/biocontainers/spades), so that seems to be the problem.

How does updating work? Because when I remove the dry-run option, it gives an error: ValueError: Remote save to a GitHub registry is not supported. It seems you can only update local registries.

vsoch commented

How does updating work? Because when I remove the dry-run option, it gives an error: ValueError: Remote save to a GitHub registry is not supported. It seems you can only update local registries.

That's correct - the update is making changes to a local file. You'd want to run the update on your clone of the registry and open a PR. It wouldn't make sense to somehow give anyone global access to edit a recipe in a repository they don't own from the command line.

If there is a version you like and it's not being picked up by crane, you could also add it manually (and just grab the digest). If this is a larger issue with crane we would need to use a different approach.

Ok, thanks. I know enough now. I will check if this happens more often and look at the version at crane. If needed, I will indeed open a PR.

I would like to reopen this issue, since I think it is a general problem, although not with the shpc-registry directly but with crane.ggcr.dev. What would be the best approach to solve this?

For at least several packages I tried, shpc does not get the latest version because ggcr.dev does not list it. For example, bowtie2, diamond, spades and samtools are not listed with their latest versions.

With shpc the latest version of bowtie2 is 2.4.5:

$ shpc show -f bowtie2 --versions
ghcr.io/autamus/bowtie2:2.4.2
ghcr.io/autamus/bowtie2:latest
quay.io/biocontainers/bioconductor-rbowtie2:2.0.0--r41he06c1ba_2
quay.io/biocontainers/bioconductor-rbowtie2:2.4.0--r42he06c1ba_0
quay.io/biocontainers/bioconductor-rbowtie2:2.4.0--r42h639f7a0_1
quay.io/biocontainers/bioconductor-rbowtie2:2.6.0--r43h639f7a0_0
quay.io/biocontainers/bowtie2:2.3.5.1--py37he513fc3_0
quay.io/biocontainers/bowtie2:2.4.5--py36hd4290be_0

This can also be seen in the registry: https://github.com/singularityhub/shpc-registry/blob/main/quay.io/biocontainers/bowtie2/container.yaml

url: https://biocontainers.pro/tools/bowtie2
maintainer: '@vsoch'
description: shpc-registry automated BioContainers addition for bowtie2
latest:
  2.4.5--py36hd4290be_0: sha256:7c547046fcb6f742789a741ef52289f174edb75d46db6d835d654e673cd2dafc

On ggcr.dev (https://crane.ggcr.dev/ls/quay.io/biocontainers/bowtie2) version 2.4.5 is not listed; the listing stops at 2.3.4.

The latest version of bowtie2 is 2.5.2: https://github.com/BenLangmead/bowtie2/releases

And you get this with micromamba:

$ micromamba search -c bioconda bowtie2
Getting repodata from channels...

bioconda/linux-64                                           Using cache
bioconda/noarch                                             Using cache


bowtie2 2.5.2 py39h6fed5c7_0 (+ 2 builds)
_________________________________________

  Name            bowtie2
  Version         2.5.2

Or with bioconda2biocontainer:

$ bioconda2biocontainer --package_name bowtie2 | sort -k2 -r -n
bowtie2-2.5.2	2.5.2	http://api.biocontainers.pro/ga4gh/trs/v2/tools/bowtie2/versions/bowtie2-2.5.2

Apart from manually updating the shpc-registry with PRs, is there something that can be done to get the latest versions of tools? Should someone from ggcr.dev be contacted?

vsoch commented

@mdehollander what about using oras? Here is the command line (in Go) tool:

oras repo tags quay.io/biocontainers/bowtie2 | sort

https://oras.land/docs/installation

Coming back to this issue a bit late, but I ran into the limitation again. I checked with the crane command locally, and there I get more recent versions than on the web version at https://crane.ggcr.dev/.

To get the latest version of bowtie2 installed with shpc install quay.io/biocontainers/bowtie2, where would the fix be needed: at the crane.ggcr.dev website or in the shpc-registry?

Running go/bin/crane ls quay.io/biocontainers/bowtie2 locally gives:

2.5.1--py38he00c5e5_2
2.5.2--py310ha0a81b8_0
2.5.2--py38he00c5e5_0
2.5.2--py39h6fed5c7_0
2.5.3--py38he00c5e5_0
2.5.3--py39h6fed5c7_0
2.5.3--py310ha0a81b8_0

On crane.ggcr.dev the latest version is 2.3.4.3: https://crane.ggcr.dev/ls/quay.io/biocontainers/bowtie2

Where would oras come into play? Would I run that myself or is it something that can be integrated with shpc and replace crane?

The issue is that crane cuts the tag listing to 50, and it’s not sorted. If ORAS has an endpoint to list tags (all tags) we can replace crane and improve upon that here.

I do believe the Oras cli in go has that, so if you can find the underlying call (e.g url and params) that should be enough for me to fix here. Thank you!

Do you use the crane web listing? That is indeed limited to 50 tags, but the command gives all tags. If you sort them and compare them to the oras cli output, the results are identical:

$ ./go/bin/crane ls quay.io/biocontainers/spades | sort -t '-' -k 1,2 -V -r
3.15.5--h95f258a_1
3.15.5--h95f258a_0
3.15.4--h95f258a_0
3.15.3--h95f258a_1
3.15.3--h95f258a_0
3.15.2--h95f258a_1
3.15.2--h633aebb_0
...
$ oras repo tags quay.io/biocontainers/spades | sort -t '-' -k 1,2 -V -r
3.15.5--h95f258a_1
3.15.5--h95f258a_0
3.15.4--h95f258a_0
3.15.3--h95f258a_1
3.15.3--h95f258a_0
3.15.2--h95f258a_1
3.15.2--h633aebb_0
...

Another example with bowtie2, which has 188 tags. The crane cli gives the latest release:

$ ./go/bin/crane ls quay.io/biocontainers/bowtie2 | sort -t '-' -k 1,2 -V -r | head -n 1
2.5.3--py39h6fed5c7_0
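Since the registry's update code is Python, the same "sort and take the first" step can be sketched there as well. This is a minimal illustration, not shpc's own logic: `tag_sort_key` and `latest_tag` are hypothetical names, and the tie-break between builds of the same version (a plain string comparison of the build part) is an assumption that only roughly mirrors what `sort -V -r` does.

```python
import re


def tag_sort_key(tag):
    """Split 'VERSION--BUILD' and return a key that orders by numeric version."""
    version, _, build = tag.partition("--")
    # Compare version components as integers so 2.10 sorts after 2.9.
    numeric = tuple(int(part) for part in re.findall(r"\d+", version))
    return numeric, build


def latest_tag(tags):
    """Return the tag with the highest version; builds tie-break as strings."""
    return max(tags, key=tag_sort_key)


if __name__ == "__main__":
    tags = [
        "2.5.1--py38he00c5e5_2",
        "2.5.2--py310ha0a81b8_0",
        "2.5.3--py38he00c5e5_0",
        "2.5.3--py39h6fed5c7_0",
        "2.5.3--py310ha0a81b8_0",
    ]
    print(latest_tag(tags))
```

Tags without a numeric version, such as `latest`, get an empty version tuple and therefore sort below every numbered release.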

Does this help to fix it here?

Looks like they are using the native API (https://github.com/google/go-containerregistry/blob/8b3c3036d612bcb3c1147fe11c2d1818dc432329/pkg/v1/remote/list.go#L52-L62), and I suspect I fell back to crane because there is a lot of variation in the auth URL. We'd basically need to reproduce that in Python, and here is an example: https://github.com/al4/docker-registry-list/blob/master/docker-registry-list.py

The file is shpc/main/container/update/docker.py. For crane, some images seem to return all tags, e.g., https://crane.ggcr.dev/ls/ubuntu

I'm reading that quay doesn't follow the distribution spec, which is probably why it's a special case. It looks like I can add the limit and page parameters to get unique tags:

curl -X GET "https://quay.io/api/v1/repository/biocontainers/bowtie2/tag/?limit=100&page=1" | jq -r '.tags[].name' | sort | uniq
curl -X GET "https://quay.io/api/v1/repository/biocontainers/bowtie2/tag/?limit=100&page=2" | jq -r '.tags[].name' | sort | uniq

Aside from quay, are there other registries that give a shortened response? If this is just a quay bug we can look for quay.io and use the above endpoints (with the params shown, and likely paginate the response) directly.
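A rough Python sketch of that pagination, using only the endpoint and the limit/page parameters from the curl commands above. It is not the code from the eventual fix. The names `fetch_page` and `list_all_tags` are hypothetical, and the stopping condition assumes the JSON response carries a `has_additional` flag next to `tags`; an empty page is treated as the end as well, in case that assumption does not hold.

```python
import json
import urllib.request

QUAY_TAG_URL = "https://quay.io/api/v1/repository/{repo}/tag/?limit={limit}&page={page}"


def fetch_page(repo, page, limit=100):
    """Fetch one page of tag metadata from the quay.io API as a dict."""
    url = QUAY_TAG_URL.format(repo=repo, limit=limit, page=page)
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def list_all_tags(repo, fetch=fetch_page, max_pages=100):
    """Collect unique tag names across pages until no more are reported."""
    names = set()
    for page in range(1, max_pages + 1):
        data = fetch(repo, page)
        tags = data.get("tags", [])
        # A set removes repeated names, like `sort | uniq` in the shell version.
        names.update(tag["name"] for tag in tags)
        # Assumption: "has_additional" signals further pages; stop on an
        # empty page too so the loop cannot run on needlessly.
        if not tags or not data.get("has_additional", False):
            break
    return sorted(names)


if __name__ == "__main__":
    print(list_all_tags("biocontainers/bowtie2"))
```

The `fetch` argument is injectable so the loop can be exercised without network access, and `max_pages` is a safety cap rather than anything the API requires.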

okay give this a whirl singularityhub/singularity-hpc#671

I don't have a local registry on this machine, but a quick run shows it finds the latest 2.5.3.

$ shpc update quay.io/biocontainers/bowtie2
Looking for updated digests for quay.io/biocontainers/bowtie2
>> quay.io/biocontainers/bowtie2
>> Latest
- 2.4.5--py36hd4290be_0:sha256:7c547046fcb6f742789a741ef52289f174edb75d46db6d835d654e673cd2dafc
+ 2.5.3--py310ha0a81b8_0:sha256:d47b6d436f788475fc853b2ab956daaa3c5a390e4c39e2c275740fa92b6c3b9c
>> Tags
+ 2.5.3--py310ha0a81b8_0:sha256:d47b6d436f788475fc853b2ab956daaa3c5a390e4c39e2c275740fa92b6c3b9c
+ 2.4.5--py37hb24965f_4:sha256:7b9324d3f60f40157bb8194eafd3b9d56682aafc7c0317f83398cdae7fa5dee2
2.4.5--py36hd4290be_0:sha256:7c547046fcb6f742789a741ef52289f174edb75d46db6d835d654e673cd2dafc
2.3.5.1--py37he513fc3_0:sha256:361034b738118d023b5ed35b070458864f23bf63de09017ac30d08ff48a815b0

Thanks for the quick fix. And yes, indeed it gets the latest version using the quay.io API. I don't have a local registry here either, so I get a 'Remote save to a GitHub registry is not supported.' error. But it seems to work.

This should be merged and released - I can run it against a local clone of the registry to get updates across the board soon. Thanks @mdehollander !