vanheeringen-lab/genomepy

please, allow to fix the version of ensembl!

antonkulaga opened this issue · 5 comments

For consistency, there is often a need to pin the Ensembl version so that it is compatible with other data sources, but with the current Python documentation it takes time to work out how to do this. As annotation download for Ensembl 108 is totally broken in genomepy, it would be nice to explain better how to fall back to older versions.

Hi @antonkulaga, would the --Ensembl-version command-line flag work for you? It's under the provider-specific options when you run genomepy install -h :

Provider specific options:
  --Ensembl-toplevel              always download toplevel-genome
  --Ensembl-version INTEGER       select release version
  --UCSC-annotation TEXT          specify annotation to download: ncbiRefSeq,
                                  refGene, ensGene, knownGene (case-
                                  insensitive)
[...]

If there is another issue, please explain what you would like to do in more detail or provide the steps to reproduce the issue.
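For scripted pipelines it can help to pin the release in one place. Below is a minimal Python sketch that only assembles the command line; it is not part of genomepy. The helper name `build_install_cmd` and the assembly name in the usage note are hypothetical. `--Ensembl-version` is taken from the help output quoted above, while `--provider` and `--annotation` are assumed spellings that should be checked against `genomepy install -h` for the installed version.

```python
def build_install_cmd(name, ensembl_version, annotation=False):
    """Build a `genomepy install` command pinned to one Ensembl release.

    Hypothetical helper: it only constructs the argument list and does not
    run anything or contact Ensembl.
    """
    cmd = [
        "genomepy", "install", name,
        "--provider", "Ensembl",  # assumed option spelling
        # int() rejects non-numeric input early with a ValueError
        "--Ensembl-version", str(int(ensembl_version)),
    ]
    if annotation:
        cmd.append("--annotation")  # assumed option spelling
    return cmd
```

The resulting list could then be passed to `subprocess.run(cmd, check=True)`, for example with `build_install_cmd("GRCh38.p13", 107, annotation=True)`.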

Can't speak for @antonkulaga, but --Ensembl-version works well enough for me :)

However, there are a couple of other issues I am facing with Ensembl releases:

  1. If I have already previously downloaded one version of Ensembl annotations for a particular genome, genomepy just silently ignores my request to fetch another version. It would be nice if genomepy were to handle multiple annotation versions. Or, if that is too complicated, override the old version. As a workaround, for now I am including the Ensembl release version in the genomes_dir argument. This works, but is wasteful - because the genome is downloaded for every set of annotations.
  2. Related to 1.: It is unclear to me what will happen if I want to fetch the latest version of Ensembl annotations when I have already fetched another (latest or particular version) to the same directory previously. My expectation would be that a new set of annotations is fetched (overriding the old one, or better, in addition to the old one; see 1.) if Ensembl has released a new version since I have fetched the last set of annotations, and that nothing is fetched if the available annotation is still up to date. However, given the behavior described in 1., my suspicion is that in either case nothing would be fetched. Unfortunately, I can't test this until Ensembl releases a new set of annotations.
  3. When providing an --Ensembl-version that is not available yet (i.e., it is higher than the latest currently available version), genomepy silently downloads the latest one. Following the principle of least surprise, I would expect an error to be raised (which is also what happens when I provide a very old version that is not available anymore). At the very least, it would be nice to have this behavior documented.
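The behavior asked for in point 3 can be sketched in a few lines of Python. This is an illustration of the "fail loudly" expectation, not genomepy's actual implementation: the function name `resolve_ensembl_version` is hypothetical, and the latest (and optionally oldest) available release numbers are assumed to be supplied by the caller rather than looked up from Ensembl.

```python
def resolve_ensembl_version(requested, latest, oldest=None):
    """Return the release to download, or raise if it cannot exist.

    requested: release number asked for by the user, or None for "latest".
    latest:    newest release currently available (assumed known to the caller).
    oldest:    optional oldest release still available.
    """
    if requested is None:
        return latest
    requested = int(requested)
    if requested > latest:
        # Do not silently fall back to the latest release.
        raise ValueError(
            f"Ensembl release {requested} is not available yet (latest is {latest})"
        )
    if oldest is not None and requested < oldest:
        raise ValueError(
            f"Ensembl release {requested} is no longer available (oldest is {oldest})"
        )
    return requested
```

With this check, `resolve_ensembl_version(109, latest=108)` raises a `ValueError` instead of quietly returning 108.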

Hey uniqueg,

genomepy does not override files by default (to do just that, you can pass the --force flag to the install command). This means that downloads from older Ensembl releases aren't overwritten by default either. Ensembl can silently update their annotations between releases, so it's hard to work with programmatically.

Your workaround seems like a good one! Tip: you can also pass the --only-annotation flag to the install command to only download the annotation files.

Your 3rd point is excellent. I'll implement that.

Thanks a lot @siebrenf. Interesting (and scary!) to learn that Ensembl releases may be unstable; I did not know that! The concept of releases and their perceived benefits for reproducibility are probably the major reason why we prefer Ensembl over other providers.

Thanks also for the --only-annotation hint. Between that and --force, plus some additional management logic, we should probably be able to nest different annotations under a given assembly, so that neither the genome nor the annotation files are downloaded again if they are already available. But of course the question is whether that whole strategy is really viable if releases are (or may be) unstable...
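The "additional management logic" mentioned here might look roughly like the following Python sketch. Everything about the layout is an assumption made for illustration: the function name `plan_downloads`, the `annotations/release_<N>` subdirectory, and the file names are hypothetical and are not how genomepy itself organizes files.

```python
from pathlib import Path


def plan_downloads(genomes_dir, name, release):
    """Decide what still needs downloading under a hypothetical nested layout.

    Assumed layout (not genomepy's own):
        <genomes_dir>/<name>/<name>.fa
        <genomes_dir>/<name>/annotations/release_<release>/<name>.annotation.gtf
    """
    root = Path(genomes_dir) / name
    fasta = root / f"{name}.fa"
    annotation_dir = root / "annotations" / f"release_{release}"
    gtf = annotation_dir / f"{name}.annotation.gtf"
    return {
        "genome": not fasta.exists(),    # True means the genome must be fetched
        "annotation": not gtf.exists(),  # True means this release must be fetched
        "annotation_dir": annotation_dir,
    }
```

A wrapper could then download the genome once, fetch each missing release with --only-annotation into a scratch directory, and move the result into `annotation_dir`. Because releases may change silently, a file that already exists is not guaranteed to match what Ensembl currently serves.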

Can you access hash sums for the genome and annotation files via (some of) the providers' APIs? If so, perhaps genomepy could report them. We could then use them, where available, instead of releases/versions to minimize re-downloading files and for better reproducibility.

I had a look at the Ensembl REST API but could not find a simple way to mine the hash sums. That said, I would guess the changes between releases are minor.
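Until hash sums can be obtained from a provider, one option is to record checksums of the files after download and compare them between runs. The sketch below uses only the Python standard library; the function names and the default file patterns are assumptions chosen for illustration, and nothing here is provided by genomepy.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_manifest(directory, patterns=("*.fa", "*.gtf", "*.bed")):
    """Map file name -> SHA-256 for files in `directory` matching `patterns`.

    The default patterns are an assumption about which files matter.
    """
    directory = Path(directory)
    return {
        path.name: file_sha256(path)
        for pattern in patterns
        for path in sorted(directory.glob(pattern))
    }
```

Storing the manifest next to the downloaded files (for example as JSON) would make it possible to detect when a re-download of the "same" release produces different content.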

Improved handling of older Ensembl releases is live in #240, try it with:

pip install git+https://github.com/vanheeringen-lab/genomepy.git@dens