biopragmatics/bioversions

Lighter dependency installation for reading versions


The dependencies are a bit heavy:

bioversions/setup.cfg

Lines 45 to 60 in 5440e5a

install_requires =
    requests
    requests_ftp
    beautifulsoup4
    cachier<=1.5.0
    pystow>=0.1.0
    click
    click_default_group
    dataclasses; python_version < "3.7"
    dataclasses_json
    tabulate
    more_click
    pyyaml
    tqdm
    bioregistry>=0.2.6
    lxml

I imagine this is because these dependencies are required to generate docs/_data/versions.yml.

For related-sciences/ensembl-genes#1, I just want to get the latest version of a resource like:

import bioversions

ensembl_version = bioversions.get_version("ensembl")

Would it make sense to have a dependency set that just supports reading versions that have already been aggregated into versions.yml?

Users can just get this by URL, but is the format stable? Is there also a JSON version that would avoid having to install a YAML parser?

I just made a note in related-sciences/ensembl-genes#1 (comment) on how to do this by directly getting some JSON. The format is stable so I'd say you can depend on it looking like this. Maybe I will add an additional metadata field or two from the bioregistry for convenience in the future, but that wouldn't break anything.

I guess it's the case that there are a lot of dependencies, but most of them are small utilities that I'd expect most environments to have if they're installing any other common stuff. For things like pystow and the bioregistry, I have been careful to keep them as lean as possible so they don't install a lot of transitive dependencies. I would be hesitant to remove some of the dependencies like pyyaml, lxml, beautifulsoup4, requests_ftp because most users won't/shouldn't have to know which ones will be used by each getter. I think it would be pretty confusing to have a lean version of bioversions that just supports looking stuff up in the JSON when most usage of this package directly is to interact with the sites on demand.

> most usage of this package directly is to interact with the sites on demand

I see. I like having your CI do the interaction and for us to consume the output.

The JSON approach is simple enough for us:

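import requests

# Aggregated versions file published in the bioversions repository.
url = "https://raw.githubusercontent.com/biopragmatics/bioversions/main/src/bioversions/resources/versions.json"
response = requests.get(url, timeout=30)
response.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error page
res_json = response.json()
# Map each prefix to the version of its last-listed release.
versions = {
    entry["prefix"]: entry["releases"][-1]["version"]
    for entry in res_json["database"]
    if "prefix" in entry
}
ensembl_version = versions["ensembl"]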
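
For anyone who wants to test this without network access, here is a minimal sketch that separates the parsing from the download. It assumes the same structure as the snippet above (a top-level "database" list whose entries carry "prefix" and a "releases" list with the newest release last). `latest_versions` is a hypothetical helper, not part of bioversions, and the sample payload is made up for illustration.

```python
from typing import Any, Dict


def latest_versions(data: Dict[str, Any]) -> Dict[str, str]:
    """Map each prefix to the version of its last-listed release."""
    versions: Dict[str, str] = {}
    for entry in data.get("database", []):
        releases = entry.get("releases") or []
        # Skip entries that lack a prefix or have no recorded releases.
        if "prefix" in entry and releases:
            versions[entry["prefix"]] = releases[-1]["version"]
    return versions


# Made-up payload for illustration only; real version strings will differ.
sample = {
    "database": [
        {"name": "Ensembl", "prefix": "ensembl",
         "releases": [{"version": "103"}, {"version": "104"}]},
        {"name": "No prefix", "releases": [{"version": "1"}]},
        {"name": "Empty", "prefix": "empty", "releases": []},
    ]
}
print(latest_versions(sample))
```

Unlike the dict comprehension, this also skips entries with an empty "releases" list rather than raising an IndexError; whether such entries actually occur in versions.json is an assumption.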

> I'd expect most environments to have if they're installing any other common stuff

I noticed because lots of packages were added to poetry.lock in related-sciences/ensembl-genes@8f3ac75.