vanheeringen-lab/genomepy

Feature request: genome sizes

bricoletc opened this issue · 3 comments

Could the genomes subcommand conceivably output the (actual, or estimated) size of each available genome? I think this is a useful bit of info, either just to recall for a given species or to help decide which genome to choose.

Is that metadata available in what is downloaded from the various providers?

If this feature is of interest/feasible I'd be happy to put in a PR

hey @bricoletc!

We could get the total genome size of genomes using the NCBI assembly reports, but that would require each report to be read separately. With an additional CLI/API flag to turn this on, I see no downside to this :)
I'll have a look at this soon-ish.
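For anyone curious what reading a report would involve, here is a minimal sketch. It is not genomepy's implementation. It assumes the report is the usual tab-separated NCBI `*_assembly_report.txt` text, where the last `#` line names the columns and one of them is `Sequence-Length`. The function name `total_length_from_report` is made up for illustration.

```python
def total_length_from_report(report_text):
    """Sum the Sequence-Length column of an assembly report (hypothetical helper)."""
    header = None
    total = 0
    for line in report_text.splitlines():
        if line.startswith("#"):
            # Assumption: the last comment line before the data holds the column names.
            header = line.lstrip("# ").split("\t")
            continue
        if not line.strip():
            continue
        if header is None or "Sequence-Length" not in header:
            raise ValueError("no 'Sequence-Length' column found in the report header")
        value = line.split("\t")[header.index("Sequence-Length")]
        # Skip rows without a numeric length instead of failing on them.
        if value.isdigit():
            total += int(value)
    return total
```

Since every assembly has its own report, doing this for a whole `genomes` listing means one download per genome, which is why it makes sense to hide it behind an opt-in flag.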

I also checked if we can get this from the metadata of each provider. Unfortunately, only Ensembl has this info readily available.
If that is enough, you can access it with:

import genomepy as gp

ens = gp.providers.create("ensembl")

# total genome size
print(ens.genomes["GRCh38.p13"]["base_count"])

# total genome size from search results
search_term = "GRC"
for hit in ens.search(search_term):
    name = hit[0]
    print(name, ens.genomes[name].get("base_count", -1))

Once a genome is installed locally, though, getting the approximate effective genome size is trivial, and the number you get is for the processed genome:

import genomepy as gp

hg38 = gp.Genome('hg38')

# approximate effective genome size: total contig length minus gaps
print(sum(hg38.sizes.values()) - sum(hg38.gaps.values()))
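To make the arithmetic explicit without needing an installed genome, here is a small self-contained sketch on toy sequences. It assumes "gaps" means the number of N bases per contig. The `effective_size` helper and the toy data are hypothetical and only mirror the `sizes` minus `gaps` calculation above.

```python
def effective_size(sequences):
    """Total length minus N bases, for a dict of contig name -> sequence."""
    # Contig lengths, analogous to the sizes mapping used above.
    sizes = {name: len(seq) for name, seq in sequences.items()}
    # N counts per contig (case-insensitive), assumed analogous to the gaps mapping.
    gaps = {name: seq.upper().count("N") for name, seq in sequences.items()}
    return sum(sizes.values()) - sum(gaps.values())


toy = {"chr1": "ACGTNNNNACGT", "chr2": "nnACGT"}
print(effective_size(toy))  # 18 bases in total, 6 of them N, so this prints 12
```

This is only approximate as an "effective" size: it subtracts unsequenced bases but says nothing about mappability.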

With genomepy 0.13 you can get the absolute genome sizes from genomepy search and genomepy genomes with the --size argument!

Omg that's amazing @siebrenf !!
Just tried it and works like a charm 🥳
Thanks 🚀