Feature request: genome sizes
bricoletc opened this issue · 3 comments
Could the genomes subcommand conceivably output the (actual or estimated) size of each available genome? I think this is a useful bit of info, either to recall for a given species or to help decide which genome to choose.
Is that metadata included in what is downloaded from the various providers?
If this feature is of interest and feasible, I'd be happy to put in a PR.
hey @bricoletc!
We could get the total size of each genome from the NCBI assembly reports, but that would require reading each report separately. With an additional CLI/API flag to turn this on, I see no downside to this :)
I'll have a look at this soon-ish.
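For illustration, reading a total size out of a single assembly report could look roughly like the sketch below. This is not genomepy's implementation. It assumes the usual tab-separated report layout, with comment lines starting with # and the Sequence-Length value in the ninth column, and it runs on a made-up toy report rather than a real download. The function name total_size_from_report is hypothetical.

```python
import io

# Toy stand-in for an NCBI assembly report; names and lengths are made up.
TOY_REPORT = (
    "# Assembly name:  toy_assembly\n"
    "# Sequence-Name\tSequence-Role\tAssigned-Molecule\t"
    "Assigned-Molecule-Location/Type\tGenBank-Accn\tRelationship\t"
    "RefSeq-Accn\tAssembly-Unit\tSequence-Length\tUCSC-style-name\n"
    "1\tassembled-molecule\t1\tChromosome\tna\t=\tna\tPrimary Assembly\t1000\tchr1\n"
    "2\tassembled-molecule\t2\tChromosome\tna\t=\tna\tPrimary Assembly\t500\tchr2\n"
)


def total_size_from_report(handle, length_column=8):
    """Sum the sequence lengths of all non-comment rows in a report.

    length_column is the zero-based index of the Sequence-Length column
    (assumed to be the ninth column).
    """
    total = 0
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment lines and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) > length_column and fields[length_column].isdigit():
            total += int(fields[length_column])
    return total


print(total_size_from_report(io.StringIO(TOY_REPORT)))
```

Since every genome needs its own report to be fetched and parsed like this, keeping it behind an opt-in flag avoids slowing down the default listing.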
I also checked if we can get this from the metadata of each provider. Unfortunately, only Ensembl has this info readily available.
If that is enough, you can access it with:
import genomepy as gp
ens = gp.providers.create("ensembl")
# total genome size
print(ens.genomes["GRCh38.p13"]["base_count"])
# total genome size from search results
search_term = "GRC"
for hit in ens.search(search_term):
    name = hit[0]
    # -1 marks genomes without a base_count entry
    print(name, ens.genomes[name].get("base_count", -1))
Once a genome is installed locally, though, getting the approximate effective genome size is trivial. A benefit of this approach is that the number reflects the processed genome as it exists on disk:
import genomepy as gp
hg38 = gp.Genome('hg38')
# approximate effective genome size
sum(hg38.sizes.values()) - sum(hg38.gaps.values())
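To make the arithmetic concrete without needing an installed genome, here is a small sketch on toy data. It assumes, as the snippet above suggests, that sizes and gaps are both dictionaries keyed by contig name, holding the contig length and the number of gap (N) bases respectively. The helper name effective_size and the numbers are made up.

```python
def effective_size(sizes, gaps):
    """Total sequence length minus the total number of gap (N) bases."""
    return sum(sizes.values()) - sum(gaps.values())


# Toy values, not taken from any real genome.
sizes = {"chr1": 1000, "chr2": 500}
gaps = {"chr1": 100}  # contigs without gaps may simply be absent

print(effective_size(sizes, gaps))  # 1400
```

The result is only approximate as an "effective" size, since it removes gap bases but not, for example, repetitive regions.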
With genomepy 0.13 you can get the absolute genome sizes with genomepy search and genomepy genomes using the --size argument!