Small API proposal: find out which variant of the genome is used, such as GRCh38.p13

Question

Closed this issue 2 years ago · 4 comments

Hey there kojix2,

I have a small-ish API suggestion to make. Please, as always, feel free to consider, modify,
or ignore it as you see fit.

Currently we can load the genome dataset as in the API example you show:

hg38 = Bio::TwoBit.open("BSgenome.Hsapiens.UCSC.hg38/inst/extdata/single_sequences.2bit")

I wanted to check whether the subsequences we can obtain, for example via

hg38.sequence("chr1", 50000, 50050)

match, for instance, the dataset here:

https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.39/

The variant listed there at NCBI is:

Genome Reference Consortium Human Build 38 patch release 13 (GRCh38.p13)

Would it be possible to give class Bio::TwoBit the option to output:

from which file/URL it was obtained?

and

which variant it carries?

I leave it up to you which API names may seem best to pick here; I have a tendency
to use insanely long method names, whereas I believe you prefer shorter and more
succinct variants - all fine by me. If possible the main README could show the API
you would use in this regard as a fast entry point for new users as well.
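
One possible shape for such an API, as a minimal sketch only: a thin wrapper that carries user-supplied labels alongside any object that responds to `sequence`. The names `AnnotatedGenome`, `source_url`, and `assembly` are hypothetical and not part of the twobit gem; the wrapper stores what the caller tells it and does not detect the assembly from the file.

```ruby
require "delegate"

# Hypothetical wrapper: AnnotatedGenome, source_url and assembly are
# illustrative names, not part of the twobit gem.
class AnnotatedGenome < SimpleDelegator
  attr_reader :source_url, :assembly

  # genome:     any object that responds to #sequence (e.g. an opened 2bit file)
  # source_url: where the file was downloaded from (supplied by the caller)
  # assembly:   assembly label such as "GRCh38.p13" (supplied by the caller)
  def initialize(genome, source_url:, assembly:)
    super(genome) # all other calls, such as #sequence, are forwarded to genome
    @source_url = source_url
    @assembly = assembly
  end
end
```

Wrapping the object returned by Bio::TwoBit.open this way would keep calls like hg38.sequence("chr1", 50000, 50050) unchanged, while hg38.assembly and hg38.source_url return whatever labels were passed in.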

Rationale for the proposal: I believe it would be useful for users to find out
instantly which dataset they are working with, e.g. GRCh38.p13 or another
one. Sequences may be identical between revisions, as I understand it, but the
sequence for a given species is not always 100% correct, so later
revisions may show some differences. I think this is useful to know, in
particular for writing reproducible scripts, some of which may
be automated at a later time.
(For hardcoded paths/URLs that rarely change, the above may not be as
important, but I still think it would be useful to have this functionality.
Right now I am trying to find out which variant hg38 from UCSC refers
to, for instance.)

PS: Actually, I am referring to the primary URL, such as:

https://bioconductor.org/packages/release/data/annotation/src/contrib/BSgenome.Hsapiens.UCSC.hg38_1.4.4.tar.gz

So my proposal is perhaps not ideal, because that information may be lost,
e.g. if the user downloaded the file with "wget" beforehand. So perhaps the
class could work directly with a URL? This may be quite convenient
for the user, e.g. using open-uri and then a system() call or something similar
for tar. (This may need another API, but again, this is just a suggestion;
I am sure you can judge for yourself whether it is necessary
or not. My point of view is more about convenience and reproducibility
in science. :) )
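
To illustrate the concern about losing the source URL after a manual download, here is a minimal sketch that writes a small sidecar file next to the downloaded data, recording the URL and a SHA-256 checksum. The helper names `write_provenance` and `read_provenance` and the `.source.json` suffix are assumptions made for this example; nothing here is part of the twobit gem.

```ruby
require "digest"
require "json"
require "time"

# Hypothetical helper: records where a local file came from.
# Writes "<path>.source.json" and returns the sidecar's path.
def write_provenance(path, source_url)
  record = {
    "source_url"  => source_url,
    "sha256"      => Digest::SHA256.file(path).hexdigest, # checksum of the local file
    "recorded_at" => Time.now.utc.iso8601
  }
  sidecar = "#{path}.source.json"
  File.write(sidecar, JSON.pretty_generate(record))
  sidecar
end

# Hypothetical helper: reads the record written by write_provenance.
def read_provenance(path)
  JSON.parse(File.read("#{path}.source.json"))
end
```

A script could call write_provenance right after fetching and unpacking the archive, and later compare the stored checksum against the file on disk to confirm it is still working with the same data.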

Answer 1 · 2022-01-03T04:57:18.000Z

Thank you for trying twobit.

twobit was created to leverage the assets of Bioconductor in Ruby. As you may know, Bioconductor is the name of a set of R tools widely used in bioinformatics and the project that supports them.

Ideally, I would like to create a Ruby package that corresponds to the Bioconductor package in R.
https://github.com/ruby-on-bioc/biocgem

The proposed features are certainly necessary, but we need to figure out how to implement them.

Answer 2 · 2022-01-14T03:21:46.000Z

Implemented TwoBit#path to partially solve this problem.

Answer 3 · 2022-01-14T03:47:10.000Z

https://github.com/misshie/bioruby-ucsc-api

Answer 4 · 2023-01-11T12:50:48.000Z

Added a mechanism to automatically download reference genomes using const_missing. There may still be some bugs, but I will fix them.
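
The general idea of resolving a reference genome lazily through const_missing can be illustrated in plain Ruby. This is a sketch of the technique, not the gem's actual code: the module `GenomeRegistry`, the `SOURCES` table, the example URL, and the `fetch` stand-in (which records the request instead of downloading anything) are all hypothetical.

```ruby
# Hypothetical registry illustrating const_missing; not the twobit gem's code.
module GenomeRegistry
  # Assumed mapping from constant name to download URL (illustrative only).
  SOURCES = {
    Hg38: "https://example.org/genomes/hg38.2bit"
  }.freeze

  Reference = Struct.new(:name, :url)

  # Names for which a fetch was triggered, in order.
  def self.fetch_calls
    @fetch_calls ||= []
  end

  # Stand-in for a real download step; no network access happens here.
  def self.fetch(name, url)
    fetch_calls << name
    Reference.new(name, url)
  end

  # Ruby calls this hook when GenomeRegistry::<Name> is not defined yet.
  def self.const_missing(name)
    url = SOURCES[name]
    return super unless url # unknown names still raise NameError

    # Define the constant so later lookups skip this hook entirely.
    const_set(name, fetch(name, url))
  end
end

genome = GenomeRegistry::Hg38 # first access triggers fetch
again  = GenomeRegistry::Hg38 # second access reuses the defined constant
```

Because the constant is defined on first access, the fetch step runs at most once per name within a process.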