lightweight genotype service

Question


Closed this issue 2 months ago · 1 comments

Don-Isdale commented 2 years ago

draft architecture :

components :

lightweight genotype service
Pretzel

lightweight genotype service

exposes a web API for Pretzel to use :
  getSamples(datasetId)
  getGenotypes(datasetId, samples, region, format)
multiple storage mechanisms, a selection of :
  DivBrowse
  database
  bcftools
  Germinate
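
A minimal sketch of how the two API calls might sit on top of interchangeable storage mechanisms. Only getSamples / getGenotypes and their parameter names come from the draft above; the GenotypeStore interface, the InMemoryStore toy backend, the "chr:start-end" region syntax and the "json" / "tsv" format values are assumptions made for illustration, not a settled design.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Genomic interval; the 'chr:start-end' text form is an assumption."""
    chrom: str
    start: int
    end: int

    @classmethod
    def parse(cls, text):
        chrom, span = text.rsplit(":", 1)
        start, end = span.split("-")
        return cls(chrom, int(start), int(end))


class GenotypeStore(ABC):
    """Hypothetical interface each storage mechanism would implement."""

    @abstractmethod
    def get_samples(self, dataset_id): ...

    @abstractmethod
    def get_genotypes(self, dataset_id, samples, region): ...


class InMemoryStore(GenotypeStore):
    """Toy backend standing in for DivBrowse / database / bcftools / Germinate."""

    def __init__(self, datasets):
        # datasets: {dataset_id: {"samples": [...], "rows": [(chrom, pos, {sample: gt})]}}
        self._datasets = datasets

    def get_samples(self, dataset_id):
        return list(self._datasets[dataset_id]["samples"])

    def get_genotypes(self, dataset_id, samples, region):
        rows = self._datasets[dataset_id]["rows"]
        return [
            (chrom, pos, [calls[s] for s in samples])
            for chrom, pos, calls in rows
            if chrom == region.chrom and region.start <= pos <= region.end
        ]


class GenotypeService:
    """Routes the two API calls to whichever store holds the dataset."""

    def __init__(self, stores):
        self._stores = stores  # {dataset_id: GenotypeStore}

    def getSamples(self, datasetId):
        return self._stores[datasetId].get_samples(datasetId)

    def getGenotypes(self, datasetId, samples, region, format="json"):
        store = self._stores[datasetId]
        rows = store.get_genotypes(datasetId, samples, Region.parse(region))
        if format == "tsv":
            header = "\t".join(["chrom", "pos", *samples])
            lines = ["\t".join([c, str(p), *gts]) for c, p, gts in rows]
            return "\n".join([header, *lines])
        return [
            {"chrom": c, "pos": p, "genotypes": dict(zip(samples, gts))}
            for c, p, gts in rows
        ]


store = InMemoryStore({
    "ds1": {
        "samples": ["S1", "S2"],
        "rows": [("chr1", 100, {"S1": "0/0", "S2": "0/1"}),
                 ("chr1", 500, {"S1": "1/1", "S2": "0/1"})],
    }
})
service = GenotypeService({"ds1": store})
```

Keeping the storage mechanism behind one small interface would let Pretzel call the same two functions regardless of which backend holds a given dataset.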

Pretzel

uses the lightweight genotype web API alongside the existing functions which access genotypes from bcftools and Germinate

tasks

large-scale performance comparison of : DivBrowse / database / bcftools / ...
find candidate databases e.g.
https://www.google.com/search?&q=column+storage+genotype+database

selected search results :
https://academic.oup.com/database/article/doi/10.1093/database/baz096/5566651
"... HDF5 consistently performed best, in part because it can more efficiently work with both orientations of the allele matrix."
"To determine the ideal technology to serve as the backend of the GOBii-GDM, testing was performed using a large genotype-by-sequencing (GBS) dataset (15, 16, 19). Open-source RDBMS, PostgreSQL and MariaDB, a community-developed fork under the GNU GPL of MySQL, were used as a baseline for performance testing and compared with HDF5, MonetDB, Elasticsearch (17), Spark (18), and MongoDB. "
https://github.com/gobiiproject/GOBii-System

https://www.intel.in/content/dam/www/public/us/en/documents/white-papers/genomics-storing-genome-data-paper.pdf
"... The low-level storage format enables faster and more efficient retrievals from disk compared to the use of files. Additionally, using libraries optimized for Intel® architecture to compress data on disk, GenomicsDB cumulatively achieves orders of magnitude improvement in performance compared to existing tools. In addition, the generalized multi-dimensional array model provides flexibility for GenomicsDB to be extended to other types of genome data. ... "

check whether one DivBrowse server provides access to a single .vcf.gz file
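
For the performance comparison task listed above, a rough timing harness could look like the following. It is only a sketch: the adapter functions are placeholders (hypothetical names) where real calls to DivBrowse, a database, bcftools, etc. would go, and a large-scale comparison would also need realistic datasets, cold/warm cache runs and memory measurements.

```python
import statistics
import time


def time_backend(fetch, queries, repeats=3):
    """Return the median wall-clock seconds per query for one backend.

    fetch   -- callable taking (dataset_id, samples, region); hypothetical adapter
    queries -- list of (dataset_id, samples, region) tuples to replay
    """
    timings = []
    for _ in range(repeats):
        for query in queries:
            started = time.perf_counter()
            fetch(*query)
            timings.append(time.perf_counter() - started)
    return statistics.median(timings)


def compare_backends(backends, queries, repeats=3):
    """Time every backend on the same queries; fastest first."""
    results = {
        name: time_backend(fetch, queries, repeats)
        for name, fetch in backends.items()
    }
    return sorted(results.items(), key=lambda item: item[1])


# Placeholder adapters; real ones would call DivBrowse, a database, bcftools, ...
def fake_fast(dataset_id, samples, region):
    return []


def fake_slow(dataset_id, samples, region):
    time.sleep(0.01)  # simulate a slower storage mechanism
    return []


queries = [("ds1", ["S1", "S2"], "chr1:1-1000000")]
ranking = compare_backends({"fast": fake_fast, "slow": fake_slow}, queries)
```

Replaying the same query list against every backend keeps the comparison like-for-like; the median is used so a single slow outlier does not dominate.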

The lightweight genotype service will need to track large numbers of samples / assays / VCFs, so it will require database capability.
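
As a sketch of what that tracking could involve, here is a small SQLite catalogue. The table and column names are hypothetical, and modelling an assay as a (sample, VCF) pair is an assumption for illustration; the actual schema and database choice are open.

```python
import sqlite3

# Hypothetical catalogue schema; names are illustrative only.
SCHEMA = """
CREATE TABLE vcf_file (
    id         INTEGER PRIMARY KEY,
    dataset_id TEXT NOT NULL,
    path       TEXT NOT NULL UNIQUE
);
CREATE TABLE sample (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
-- Assumption: one assay row per (sample, VCF) pair.
CREATE TABLE assay (
    sample_id INTEGER NOT NULL REFERENCES sample(id),
    vcf_id    INTEGER NOT NULL REFERENCES vcf_file(id),
    PRIMARY KEY (sample_id, vcf_id)
);
"""


def register_vcf(conn, dataset_id, path, sample_names):
    """Record a VCF and the samples it contains; return the new VCF id."""
    vcf_id = conn.execute(
        "INSERT INTO vcf_file (dataset_id, path) VALUES (?, ?)",
        (dataset_id, path),
    ).lastrowid
    for name in sample_names:
        # Samples are shared across VCFs, so only insert unseen names.
        conn.execute("INSERT OR IGNORE INTO sample (name) VALUES (?)", (name,))
        (sample_id,) = conn.execute(
            "SELECT id FROM sample WHERE name = ?", (name,)
        ).fetchone()
        conn.execute(
            "INSERT INTO assay (sample_id, vcf_id) VALUES (?, ?)",
            (sample_id, vcf_id),
        )
    conn.commit()
    return vcf_id


def vcfs_for_sample(conn, name):
    """List the VCF paths in which a sample appears, sorted by path."""
    rows = conn.execute(
        "SELECT v.path FROM vcf_file v "
        "JOIN assay a ON a.vcf_id = v.id "
        "JOIN sample s ON s.id = a.sample_id "
        "WHERE s.name = ? ORDER BY v.path",
        (name,),
    )
    return [path for (path,) in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
register_vcf(conn, "ds1", "a.vcf.gz", ["S1", "S2"])
register_vcf(conn, "ds1", "b.vcf.gz", ["S2", "S3"])
```

With a catalogue like this, a getGenotypes request could first look up which VCFs contain the requested samples before touching any genotype storage.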

Answer 1 · 2024-11-08T01:50:34.000Z

This work is now part of Genolink: https://github.com/plantinformatics/genolink