lightweight genotype service
draft architecture :
components :
- lightweight genotype service
- Pretzel
lightweight genotype service
- exposes a web API for Pretzel to use :
  - getSamples(datasetId)
  - getGenotypes(datasetId, samples, region, format)
- multiple storage mechanisms, a selection of :
  - DivBrowse
  - database
  - bcftools
  - Germinate
Pretzel
- uses the lightweight genotype web API in parallel to the functions which access genotypes from bcftools and Germinate
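
As a rough illustration of the two API calls above, here is a minimal Python sketch that puts getSamples and getGenotypes in front of a pluggable storage backend. Everything beyond the two call names and their parameters is an assumption made for the example: the `chrom:start-end` region syntax, the `json` / `tsv` format values, the `GenotypeBackend` interface, and the toy `InMemoryBackend` that stands in for DivBrowse, a database or bcftools.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Region:
    chrom: str
    start: int  # inclusive
    end: int    # inclusive


def parse_region(text):
    """Parse 'chrom:start-end' (an assumed syntax, not specified in the issue)."""
    chrom, _, span = text.rpartition(":")
    start, _, end = span.partition("-")
    return Region(chrom, int(start), int(end))


class GenotypeBackend(Protocol):
    """Hypothetical interface that each storage mechanism would implement."""

    def get_samples(self, dataset_id): ...

    def get_genotypes(self, dataset_id, samples, region): ...


class InMemoryBackend:
    """Toy backend used only to make the sketch runnable."""

    def __init__(self, datasets):
        # datasets: {dataset_id: {"samples": [...], "rows": [(chrom, pos, [gt, ...])]}}
        self._datasets = datasets

    def get_samples(self, dataset_id):
        return list(self._datasets[dataset_id]["samples"])

    def get_genotypes(self, dataset_id, samples, region):
        dataset = self._datasets[dataset_id]
        # Map requested sample names to their column positions.
        columns = [dataset["samples"].index(name) for name in samples]
        return [
            (chrom, pos, [genotypes[c] for c in columns])
            for chrom, pos, genotypes in dataset["rows"]
            if chrom == region.chrom and region.start <= pos <= region.end
        ]


def get_samples(backend, dataset_id):
    return backend.get_samples(dataset_id)


def get_genotypes(backend, dataset_id, samples, region, format="json"):
    rows = backend.get_genotypes(dataset_id, samples, parse_region(region))
    if format == "json":
        return [
            {"chrom": chrom, "pos": pos, "genotypes": dict(zip(samples, gts))}
            for chrom, pos, gts in rows
        ]
    if format == "tsv":
        header = "\t".join(["chrom", "pos", *samples])
        lines = ["\t".join([chrom, str(pos), *gts]) for chrom, pos, gts in rows]
        return "\n".join([header, *lines])
    raise ValueError(f"unsupported format: {format}")
```

With this shape, Pretzel would only depend on the two top-level functions, and each storage mechanism could be swapped in behind the same interface for the comparison.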
tasks
- large-scale performance comparison of : DivBrowse / database / bcftools / ...
find candidate databases e.g.
https://www.google.com/search?&q=column+storage+genotype+database
selected search results :
https://academic.oup.com/database/article/doi/10.1093/database/baz096/5566651
"... HDF5 consistently performed best, in part because it can more efficiently work with both orientations of the allele matrix."
"To determine the ideal technology to serve as the backend of the GOBii-GDM, testing was performed using a large genotype-by-sequencing (GBS) dataset (15, 16, 19). Open-source RDBMS, PostgreSQL and MariaDB, a community-developed fork under the GNU GPL of MySQL, were used as a baseline for performance testing and compared with HDF5, MonetDB, Elasticsearch (17), Spark (18), and MongoDB. "
https://github.com/gobiiproject/GOBii-System
https://www.intel.in/content/dam/www/public/us/en/documents/white-papers/genomics-storing-genome-data-paper.pdf
"... The low-level storage format enables faster and more efficient retrievals from disk compared to the use of files. Additionally, using libraries optimized for Intel® architecture to compress data on disk, GenomicsDB cumulatively achieves orders of magnitude improvement in performance compared to existing tools. In addition, the generalized multi-dimensional array model provides flexibility for GenomicsDB to be extended to other types of genome data. ..."
- check whether one DivBrowse server provides access to a single .vcf.gz
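
For the performance comparison task, one possible starting point is a small timing harness that runs the same request against each storage mechanism and reports the median wall-clock time. This is only a sketch: `time_query` and `compare_backends` are hypothetical names, and each backend is assumed to be wrapped in a zero-argument callable that performs an equivalent query.

```python
import statistics
import time


def time_query(query, repeats=5):
    """Return the median wall-clock seconds over `repeats` calls of query()."""
    durations = []
    for _ in range(repeats):
        started = time.perf_counter()
        query()
        durations.append(time.perf_counter() - started)
    return statistics.median(durations)


def compare_backends(queries, repeats=5):
    """queries: {backend_name: zero-argument callable running the same request}.

    Returns (name, median_seconds) pairs, fastest first.
    """
    results = {name: time_query(query, repeats) for name, query in queries.items()}
    return sorted(results.items(), key=lambda item: item[1])
```

A large-scale comparison would also need to control for caching and vary the number of samples and the region size. The harness above only measures repeated calls as given.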
The lightweight genotype service will need to track large numbers of samples / assays / VCFs, so database capability will be required for that.
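
To make the tracking requirement concrete, here is a minimal sketch of a catalogue that records which samples live in which VCF file, using SQLite from the Python standard library. The table and function names are hypothetical, assays are left out for brevity, and the choice of SQLite is an assumption for illustration rather than a proposal from this issue.

```python
import sqlite3

SCHEMA = """
CREATE TABLE dataset (id TEXT PRIMARY KEY);
CREATE TABLE vcf_file (
    id INTEGER PRIMARY KEY,
    dataset_id TEXT NOT NULL REFERENCES dataset(id),
    path TEXT NOT NULL UNIQUE
);
CREATE TABLE sample (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    vcf_file_id INTEGER NOT NULL REFERENCES vcf_file(id),
    UNIQUE (name, vcf_file_id)
);
CREATE INDEX sample_by_name ON sample(name);
"""


def open_catalogue(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def register_vcf(conn, dataset_id, vcf_path, sample_names):
    """Record one VCF file and the samples it contains; returns the file's row id."""
    with conn:  # commit on success, roll back on error
        conn.execute("INSERT OR IGNORE INTO dataset(id) VALUES (?)", (dataset_id,))
        cursor = conn.execute(
            "INSERT INTO vcf_file(dataset_id, path) VALUES (?, ?)",
            (dataset_id, vcf_path),
        )
        vcf_id = cursor.lastrowid
        conn.executemany(
            "INSERT INTO sample(name, vcf_file_id) VALUES (?, ?)",
            [(name, vcf_id) for name in sample_names],
        )
    return vcf_id


def samples_in_dataset(conn, dataset_id):
    rows = conn.execute(
        "SELECT DISTINCT s.name FROM sample s "
        "JOIN vcf_file v ON v.id = s.vcf_file_id "
        "WHERE v.dataset_id = ? ORDER BY s.name",
        (dataset_id,),
    ).fetchall()
    return [name for (name,) in rows]


def vcfs_for_samples(conn, dataset_id, sample_names):
    """Return the paths of the VCF files that hold any of the requested samples."""
    if not sample_names:
        return []
    placeholders = ",".join("?" for _ in sample_names)
    rows = conn.execute(
        "SELECT DISTINCT v.path FROM vcf_file v "
        "JOIN sample s ON s.vcf_file_id = v.id "
        f"WHERE v.dataset_id = ? AND s.name IN ({placeholders}) ORDER BY v.path",
        (dataset_id, *sample_names),
    ).fetchall()
    return [path for (path,) in rows]
```

A lookup such as `vcfs_for_samples` is what would let a getGenotypes request be routed to only the files that hold the requested samples. SQLite limits the number of bound parameters per statement, so very long sample lists would need to be queried in chunks.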
This work is now part of Genolink - https://github.com/plantinformatics/genolink.