plantinformatics/pretzel

lightweight genotype service

draft architecture :

components :

  • lightweight genotype service
  • Pretzel

lightweight genotype service

  • exposes a web API for Pretzel to use :
    getSamples(datasetId)
    getGenotypes(datasetId, samples, region, format)
  • backed by a selection of these storage mechanisms :
    DivBrowse
    database
    bcftools
    Germinate
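
A minimal Python sketch of how the two API functions could sit in front of interchangeable storage mechanisms. Only getSamples and getGenotypes and their parameters come from the draft above; the class names, the (chromosome, start, end) region tuple, the "json" / "tsv" format values and the in-memory backend are assumptions made for illustration, not the service's actual design.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple, Union

# Assumed shapes (not specified in the draft):
Region = Tuple[str, int, int]              # (chromosome, start, end), inclusive
Variant = Tuple[str, int, Dict[str, str]]  # (chromosome, position, {sample: genotype})


class GenotypeBackend(ABC):
    """One storage mechanism, e.g. DivBrowse, a database, bcftools or Germinate."""

    @abstractmethod
    def get_samples(self, dataset_id: str) -> List[str]:
        ...

    @abstractmethod
    def get_genotypes(self, dataset_id: str, samples: List[str],
                      region: Region, fmt: str) -> Union[str, List[dict]]:
        ...


class InMemoryBackend(GenotypeBackend):
    """Hypothetical stand-in backend that keeps everything in Python dicts."""

    def __init__(self, samples: Dict[str, List[str]],
                 variants: Dict[str, List[Variant]]):
        self._samples = samples
        self._variants = variants

    def get_samples(self, dataset_id):
        return list(self._samples[dataset_id])

    def get_genotypes(self, dataset_id, samples, region, fmt="json"):
        chrom, start, end = region
        # Keep variants inside the region, restricted to the requested samples.
        rows = [
            {"chrom": c, "pos": p, "genotypes": {s: calls[s] for s in samples}}
            for c, p, calls in self._variants[dataset_id]
            if c == chrom and start <= p <= end
        ]
        if fmt == "json":
            return rows
        if fmt == "tsv":
            header = "\t".join(["chrom", "pos", *samples])
            lines = [
                "\t".join([r["chrom"], str(r["pos"]),
                           *(r["genotypes"][s] for s in samples)])
                for r in rows
            ]
            return "\n".join([header, *lines])
        raise ValueError(f"unsupported format: {fmt}")


class GenotypeService:
    """Routes each dataset to the backend registered for it."""

    def __init__(self):
        self._backends: Dict[str, GenotypeBackend] = {}

    def register(self, dataset_id: str, backend: GenotypeBackend) -> None:
        self._backends[dataset_id] = backend

    def get_samples(self, dataset_id):
        return self._backends[dataset_id].get_samples(dataset_id)

    def get_genotypes(self, dataset_id, samples, region, fmt="json"):
        backend = self._backends[dataset_id]
        return backend.get_genotypes(dataset_id, samples, region, fmt)
```

With this shape, adding a storage mechanism means adding one GenotypeBackend subclass; the web API layer would only ever call GenotypeService.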

Pretzel

  • uses the lightweight genotype web API alongside the existing functions which access genotypes from bcftools and Germinate
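
One way to picture the new API sitting alongside the existing access paths is a per-dataset dispatch. The sketch below is Python for illustration only (Pretzel itself is not written in Python), and every name in it is hypothetical: the source labels, the fetcher signature and make_genotype_lookup are assumptions, not Pretzel functions.

```python
from typing import Callable, Dict, List, Tuple

Region = Tuple[str, int, int]  # assumed: (chromosome, start, end)
GenotypeFetcher = Callable[[str, List[str], Region], List[dict]]


def make_genotype_lookup(fetchers: Dict[str, GenotypeFetcher],
                         dataset_sources: Dict[str, str]) -> GenotypeFetcher:
    """Build a lookup that picks the access path configured for each dataset.

    fetchers maps a source label (e.g. "bcftools", "germinate",
    "genotype-service") to the function that reads genotypes from it;
    dataset_sources maps a dataset ID to its source label.
    """

    def lookup(dataset_id, samples, region):
        source = dataset_sources[dataset_id]
        try:
            fetch = fetchers[source]
        except KeyError:
            raise ValueError(
                f"no genotype access path for source {source!r}") from None
        return fetch(dataset_id, samples, region)

    return lookup
```

Existing bcftools and Germinate functions would be registered unchanged, with the web API client added as one more entry, so datasets can be moved to the service one at a time.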

tasks

selected search results :
https://academic.oup.com/database/article/doi/10.1093/database/baz096/5566651
"... HDF5 consistently performed best, in part because it can more efficiently work with both orientations of the allele matrix."
"To determine the ideal technology to serve as the backend of the GOBii-GDM, testing was performed using a large genotype-by-sequencing (GBS) dataset (15, 16, 19). Open-source RDBMS, PostgreSQL and MariaDB, a community-developed fork under the GNU GPL of MySQL, were used as a baseline for performance testing and compared with HDF5, MonetDB, Elasticsearch (17), Spark (18), and MongoDB."
https://github.com/gobiiproject/GOBii-System

https://www.intel.in/content/dam/www/public/us/en/documents/white-papers/genomics-storing-genome-data-paper.pdf
"... The low-level storage format enables faster and more efficient retrievals from disk compared to the use of files. Additionally, using libraries optimized for Intel® architecture to compress data on disk, GenomicsDB cumulatively achieves orders of magnitude improvement in performance compared to existing tools. In addition, the generalized multi-dimensional array model provides flexibility for GenomicsDB to be extended to other types of genome data. ... "

  • check whether one DivBrowse server provides access to only a single .vcf.gz file

The lightweight genotype service will need to track large numbers of samples, assays and VCF files, which will require a database.
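
As a rough illustration of that tracking role, the sketch below uses Python's built-in sqlite3 module to record which samples appear in which VCF files of a dataset. The schema, table names and helper functions are all assumptions for the example; the issue does not say which database or layout would be used, and assays are left out for brevity.

```python
import sqlite3

# Hypothetical catalogue schema: dataset -> VCF files -> samples.
SCHEMA = """
CREATE TABLE IF NOT EXISTS dataset (
    id      TEXT PRIMARY KEY,
    backend TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS vcf_file (
    id         INTEGER PRIMARY KEY,
    dataset_id TEXT NOT NULL REFERENCES dataset(id),
    path       TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS sample (
    vcf_file_id INTEGER NOT NULL REFERENCES vcf_file(id),
    name        TEXT NOT NULL,
    PRIMARY KEY (vcf_file_id, name)
);
"""


def open_catalogue(path=":memory:"):
    """Open (or create) the catalogue database and ensure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_vcf(conn, dataset_id, backend, path, samples):
    """Register one VCF file and its sample names under a dataset."""
    with conn:  # commit on success, roll back on error
        conn.execute(
            "INSERT OR IGNORE INTO dataset (id, backend) VALUES (?, ?)",
            (dataset_id, backend))
        cur = conn.execute(
            "INSERT INTO vcf_file (dataset_id, path) VALUES (?, ?)",
            (dataset_id, path))
        vcf_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO sample (vcf_file_id, name) VALUES (?, ?)",
            [(vcf_id, name) for name in samples])
    return vcf_id


def get_samples(conn, dataset_id):
    """All distinct sample names in a dataset, across its VCF files."""
    rows = conn.execute(
        "SELECT DISTINCT s.name FROM sample s "
        "JOIN vcf_file v ON v.id = s.vcf_file_id "
        "WHERE v.dataset_id = ? ORDER BY s.name",
        (dataset_id,))
    return [name for (name,) in rows]


def files_for_sample(conn, dataset_id, sample):
    """Paths of the VCF files in a dataset that contain the given sample."""
    rows = conn.execute(
        "SELECT v.path FROM vcf_file v "
        "JOIN sample s ON s.vcf_file_id = v.id "
        "WHERE v.dataset_id = ? AND s.name = ? ORDER BY v.path",
        (dataset_id, sample))
    return [path for (path,) in rows]
```

A catalogue like this would let getSamples(datasetId) be answered from the database alone, and let getGenotypes open only the VCF files that contain the requested samples.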

This work is now part of Genolink : https://github.com/plantinformatics/genolink