yhoogstrate/fastafs

further compressing 2bit


Using conventional compression methods we achieve at most ~13% reduction (86.8% of the original size, see below), and we may lose the ability to do indexed / random-access lookups.

724334154 (100.0%) hg19.fastafs
-------------------------------------
628402000 ( 86.8%) hg19.fastafs.7z (-mx9)     [no random access]
631057408 ( 87.1%) hg19.fastafs.lzma (--best/-9)
645791778 ( 89.2%) hg19.fastafs.zstd (-19)
686197076 ( 94.7%) hg19.fastafs.bz2 (--best/-9)
687943579 ( 94.9%) hg19.fastafs.gz (--best/-9)     [no random access]
????????? ( 94.7%) hg19.fastafs.bgz (--best/-9)

Compressing with recurrent substrings may be worth considering. This most likely requires applying that compression first and then writing the result down as 2bit.
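As a rough way to gauge how much there is to gain from recurrent substrings, one could simply count the most frequent k-mers first. A minimal sketch (not part of fastafs; function name and parameters hypothetical, and only practical on smallish inputs since it keeps all k-mer counts in memory):

```python
from collections import Counter

def top_recurrent_kmers(seq: str, k: int = 16, n: int = 10):
    """Count all k-mers in seq and return the n most frequent ones.

    Highly recurrent k-mers give a rough indication of how much a
    dictionary/repeat-based compression pass could gain.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return counts.most_common(n)
```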

If we do chunked compression, we can (re)balance each chunk on e.g. its [AC]/[GT] content. If the overall AC fraction of a chunk is below 0.5, storing the reverse complement brings it above 0.5. Rebalancing costs 1 bit of storage per chunk and increases the probability of finding similar repeats across chunks; see the sketch below.
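A minimal sketch of this rebalancing idea (hypothetical, assumes plain uppercase ACGT chunks without N-blocks). Since the reverse complement maps A/C to T/G and vice versa, a chunk with AC fraction f has a reverse complement with AC fraction 1-f, so one bit is enough to restore the original on decompression:

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def rebalance_chunk(chunk: str) -> tuple[str, int]:
    """Return (possibly reverse-complemented chunk, flag bit).

    flag == 1 means the reverse complement is stored; since reverse
    complementing is its own inverse, applying it once more on
    decompression restores the original chunk.
    """
    ac_fraction = sum(chunk.count(b) for b in "AC") / len(chunk)
    if ac_fraction < 0.5:
        return chunk.translate(COMPLEMENT)[::-1], 1
    return chunk, 0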

Chunking could be done at fixed length or at variable length (e.g. cutting in low-entropy regions, so that high-entropy regions are preserved intact). If the goal is, at some point in the future, to think about graph-based approaches, the software needs to be compatible with variable-length chunks anyway; a sketch follows below.
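One way to pick variable-length cut points is a sliding-window Shannon entropy, preferring boundaries where the entropy drops. A sketch under assumed parameters (window size, chunk length limits and the 1.0 bits/base threshold are all hypothetical and only for illustration):

```python
import math
from collections import Counter

def window_entropy(seq: str, start: int, size: int = 64) -> float:
    """Shannon entropy (bits/base) of a window; low values = low complexity."""
    counts = Counter(seq[start:start + size])
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def variable_length_boundaries(seq: str, min_len: int = 1024,
                               max_len: int = 65536, threshold: float = 1.0):
    """Yield (start, end) chunk boundaries, cutting in low-entropy windows.

    Falls back to a fixed max_len cut if no low-entropy window is found.
    """
    pos = 0
    while pos < len(seq):
        end = min(pos + max_len, len(seq))
        cut = end
        # scan for a low-entropy cut point after the minimum chunk length
        for i in range(pos + min_len, end, 64):
            if window_entropy(seq, i) < threshold:
                cut = i
                break
        yield pos, cut
        pos = cut
```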

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3868316/#B36

MFCompress: ~23% reduction compared to 2bit

Closing this for now. Compression ratios are approximately 5-10% unless extremely powerful and time-consuming algorithms are used. Compressing in a DNA-specific way may be more convenient.