further compressing 2bit
Using conventional compression methods we will achieve at most a ~14% reduction, and we may lose the ability to do indexed (random access) lookups.
724334154 (100.0%) hg19.fastafs
-------------------------------------
628402000 ( 86.8%) hg19.fastafs.7z (-mx9) [no random access]
631057408 ( 87.1%) hg19.fastafs.lzma (--best/-9)
645791778 ( 89.2%) hg19.fastafs.zstd (-19)
686197076 ( 94.7%) hg19.fastafs.bz2 (--best/-9)
687943579 ( 94.9%) hg19.fastafs.gz (--best/-9) [no random access]
????????? ( 94.7%) hg19.fastafs.bgz (--best/-9)
Compressing with recurrent substrings may be something to think of? This most likely requires first applying that compression and then writing the result down as 2bit.
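A toy sketch of what compressing on recurrent substrings could start from: counting k-mers over the decoded sequence and keeping the ones that recur often enough to be worth a dictionary entry. The choice of k and the recurrence cut-off are arbitrary assumptions, and `count_kmers` is not part of fastafs:

```cpp
#include <iostream>
#include <string>
#include <unordered_map>

// Count every k-mer in a decoded ACGT sequence; k-mers that recur often
// enough are candidates for a shared dictionary / back-reference scheme.
static std::unordered_map<std::string, size_t>
count_kmers(const std::string &seq, size_t k) {
    std::unordered_map<std::string, size_t> counts;
    for (size_t i = 0; i + k <= seq.size(); ++i) {
        ++counts[seq.substr(i, k)];
    }
    return counts;
}

int main() {
    std::string seq = "ACGTACGTACGTTTTTACGT";
    for (const auto &kv : count_kmers(seq, 4)) {
        if (kv.second >= 3) {                 // arbitrary recurrence cut-off
            std::cout << kv.first << " x" << kv.second << "\n";
        }
    }
}
```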
If we do chunked compression, we can (re)balance each chunk on e.g. its [AC]/[GT] content. If the overall AC fraction is below 0.5, storing the reverse complement brings it above 0.5. Rebalancing costs 1 bit of storage per chunk and increases the probability of finding similar repeats.
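A minimal sketch of that rebalancing decision, assuming chunks are handled as plain ACGT strings (2bit packing and N handling left out); `needs_flip` and `reverse_complement` are hypothetical helpers, not fastafs code:

```cpp
#include <algorithm>
#include <iostream>
#include <string>

// Reverse complement of a plain ACGT chunk.
static std::string reverse_complement(const std::string &chunk) {
    std::string rc(chunk.rbegin(), chunk.rend());
    for (char &c : rc) {
        switch (c) {
            case 'A': c = 'T'; break;
            case 'C': c = 'G'; break;
            case 'G': c = 'C'; break;
            case 'T': c = 'A'; break;
        }
    }
    return rc;
}

// Decide whether to store the reverse complement: if the A+C fraction is
// below 0.5, flipping the chunk brings it above 0.5, so chunks with similar
// content converge to the same orientation.
static bool needs_flip(const std::string &chunk) {
    size_t ac = std::count(chunk.begin(), chunk.end(), 'A') +
                std::count(chunk.begin(), chunk.end(), 'C');
    return 2 * ac < chunk.size();
}

int main() {
    std::string chunk = "GGTGTTGG";            // AC fraction = 0.0
    bool flipped = needs_flip(chunk);          // the 1-bit flag to store
    std::string stored = flipped ? reverse_complement(chunk) : chunk;
    std::cout << stored << " flag=" << flipped << "\n"; // CCAACACC flag=1
}
```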
Chunking could be done with fixed-length or variable-length chunks (e.g. only in low-entropy regions; high-entropy regions can be preserved as-is). If the goal is, at some point in the future, to move towards graph-based approaches, the software needs to be compatible with variable-length chunks anyway.
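A rough sketch of how variable-length boundaries could follow local entropy: extend a chunk while consecutive windows stay in the same entropy regime, and cut as soon as the regime changes. The window size and threshold here are arbitrary assumptions:

```cpp
#include <array>
#include <cmath>
#include <iostream>
#include <string>
#include <vector>

// Shannon entropy (bits per base) over a window of ACGT symbols.
static double window_entropy(const std::string &seq, size_t start, size_t len) {
    std::array<size_t, 4> counts{0, 0, 0, 0};
    for (size_t i = start; i < start + len && i < seq.size(); ++i) {
        switch (seq[i]) {
            case 'A': ++counts[0]; break;
            case 'C': ++counts[1]; break;
            case 'G': ++counts[2]; break;
            case 'T': ++counts[3]; break;
        }
    }
    size_t total = counts[0] + counts[1] + counts[2] + counts[3];
    double h = 0.0;
    for (size_t c : counts) {
        if (c == 0 || total == 0) continue;
        double p = static_cast<double>(c) / total;
        h -= p * std::log2(p);
    }
    return h;
}

// Emit chunk boundaries: keep extending while windows stay in the same
// entropy regime, cut when the regime flips, so high-entropy regions end
// up in their own (preserved) chunks.
static std::vector<size_t> chunk_boundaries(const std::string &seq,
                                            size_t window = 256,
                                            double threshold = 1.8) {
    std::vector<size_t> cuts;
    bool prev_low = window_entropy(seq, 0, window) < threshold;
    for (size_t pos = window; pos < seq.size(); pos += window) {
        bool low = window_entropy(seq, pos, window) < threshold;
        if (low != prev_low) {          // entropy regime changed: cut here
            cuts.push_back(pos);
            prev_low = low;
        }
    }
    cuts.push_back(seq.size());
    return cuts;
}

int main() {
    std::string seq(2048, 'A');                          // low-entropy run
    seq += "ACGTGCTAGCTAGGCTACGATCGTACGGATCC";           // mixed tail (toy)
    for (size_t cut : chunk_boundaries(seq)) std::cout << cut << " ";
    std::cout << "\n";                                   // prints: 2048 2080
}
```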
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3868316/#B36
MFCompress: 23% compared to 2bit
Closing this for now. Compression ratios are approximately 5-10% unless extremely powerful and time-consuming algorithms are used. Compressing in a DNA-specific way may be more convenient.