bsc-m03 is an experimental block-sorting compressor based on the M03 context-aware compression algorithm invented by Michael Maniscalco:
- Michael Maniscalco, "M03: A solution for context based blocksort (BWT) compression", 2004
- Jurgen Abel, "Post BWT stages of the Burrows-Wheeler compression algorithm", 2010
Moreover, the bsc-m03 compressor is a practical implementation of Compression via Substring Enumeration for byte-oriented sources:
- Danny Dube, Vincent Beaudoin, "Lossless Data Compression via Substring Enumeration", 2010
- Takahiro Ota, Hiroyoshi Morita, Akiko Manada, "Compression by Substring Enumeration with a Finite Alphabet Using Sorting", 2018
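
For readers unfamiliar with block sorting, the Burrows-Wheeler transform (BWT) that this family of compressors builds on can be sketched in a few lines. This is a naive illustration only: the function names `bwt` and `inverse_bwt` are hypothetical, nothing here is taken from the bsc-m03 source, and a practical compressor would use a suffix-array construction rather than sorting full rotations.

```python
def bwt(data: bytes) -> tuple[bytes, int]:
    """Naive BWT: sort all rotations, return last column and primary index."""
    n = len(data)
    if n == 0:
        return b"", 0
    # Sort rotation start positions by the rotation they denote (O(n^2 log n)).
    order = sorted(range(n), key=lambda i: data[i:] + data[:i])
    # The last column holds the byte that precedes each sorted rotation.
    last = bytes(data[(i - 1) % n] for i in order)
    # The primary index is the row holding the unrotated input.
    return last, order.index(0)


def inverse_bwt(last: bytes, primary: int) -> bytes:
    """Invert the naive BWT using a stable sort of the last column."""
    n = len(last)
    if n == 0:
        return b""
    # Stable sort maps each first-column row to its matching last-column row.
    nxt = sorted(range(n), key=lambda i: last[i])
    out = bytearray()
    row = primary
    for _ in range(n):
        row = nxt[row]
        out.append(last[row])
    return bytes(out)


transformed, primary = bwt(b"banana")
print(transformed, primary)  # b'nnbaaa' 3
print(inverse_bwt(transformed, primary))  # b'banana'
```

The transform groups bytes that share a following context, which is the property the post-BWT modelling stages exploit.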
Copyright (c) 2021-2023 Ilya Grebnov <ilya.grebnov@gmail.com>
bsc-m03 is released under the GNU General Public License.
- 2023-05-08 : Version 0.5.5
  - Fixed segmentation fault on Unix-based systems.
- 2022-11-27 : Version 0.5.0
  - Compression ratio improvements.
- 2022-11-20 : Version 0.4.0
  - Compression ratio improvements.
- 2022-11-10 : Version 0.3.0
  - Compression ratio improvements.
- 2022-01-08 : Version 0.2.1
  - Performance improvements.
- 2022-01-05 : Version 0.2
  - Memory usage improvements.
  - Compression ratio improvements.
- 2021-12-07 : Version 0.1.1 - 0.1.2
  - Minor compression ratio improvements.
- 2021-12-03 : Version 0.1.0
  - Initial public release of bsc-m03.

Calgary Corpus

| File name | Input size (bytes) | Output size (bytes) | Bits per symbol |
|:----------|-------------------:|--------------------:|----------------:|
| bib | 111261 | 24479 | 1.760 |
| book1 | 768771 | 203745 | 2.120 |
| book2 | 610856 | 138870 | 1.819 |
| geo | 102400 | 52465 | 4.099 |
| news | 377109 | 105621 | 2.241 |
| obj1 | 21504 | 9775 | 3.637 |
| obj2 | 246814 | 68003 | 2.204 |
| paper1 | 53161 | 14957 | 2.251 |
| paper2 | 82199 | 22594 | 2.199 |
| pic | 513216 | 44424 | 0.692 |
| progc | 39611 | 11257 | 2.274 |
| progl | 71646 | 13512 | 1.509 |
| progp | 49379 | 9248 | 1.498 |
| trans | 93695 | 15310 | 1.307 |


Canterbury Corpus

| File name | Input size (bytes) | Output size (bytes) | Bits per symbol |
|:----------|-------------------:|--------------------:|----------------:|
| alice29.txt | 152089 | 38562 | 2.028 |
| asyoulik.txt | 125179 | 35889 | 2.294 |
| cp.html | 24603 | 6872 | 2.235 |
| fields.c | 11150 | 2685 | 1.926 |
| grammar.lsp | 3721 | 1120 | 2.408 |
| kennedy.xls | 1029744 | 57440 | 0.446 |
| lcet10.txt | 426754 | 94823 | 1.778 |
| plrabn12.txt | 481861 | 129770 | 2.154 |
| ptt5 | 513216 | 44424 | 0.692 |
| sum | 38240 | 11426 | 2.390 |
| xargs.1 | 4227 | 1585 | 3.000 |


Large Canterbury Corpus

| File name | Input size (bytes) | Output size (bytes) | Bits per symbol |
|:----------|-------------------:|--------------------:|----------------:|
| bible.txt | 4047392 | 698395 | 1.380 |
| E.coli | 4638690 | 1126125 | 1.942 |
| world192.txt | 2473400 | 376173 | 1.217 |


Silesia Corpus

| File name | Input size (bytes) | Output size (bytes) | Bits per symbol |
|:----------|-------------------:|--------------------:|----------------:|
| dickens | 10192446 | 2199344 | 1.726 |
| mozilla | 51220480 | 15589159 | 2.435 |
| mr | 9970564 | 2156826 | 1.731 |
| nci | 33553445 | 1126386 | 0.269 |
| ooffice | 6152192 | 2503991 | 3.256 |
| osdb | 10085684 | 2223002 | 1.763 |
| reymont | 6627202 | 958772 | 1.157 |
| samba | 21606400 | 3794300 | 1.405 |
| sao | 7251944 | 4649723 | 5.129 |
| webster | 41458703 | 6253627 | 1.207 |
| xml | 5345280 | 357958 | 0.536 |
| x-ray | 8474240 | 3681388 | 3.475 |


Manzini Corpus

| File name | Input size (bytes) | Output size (bytes) | Bits per symbol |
|:----------|-------------------:|--------------------:|----------------:|
| chr22.dna | 34553758 | 7206269 | 1.668 |
| etext99 | 105277340 | 21422251 | 1.628 |
| gcc-3.0.tar | 86630400 | 10046880 | 0.928 |
| howto | 39422105 | 7504315 | 1.523 |
| jdk13c | 69728899 | 2612434 | 0.300 |
| linux-2.4.5.tar | 116254720 | 16351863 | 1.125 |
| rctail96 | 114711151 | 9707347 | 0.677 |
| rfc | 116421901 | 14871775 | 1.022 |
| sprot34.dat | 109617186 | 17157222 | 1.252 |
| w3c2 | 104201579 | 5598687 | 0.430 |

Maximum Compression Corpus

| File name | Input size (bytes) | Output size (bytes) | Bits per symbol |
|:----------|-------------------:|--------------------:|----------------:|
| A10.jpg | 842468 | 823533 | 7.820 |
| AcroRd32.exe | 3870784 | 1555832 | 3.216 |
| english.dic | 465211 | 145096 | 2.495 |
| FlashMX.pdf | 4526946 | 3712716 | 6.561 |
| FP.LOG | 20617071 | 502648 | 0.195 |
| MSO97.DLL | 3782416 | 1878076 | 3.972 |
| ohs.doc | 4168192 | 803171 | 1.542 |
| rafale.bmp | 4149414 | 745470 | 1.437 |
| vcfiu.hlp | 4121418 | 604165 | 1.173 |
| world95.txt | 2988578 | 442271 | 1.184 |

Large Text Compression Benchmark Corpus

| File name | Input size (bytes) | Output size (bytes) | Bits per symbol |
|:----------|-------------------:|--------------------:|----------------:|
| enwik8 | 100000000 | 20263925 | 1.621 |
| enwik9 | 1000000000 | 160018905 | 1.280 |


Pizza & Chilli Corpus

| File name | Input size (bytes) | Output size (bytes) | Bits per symbol |
|:----------|-------------------:|--------------------:|----------------:|
| dblp.xml | 296135874 | 21926695 | 0.592 |
| dna | 403927746 | 86414423 | 1.711 |
| english.1024MB | 1073741824 | 193810792 | 1.444 |
| pitches | 55832855 | 16984071 | 2.434 |
| proteins | 1184051855 | 304486803 | 2.057 |
| sources | 210866607 | 29749020 | 1.129 |

Pizza & Chilli Repetitive Corpus

| File name | Input size (bytes) | Output size (bytes) | Bits per symbol |
|:----------|-------------------:|--------------------:|----------------:|
| cere | 461286644 | 8576879 | 0.149 |
| coreutils | 205281778 | 4293243 | 0.167 |
| einstein.de.txt | 92758441 | 132286 | 0.011 |
| einstein.en.txt | 467626544 | 336029 | 0.006 |
| Escherichia_Coli | 112689515 | 7928044 | 0.563 |
| influenza | 154808555 | 1760692 | 0.091 |
| kernel | 257961616 | 2955825 | 0.092 |
| para | 429265758 | 10730998 | 0.200 |
| world_leaders | 46968181 | 518220 | 0.088 |
| fib41 | 267914296 | 83 | 0.000 |
| rs.13 | 216747218 | 86 | 0.000 |
| tm29 | 268435456 | 158 | 0.000 |
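
The "Bits per symbol" figures in the benchmark results appear to be the output size in bits divided by the input size in bytes, treating each input byte as one symbol. The short sketch below recomputes a few entries under that assumption; the helper name `bits_per_symbol` is hypothetical and is not part of bsc-m03.

```python
def bits_per_symbol(input_size: int, output_size: int) -> float:
    """Compressed bits spent per input byte (assumes one symbol = one byte)."""
    if input_size <= 0:
        raise ValueError("input_size must be positive")
    return output_size * 8 / input_size


# (file name, input size in bytes, output size in bytes) taken from the results.
rows = [
    ("bib", 111261, 24479),
    ("enwik9", 1000000000, 160018905),
    ("fib41", 267914296, 83),
]

for name, input_size, output_size in rows:
    print(f"{name}: {bits_per_symbol(input_size, output_size):.3f}")
# bib: 1.760
# enwik9: 1.280
# fib41: 0.000
```

Values such as 0.000 for the highly repetitive files are a rounding effect: the output is a few dozen bytes for hundreds of megabytes of input.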