storj-archived/kfs

Disable Compression

yuilleb opened this issue · 12 comments

It looks like compression might be enabled by default: https://github.com/Storj/kfs/blob/master/lib/s-bucket.js#L69

I would suggest disabling compression completely, as the vast majority of data stored is assumed to be encrypted first. Since encrypted data is essentially random, compressing it should actually increase storage costs, which wastes both CPU and disk space.

I don't have in-depth knowledge of how Snappy works, but I know that compression in general takes advantage of patterns in common media types. Encrypted data should have no such patterns, so you end up spending extra space on compression block headers with no compression achieved.

I could be wrong, so if someone can easily run a test to verify this, it would be worthwhile.

I've experienced the same, and have noticed little benefit from using snappy compression with obfuscated data (xor'd), for the reasons you've described, only around 3 percent savings.

I should add, if compression is desired, it should be done before encryption on the client I would think.

Agreed with both comments - this is likely a huge drain on I/O resources and CPU and should at least be disabled by default. Provide an option to enable it if you must.
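A minimal sketch of "off by default, with an option to enable it". The names `DEFAULT_DB_OPTIONS` and `resolveDbOptions` are hypothetical and not taken from kfs; the only real setting assumed here is the boolean `compression` open option that, as far as I know, leveldown accepts.

```javascript
// Hypothetical defaults for a LevelDB-backed bucket (not the actual kfs code).
// Assumption: the underlying store accepts a boolean `compression` option.
const DEFAULT_DB_OPTIONS = Object.freeze({
  compression: false // off by default: shards are assumed to be encrypted
});

// Hypothetical helper: caller-supplied options override the defaults,
// so compression can still be switched on explicitly.
function resolveDbOptions(userOptions = {}) {
  return Object.assign({}, DEFAULT_DB_OPTIONS, userOptions);
}

console.log(resolveDbOptions());                      // { compression: false }
console.log(resolveDbOptions({ compression: true })); // { compression: true }
```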

I would be happy to do it :). I was thinking it would be wise for someone to store encrypted data with the compress option on, then off, and compare the results. Theoretically encrypted data should never compress, but testing it would be best. If we just want to assume this is right, I'm happy to agree :D

I believe your theory is so well documented in history that testing is not required. Random bits only compress randomly... sometimes you win, sometimes you lose, but seldom by much either way. What does happen is a LOT of storage I/O and CPU time being spent.

Please create the pull request Bwy.

Oh... just one nasty thought... all the existing data is going to need to be uncompressed or it will be garbage when read. THAT IS NOT A REASON NOT TO DO THIS. Even if no tool is available, it would be better to restart from scratch now than later (and honestly, few have much more than a couple hundred GB by now).

Nah, leveldb stores data with a flag saying if it is compressed or not. When the data goes to be read it checks the flag. So any current data that is stored will simply need to be decompressed when read from disk. New data however, will be stored as is with the compressed flag not set.
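To illustrate why a per-record flag lets old and new data coexist, here is a toy sketch. It is not LevelDB's real on-disk format: the one-byte flag layout and the helper names are made up for illustration, and Node's built-in deflate stands in for Snappy.

```javascript
const zlib = require('zlib');

const RAW = 0;        // payload stored as-is
const COMPRESSED = 1; // payload stored deflate-compressed (stand-in for Snappy)

// Toy record layout (an assumption): one flag byte followed by the payload.
function encodeRecord(data, compress) {
  const payload = compress ? zlib.deflateSync(data) : data;
  return Buffer.concat([Buffer.from([compress ? COMPRESSED : RAW]), payload]);
}

// The reader looks only at the flag, so it does not care which
// setting was active when the record was written.
function decodeRecord(record) {
  const payload = record.subarray(1);
  return record[0] === COMPRESSED ? zlib.inflateSync(payload) : Buffer.from(payload);
}

const data = Buffer.from('shard contents '.repeat(100));
const oldRecord = encodeRecord(data, true);  // written while compression was on
const newRecord = encodeRecord(data, false); // written after it was turned off

console.log(decodeRecord(oldRecord).equals(data)); // true
console.log(decodeRecord(newRecord).equals(data)); // true
```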

Excellent news! Please post that pull request!

From the leveldb doc under Performance:

Compression

Each block is individually compressed before being written to persistent storage. Compression is on by default since the default compression method is very fast, and is automatically disabled for uncompressible data. In rare cases, applications may want to disable compression entirely, but should only do so if benchmarks show a performance improvement:

It sounds like once compression is applied the data is checked to see if it is in fact smaller. If it is not then the compressed data is discarded (that's likely what's happening when storing the encrypted shards).
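The "compress, then keep the result only if it is smaller" behaviour described above could look roughly like this. It is a sketch under assumptions: `maybeCompress` is a hypothetical helper, deflate stands in for Snappy, and the 12.5% savings threshold is what I recall LevelDB using rather than something stated in the quoted doc.

```javascript
const crypto = require('crypto');
const zlib = require('zlib');

// Hypothetical helper: compress a block and keep the compressed form only
// if it saves at least `minSavings` (a fraction) of the original size.
// The 0.125 default is an assumption, not taken from the quoted doc.
function maybeCompress(block, minSavings = 0.125) {
  const compressed = zlib.deflateSync(block);
  const keep = compressed.length < block.length * (1 - minSavings);
  // Note the CPU cost is paid either way; only the stored bytes differ.
  return { compressed: keep, payload: keep ? compressed : block };
}

const text = Buffer.from('the quick brown fox '.repeat(200));
const random = crypto.randomBytes(4000); // stand-in for an encrypted shard

console.log(maybeCompress(text).compressed);   // true
console.log(maybeCompress(random).compressed); // false
```

If this is how the store behaves, encrypted shards would not grow on disk, but every write would still pay for a compression attempt that gets thrown away.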

This makes me think, unless there is a log for this event where the compressed data is discarded, the only way to do a proper test is to take encrypted files and test compressing them with Snappy.

From Snappy README:

Typical compression ratios (based on the benchmark suite) are about 1.5-1.7x
for plain text, about 2-4x for HTML, and of course 1.0x for JPEGs, PNGs and
other already-compressed data. Similar numbers for zlib in its fastest mode
are 2.6-2.8x, 3-7x and 1.0x, respectively.

After learning a bit more about Snappy, I can say with pretty high confidence that it cannot compress encrypted data. It uses simple pattern replacement with less complexity than deflate or gzip, which is also why it cannot compress already-compressed data.

Anyway, if we want to test this, it's getting easier: we simply need to create a test program that outputs a Snappy-compressed file from an input, then run it over AES-256-CTR encrypted files.
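A rough sketch of such a test program using only Node built-ins. Snappy is not in Node's standard library, so gzip is used here as a stand-in (an assumption; a Snappy binding could be swapped in). The sample plaintext, the throwaway key and IV, and the helper names are all made up for illustration.

```javascript
const crypto = require('crypto');
const zlib = require('zlib');

// Encrypt a buffer with AES-256-CTR under a throwaway random key and IV.
function encryptAes256Ctr(plaintext) {
  const key = crypto.randomBytes(32);
  const iv = crypto.randomBytes(16);
  const cipher = crypto.createCipheriv('aes-256-ctr', key, iv);
  return Buffer.concat([cipher.update(plaintext), cipher.final()]);
}

// Compression ratio: original size divided by gzip'd size (>1 means it shrank).
function ratio(buf) {
  return buf.length / zlib.gzipSync(buf).length;
}

// Highly compressible sample input, so any difference is easy to see.
const plaintext = Buffer.from('All work and no play makes Jack a dull boy.\n'.repeat(20000));
const ciphertext = encryptAes256Ctr(plaintext);

console.log('plaintext ratio: ', ratio(plaintext).toFixed(2));
console.log('ciphertext ratio:', ratio(ciphertext).toFixed(2));
```

To run it over a real file, the sample buffer could be replaced with `require('fs').readFileSync(process.argv[2])`.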

Here's a simple test using gzip that should have better compression than Snappy:

# ls -l
-rwxr-xr-x 1 root root 95376577 Oct  5 23:04 bob.mkv

# openssl aes-256-ctr -in bob.mkv -out bob.enc
# ls -l
-rw-r--r-- 1 root root 95376593 Oct 19 01:52 bob.enc
-rwxr-xr-x 1 root root 95376577 Oct  5 23:04 bob.mkv

# hexdump bob.mkv | head
0000000 451a a3df 4293 8882 616d 7274 736f 616b
0000010 8742 0181 8542 0181 5318 6780 0001 0000
0000020 af05 9d54 4d11 749b 4dc0 8cbb ab53 1584
0000030 a949 5366 82ac 0310 bb4d 538c 84ab 5416
0000040 6bae ac53 1082 4d9f 8ebb ab53 1184 9b4d
0000050 5374 84ac af05 8550 bb4d 538e 84ab 531c
0000060 6bbb ac53 0584 4faf eca7 bb4f 0000 0000
0000070 0000 0000 0000 0000 0000 0000 0000 0000
*
0001020 0000 0000 0000 1500 a949 4066 2a96 b1d7

# hexdump bob.enc | head
0000000 6153 746c 6465 5f5f ab42 4ba9 baca c3a5
0000010 5f26 4883 dd4b dca7 8102 6491 2a62 1306
0000020 ad63 af45 8ed3 7a28 6712 e83b f8b2 6814
0000030 50dc 4e95 5b0f 67c5 43c3 203a 3638 8118
0000040 468d b8f1 1696 4682 b2fd f1ee 1794 0b48
0000050 4f06 3958 96a3 8820 85d0 68a2 f569 0804
0000060 903c 4633 a1e8 4eb3 17cf 0cc8 dedf 0ae9
0000070 f026 7d58 e079 7468 6df1 33f4 f34d f7e3
0000080 101a bc53 40c1 cc6c 91ee b01c a0dd 0bf0
0000090 3fc1 806d 2249 b1ef 2f7f 626d 8d34 bc2f

# gzip -k bob.mkv
# gzip -k bob.enc
# ls -l
-rw-r--r-- 1 root root 95376593 Oct 19 01:52 bob.enc
-rw-r--r-- 1 root root **95391809** Oct 19 01:52 bob.enc.gz
-rwxr-xr-x 1 root root 95376577 Oct  5 23:04 bob.mkv
-rwxr-xr-x 1 root root 94782053 Oct  5 23:04 bob.mkv.gz

# openssl aes-256-ctr -in bob.mkv.gz -out bob.mkv.gz.enc
# ls -l
-rw-r--r-- 1 root root 95376593 Oct 19 01:52 bob.enc
-rw-r--r-- 1 root root 95391809 Oct 19 01:52 bob.enc.gz
-rwxr-xr-x 1 root root 95376577 Oct  5 23:04 bob.mkv
-rwxr-xr-x 1 root root 94782053 Oct  5 23:04 bob.mkv.gz
-rw-r--r-- 1 root root **94782069** Oct 19 02:05 bob.mkv.gz.enc

When encrypting first, the compressed file ends up being 15,216 bytes larger than the encrypted file. This matches the expected result.

If we encrypt after compression we see that the resulting encrypted file is 594,508 bytes smaller than the original.
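For anyone who wants to double-check the arithmetic, the differences can be recomputed from the `ls -l` sizes above. The 16-byte gap between each plaintext and its encrypted copy is consistent with the `Salted__` marker plus 8-byte salt that `openssl enc` writes for password-based encryption (the first eight bytes of the `bob.enc` hexdump decode to `Salted__`); the variable names are just for illustration.

```javascript
// Sizes in bytes, copied from the ls -l output above.
const sizes = {
  mkv: 95376577,      // bob.mkv
  enc: 95376593,      // bob.enc
  encGz: 95391809,    // bob.enc.gz
  mkvGz: 94782053,    // bob.mkv.gz
  mkvGzEnc: 94782069  // bob.mkv.gz.enc
};

// Encrypt first, then gzip: the result grows.
const growthEncryptFirst = sizes.encGz - sizes.enc;
// gzip first, then encrypt: the result shrinks relative to the original.
const savedCompressFirst = sizes.mkv - sizes.mkvGzEnc;
// Fixed overhead added by the encryption step itself.
const encryptionHeader = sizes.enc - sizes.mkv;

console.log(growthEncryptFirst); // 15216
console.log(savedCompressFirst); // 594508
console.log(encryptionHeader);   // 16
```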

So my recommendation is to disable compression on the leveldb side, but to add compression to the storj-cli side before encrypting (and also making sure that the compressed data is actually smaller). If you add compression to the client side, then we can use more advanced compression like deflate or gzip.
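A sketch of what the client-side step might look like: compress, keep the compressed form only if it is actually smaller, then encrypt. This is not storj-cli code; `pack`, `unpack`, the one-byte flag, and the IV-prefixed layout are assumptions made for illustration, and AES-256-CTR alone provides no integrity check.

```javascript
const crypto = require('crypto');
const zlib = require('zlib');

const RAW = 0;  // body is the original data
const GZIP = 1; // body is gzip-compressed

// Hypothetical layout: iv (16 bytes) || AES-256-CTR(flag byte || body).
function pack(data, key) {
  const gz = zlib.gzipSync(data);
  const useGzip = gz.length < data.length; // only keep it if it really is smaller
  const body = Buffer.concat([Buffer.from([useGzip ? GZIP : RAW]), useGzip ? gz : data]);
  const iv = crypto.randomBytes(16);
  const cipher = crypto.createCipheriv('aes-256-ctr', key, iv);
  return Buffer.concat([iv, cipher.update(body), cipher.final()]);
}

function unpack(blob, key) {
  const decipher = crypto.createDecipheriv('aes-256-ctr', key, blob.subarray(0, 16));
  const body = Buffer.concat([decipher.update(blob.subarray(16)), decipher.final()]);
  const payload = body.subarray(1);
  return body[0] === GZIP ? zlib.gunzipSync(payload) : payload;
}

const key = crypto.randomBytes(32);
const text = Buffer.from('compressible log line\n'.repeat(5000));
const noise = crypto.randomBytes(4096); // e.g. media that is already compressed

console.log(pack(text, key).length < text.length);          // true
console.log(pack(noise, key).length === noise.length + 17); // true (iv + flag)
console.log(unpack(pack(text, key), key).equals(text));     // true
```

Keeping the flag inside the encrypted body means the storage side learns nothing about whether a shard was compressed.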

Agreed. Cycles are cheap for the sender doing one thing at a time, and this will enable them to better use their (typically) limited upload bandwidth.