Author: | Francesc Alted |
---|---|
Contact: | francesc@blosc.org |
URL: | http://www.blosc.org |
Blosc [1] is a high performance compressor optimized for binary data. It has been designed to transmit data to the processor cache faster than the traditional, non-compressed, direct memory fetch approach via a memcpy() OS call. Blosc is the first compressor (that I'm aware of) that is meant not only to reduce the size of large datasets on-disk or in-memory, but also to accelerate memory-bound computations.
It uses the blocking technique (as described in [2]) to reduce activity on the memory bus as much as possible. In short, this technique works by dividing datasets in blocks that are small enough to fit in caches of modern processors and perform compression / decompression there. It also leverages, if available, SIMD instructions (SSE2) and multi-threading capabilities of CPUs, in order to accelerate the compression / decompression process to a maximum.
Blosc is actually a metacompressor, that meaning that it can use a range of compression libraries for performing the actual compression/decompression. Right now, it comes with integrated support for BloscLZ (the original one), LZ4, LZ4HC, Snappy and Zlib. Blosc comes with full sources for all compressors, so in case it does not find the libraries installed in your system, it will compile from the included sources and they will be integrated into the Blosc library anyway. That means that you can trust in having all supported compressors integrated in Blosc in all supported platforms.
You can see some benchmarks about Blosc performance in [3]
Blosc is distributed using the MIT license, see LICENSES/BLOSC.txt for details.
[1] | http://www.blosc.org |
[2] | http://blosc.org/docs/StarvingCPUs-CISE-2010.pdf |
[3] | http://blosc.org/trac/wiki/SyntheticBenchmarks |
Blosc is not like other compressors: it should rather be called a meta-compressor. This is so because it can use different compressors and pre-conditioners (programs that generally improve compression ratio). At any rate, it can also be called a compressor because it happens that it already integrates one compressor and one pre-conditioner, so it can actually work like so.
Currently it comes with support of BloscLZ, a compressor heavily based on FastLZ (http://fastlz.org/), LZ4 and LZ4HC (http://fastcompression.blogspot.com.es/p/lz4.html), Snappy (https://github.com/google/snappy) and Zlib (http://www.zlib.net/), as well as a highly optimized (it can use SSE2 instructions, if available) Shuffle pre-conditioner (for info on how it works, see slide 17 of http://www.slideshare.net/PyData/blosc-py-data-2014). However, different compressors or pre-conditioners may be added in the future.
Blosc is in charge of coordinating the compressor and pre-conditioners so that they can leverage the blocking technique (described above) as well as multi-threaded execution (if several cores are available) automatically. That makes that every compressor and pre-conditioner will work at very high speeds, even if it was not initially designed for doing blocking or multi-threading.
Other advantages of Blosc are:
- Meant for binary data: can take advantage of the type size meta-information for improved compression ratio (using the integrated shuffle pre-conditioner).
- Small overhead on non-compressible data: only a maximum of (16 + 4 * nthreads) additional bytes over the source buffer length are needed to compress every input.
- Maximum destination length: contrarily to many other compressors, both compression and decompression routines have support for maximum size lengths for the destination buffer.
When taken together, all these features set Blosc apart from other similar solutions.
The minimal Blosc consists of the next files (in blosc/ directory):
blosc.h and blosc.c -- the main routines shuffle.h and shuffle.c -- the shuffle code blosclz.h and blosclz.c -- the blosclz compressor
Just add these files to your project in order to use Blosc. For information on compression and decompression routines, see blosc.h.
To compile using GCC (4.4 or higher recommended) on Unix:
$ gcc -O3 -msse2 -o myprog myprog.c blosc/*.c -Iblosc -lpthread
Using Windows and MINGW:
$ gcc -O3 -msse2 -o myprog myprog.c -Iblosc blosc\*.c
Using Windows and MSVC (2010 or higher recommended):
$ cl /Ox /Femyprog.exe /Iblosc myprog.c blosc\*.c
In the examples/ directory you can find more hints on how to link your app with Blosc.
A simple usage example is the benchmark in the bench/bench.c file. Another example for using Blosc as a generic HDF5 filter is in the hdf5/ directory.
I have not tried to compile this with compilers other than GCC, clang, MINGW, Intel ICC or MSVC yet. Please report your experiences with your own platforms.
The official cmake files (see below) for Blosc try hard to include support for LZ4, LZ4HC, Snappy, Zlib inside the Blosc library, so using them is just a matter of calling the appropriate blosc_set_compressor() API call. See an example here.
Having said this, it is also easy to use a minimalistic Blosc and just add the symbols HAVE_LZ4 (will include both LZ4 and LZ4HC), HAVE_SNAPPY and HAVE_ZLIB during compilation as well as the appropriate libraries. For example, for compiling with minimalistic Blosc but with added Zlib support do:
$ gcc -O3 -msse2 -o myprog myprog.c blosc/*.c -Iblosc -lpthread -DHAVE_ZLIB -lz
In the bench/ directory there a couple of Makefile files (one for UNIX and the other for MinGW) with more complete building examples, like switching between libraries or internal sources for the compressors.
Blosc can also be built, tested and installed using CMake. Although this procedure is a bit more invloved than the one described above, it is the most general because it allows to integrate other compressors than BloscLZ either from libraries or from internal sources. Hence, serious library developers should use this way.
The following procedure describes the "out of source" build.
Create the build directory and move into it:
$ mkdir build
$ cd build
Now run CMake configuration and optionally specify the installation directory (e.g. '/usr' or '/usr/local'):
$ cmake -DCMAKE_INSTALL_PREFIX=your_install_prefix_directory ..
CMake allows to configure Blosc in many different ways, like prefering internal or external sources for compressors or enabling/disabling them. Please note that configuration can also be performed using UI tools provided by CMake (ccmake or cmake-gui):
$ ccmake .. # run a curses-based interface
$ cmake-gui .. # run a graphical interface
Build, test and install Blosc:
$ make
$ make test
$ make install
The static and dynamic version of the Blosc library, together with header files, will be installed into the specified CMAKE_INSTALL_PREFIX.
Once you have compiled your Blosc library, you can easily link your apps with it as shown in the example/ directory.
The CMake files in Blosc are configured to automatically detect other compressors like LZ4, LZ4HC, Snappy or Zlib by default. So as long as the libraries and the header files for these libraries are accessible, these will be used by default. See an example here.
Note on Zlib: the library should be easily found on UNIX systems, although on Windows, you can help CMake to find it by setting the environment variable 'ZLIB_ROOT' to where zlib 'include' and 'lib' directories are. Also, make sure that Zlib DDL library is in your 'Windows' directory.
However, the full sources for LZ4, LZ4HC, Snappy and Zlib have been included in Blosc too. So, in general, you should not worry about not having (or CMake not finding) the libraries in your system because in this case, their sources will be automaticall compiled for you. That effectively means that you can be confident in having a complete support for all the supported compression libraries in all supported platforms.
If you want to force Blosc to use the included compression sources instead of trying to find the libraries in the system first, you can switch off the PREFER_EXTERNAL_COMPLIBS CMake option:
$ cmake -DPREFER_EXTERNAL_COMPLIBS=OFF ..
You can also disable support for some compression libraries:
$ cmake -DDEACTIVATE_SNAPPY=ON ..
If you run into compilation troubles when using Mac OSX, please make sure that you have installed the command line developer tools. You can always install them with:
$ xcode-select --install
Blosc has an official wrapper for Python. See:
https://github.com/Blosc/python-blosc
Blosc can be used from command line by using Bloscpack. See:
https://github.com/Blosc/bloscpack
For those that want to use Blosc as a filter in the HDF5 library, there is a sample implementation in the hdf5/ directory.
There is an official mailing list for Blosc at:
blosc@googlegroups.com http://groups.google.es/group/blosc
I'd like to thank the PyTables community that have collaborated in the exhaustive testing of Blosc. With an aggregate amount of more than 300 TB of different datasets compressed and decompressed successfully, I can say that Blosc is pretty safe now and ready for production purposes.
Other important contributions:
- Valentin Haenel did a terrific work implementing the support for the Snappy compression, fixing typos and improving docs and the plotting script.
- Thibault North, with ideas from Oscar Villellas, contributed a way to call Blosc from different threads in a safe way. Christopher Speller introduced contexts so that a global lock is not necessary anymore.
- The CMake support was initially contributed by Thibault North, and Antonio Valentino and Mark Wiebe made great enhancements to it.
- Christopher Speller also introduced the two new '_ctx' calls to avoid the use of the blosc_init() and blosc_destroy().
Enjoy data!