leveldb: A key-value store Authors: Sanjay Ghemawat (sanjay@google.com) and Jeff Dean (jeff@google.com) The original Google README is now README.GOOGLE. This repository contains the Google source code as modified to benefit the Riak environment. The typical Riak environment has two attributes that necessitate leveldb adjustments, both in options and code: - production servers: Riak often runs in heavy Internet environments: servers with many CPU cores, lots of memory, and 24x7 disk activity. Basho's leveldb takes advantage of the environment by adding hardware CRC calculation, increasing Bloom filter accuracy, and defaulting to integrity checking enabled. - multiple databases open: Riak opens 8 to 64 databases simultaneously. Google's leveldb supports this, but its background compaction thread can fall behind. leveldb will "stall" new user writes whenever the compaction thread gets too far behind. Basho's leveldb modification include multiple thread blocks that each contain prioritized threads for specific compaction activities. This repository follows the Basho standard for branch management as of November 28, 2013. The standard is found here: https://github.com/basho/riak/wiki/Basho-repository-management In summary, the "develop" branch contains the most recently reviewed engineering work. The "master" branch contains the most recently released work, i.e. distributed as part of a Riak release. Those wishing to truly savor the benefits of Basho's modifications need to initialize a new leveldb::Options structure similar to the following before each call to leveldb::DB::Open: leveldb::Options * options; options=new Leveldb::Options; options.filter_policy=leveldb::NewBloomFilterPolicy2(16); options.block_cache=leveldb::NewLRUCache(536870912); options.max_open_files=500; options.write_buffer_size=62914560; options.env=leveldb::Env::Default(); Memory usage by leveldb is largely controlled by three leveldb::Options variables: write_buffer_size, max_open_files, and size used to initial block_cache. - write_buffer_size: Riak randomizes the write_buffer_size across its multiple databases between 30Mbytes and 60Mbytes. Internal, leveldb could have two of these active for an open database. Google's default is 4Mbyte. Basho's leveldb has been tuned to the 30Mbyte to 60Mbyte range, but does not block larger values (though larger values are just untested). - max_open_files: Riak 1.4 simplified calculating the memory impact of max_open_files. Basho's leveldb now simply multiplies max_open_files by 4Mbytes to establish the file cache limit. The file cache is therefore really a limit on many files' metadata can fit. max_open_files was originally a file handle limit, but file metadata is now a much larger resource concern. The impact of Basho's larger table files and Google's recent addition of Bloom filters made the accounting switch necessary. Where server resources allow, the value of max_open_files should exceed the count of .sst table files within the database directory. Memory budgeting for this variable is more important to random read operations than the block_cache. - block_cache: The block cache holds data blocks that leveldb has recently retrieved from .sst table files. Any given block contains one or more complete key/value pairs. The cache speeds up repeat access to the same key and potential access to adjacent keys. Basho has tested this as high as 2Gbytes. (Note: the older leveldb released in Riak 1.2 had a bug that hurt performance with values above 8Mbytes … that was fixed in the Riak 1.3 release) So estimated memory usage of a database = write_buffer_size*2 + max_open_files*4194304 + (size in NewLRUCache initialization)