BLepers/KVell

some questions about the implementation


Hi,

It’s really excellent work. I am also working on a new KV store based on SPDK, with no commit log, fully asynchronous I/O, no GC, and a shared-nothing design. But while reading the paper and all the source code you have pushed, I found a few things puzzling.

  1. In Section 5.3, about the user page cache, the paper says that “but B tree data is allocated from an mmap-ed file and can be paged out by the kernel”, yet in the actual implementation the page cache is simply allocated with malloc in memory. So I am puzzled about whether the in-memory B tree index can really be reconstructed as efficiently as the evaluation shows, because every time the system restarts, KVell has to scan all the data on disk to rebuild the B tree index.

  2. As you say in Section 5.2, “Items larger than 4K have a timestamp header at the beginning of each block on disk”, but the implementation actually only handles items of at most 4K. In fact, when an item is larger than 4K, atomic writes cannot be guaranteed, since modern NVMe devices only provide block-granularity atomic write semantics. In that case, a commit log seems essential to guarantee atomicity and durability. So can a no-commit-log KV store be implemented when items are larger than 4K?

  3. No GC is an important design feature of KVell, but from the implementation, when many items of the same size are put into KVell, the corresponding slab file grows. When those items are later deleted, the space in that slab is never shrunk. So I think a GC is still needed to reclaim the deleted space.

  4. Each slab file stores fixed-size KV items, that is, when an item is smaller than the slab’s slot size, KVell has to pad it. So I am really wondering what the best slot-size setting is. The values used by KVell are “size_t slab_sizes[] = { 100, 128, 256, 400, 512, 1024, 1365, 2048, 4096 }”. Is there any deeper reason behind this design?

Looking forward to your kind reply.

Thanks, good questions :)

  1. Yes, indeed, the code on GitHub doesn't fully reflect that part of the paper. Two points here:

    1.1 We did some experiments with the B Tree using mmap to allocate memory for cells instead of the standard malloc/new. The memory is never allocated from our user page cache; we used a regular mmap (so it is allocated from the kernel directly). To do that, we just modified the memory allocator used by the B tree, but the code was too messy, so I didn't include it. (Performance is very bad anyway when the index doesn't fit in memory.) A rough sketch of this idea is included after this list.

    1.2. Even when the B Tree is mmaped and not malloced, we still have to scan the whole store to rebuild the index in case of a crash. The mmap is just there to tell the kernel that it is OK to swap out some pages, but we have no guarantee that the index data on disk is up to date. So, in all cases, we scan the store for recovery. The numbers reported in the paper correspond to the time it takes to scan the full store (100GB at 15GB/s = fast recovery, so there is no point in trying to keep an up-to-date version of the index on disk; it is way more costly than scanning the store :) ).

  2. If you use a header for each 4K block, you can guarantee atomicity using timestamps. E.g.:
    block1 = [rdt1, data]
    block2 = [rdt2, data]
    if(rdt1 == rdt2) the data is correct. You just have to reconstruct the item when reading (a sketch of this check is included after this list).

    Another option is to add the rdt at the beginning and end, and force the end to be written after the beginning. E.g., for an item that spans N blocks:
    write blocks 1-(N-1) [rdt, data...............]
    wait for the writes to be done
    write block N [...end of data, rdt]
    Then it is easy to check whether the whole item has been written by comparing the rdts in blocks 1 and N.

  3. No need for a GC. We could use a technique similar to what TCMalloc uses, e.g., count the number of free spots in each 2MB chunk of the file. When a chunk contains no items, it can be given back to the filesystem (there is a syscall to free a portion of a file). We can maintain the count in memory (1 int per 2MB is small). So a simple refcount is enough, no need for a GC. Obviously, in the worst case, it wastes a lot of space, but that case is extremely unlikely (it works for memory allocators, so there is no reason it wouldn't work here).

  4. The current values were chosen rather arbitrarily, but you can choose the values to minimize wasted space. The kernel does that for its slab allocator (the values are chosen to match sizes frequently used by the kernel).
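To illustrate 1.1, here is a minimal sketch of what redirecting the B tree's cell allocation from malloc to an mmap-backed arena could look like, assuming a file-backed mapping as described in the paper. The file path, arena size, and function names are illustrative and are not the ones used in the KVell code:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define INDEX_ARENA_FILE "/scratch/btree-cells.arena"  /* illustrative path */
#define ARENA_SIZE (1024UL * 1024 * 1024)              /* 1GB arena         */

static char *arena;
static size_t arena_off;

/* Map a file-backed arena; the kernel is then free to page cells out. */
static void btree_arena_init(void) {
    int fd = open(INDEX_ARENA_FILE, O_RDWR | O_CREAT, 0644);
    if (fd < 0 || ftruncate(fd, ARENA_SIZE) < 0) {
        perror("btree_arena_init");
        exit(1);
    }
    arena = mmap(NULL, ARENA_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (arena == MAP_FAILED) {
        perror("mmap");
        exit(1);
    }
    close(fd);
}

/* Drop-in replacement for the malloc used for B tree cells
 * (a simple bump allocator; freeing/recycling is omitted in this sketch). */
static void *btree_cell_alloc(size_t size) {
    size = (size + 63) & ~63UL;          /* keep cells 64B-aligned */
    if (arena_off + size > ARENA_SIZE)
        return NULL;                     /* arena exhausted */
    void *cell = arena + arena_off;
    arena_off += size;
    return cell;
}

Even with such an arena, recovery still rescans the store as described in 1.2; the mapping only lets the kernel evict index pages under memory pressure.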
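For point 2, a sketch of the recovery-time check for the first option (every 4K block of a large item carries the rdt), assuming an 8-byte timestamp at the start of each block; the struct and function names are illustrative, not KVell's:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096

/* Illustrative per-block header: each 4K block of a large item starts
 * with the rdt (timestamp) of the write that produced it. */
struct block_header {
    uint64_t rdt;
};

/* The multi-block write completed iff all blocks carry the same rdt;
 * a torn write leaves a mismatch and the item is discarded at recovery. */
static int item_is_consistent(const char *blocks, size_t nb_blocks) {
    uint64_t first, rdt;
    memcpy(&first, blocks, sizeof(first));
    for (size_t i = 1; i < nb_blocks; i++) {
        memcpy(&rdt, blocks + i * BLOCK_SIZE, sizeof(rdt));
        if (rdt != first)
            return 0;
    }
    return 1;
}

/* Reconstruct the item when reading by stripping the per-block headers. */
static size_t item_reassemble(const char *blocks, size_t nb_blocks, char *out) {
    size_t payload = BLOCK_SIZE - sizeof(struct block_header);
    for (size_t i = 0; i < nb_blocks; i++)
        memcpy(out + i * payload,
               blocks + i * BLOCK_SIZE + sizeof(struct block_header),
               payload);
    return nb_blocks * payload;
}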

Thanks for your reply,

  1. (1.2) Regarding the finding that “100GB at 15GB/s = fast recovery, so there is no point in trying to keep an up-to-date version of the index on disk; it is way more costly than scanning the store”: I notice that the max bandwidth is 1.6GB/s for the Amazon-8NVMe config, which is the setup KVell uses to test recovery performance. In the evaluation (Section 6.6), KVell takes 6.6s to scan the whole database, about 100GB in size; that is, the bandwidth is about 15GB/s. That is a little confusing, since it should take at least 62.5s to load the whole database into memory at the max disk bandwidth. One possible explanation is that the evaluation does not write the whole 100GB database to disk before you simulate a crash by killing the database in the middle of a YCSB A workload. However, the evaluation in the paper does not tell us how much data has actually been written to disk when recovery starts for each of the three test cases, and it may not guarantee that recovery starts from a database of the same (already written) size for all three tests. So may I know the actual recovery time for the whole 100GB database in KVell?

  2. For atomicity, a write should either succeed, meaning the old item is replaced by the new data, or fail, meaning the old item stays intact. In fact, neither of the options you mention seems to guarantee this property. E.g.:
    block1 = [rdt1, data]
    block2 = [rdt2, data]
    if(rdt1 == rdt2) data is correct. You just have to reconstruct the item when reading.
    A counterexample: a write of an item spanning 2 blocks may successfully flush the new data to block1 but fail to flush it to block2 because of a crash or a hardware fault; if the write is done in place, the old item is then damaged. In such a case, a commit log may be essential.

  3. Counting the number of free spots in a 2MB chunk of the file is a good idea. But as far as I can tell, removing a portion of a file is quite hard, and looks even costlier than compaction: to free a portion of a file, one has to load the whole file into memory, delete the portion, and then dump the new data into a new file. Even though the sparse-file feature is supported by most file systems, it is quite limited and cannot be used to free a portion of a file.

  1. The BW is ~2GB/s per disk so ~16GB/s in total. So it takes 6.25s to scan 100GB.

  2. For updates > 4KB you don't write in place. During recovery, you might have the 2 versions on disk, but you can use the rdt to keep only the newest one. Doing the updates out of place is basically free thanks to the in-memory freelist (we have a new version of KVell that handles that properly; we might release it at some point).

  3. FALLOC_FL_PUNCH_HOLE ;)
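For reference, a minimal sketch of the idea from point 3 above (a per-2MB free-slot count plus FALLOC_FL_PUNCH_HOLE when a chunk becomes empty); the counter array and function name are illustrative, not part of the released KVell code:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>

#define CHUNK_SIZE (2UL * 1024 * 1024)   /* reclaim granularity: 2MB */

/* Called when an item is deleted; chunk = item_offset / CHUNK_SIZE.
 * free_slots[] is the in-memory count of free spots per 2MB chunk. */
static void slab_slot_freed(int slab_fd, uint32_t *free_slots,
                            uint32_t slots_per_chunk, size_t chunk) {
    if (++free_slots[chunk] == slots_per_chunk) {
        /* The whole chunk is empty: give its blocks back to the filesystem
         * while keeping the file size (and slot offsets) unchanged. */
        if (fallocate(slab_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      (off_t)(chunk * CHUNK_SIZE), CHUNK_SIZE) < 0)
            perror("fallocate(FALLOC_FL_PUNCH_HOLE)");
    }
}

On filesystems that support hole punching (e.g. ext4, XFS), the punched range no longer consumes disk blocks and reads back as zeros, so no compaction or file rewrite is needed.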

That is great. Looking forward to the new version of KVell.
Thanks for your answers. :)