This research project is generously supported by the NSERC USRA, and supervised by Dr. Oana Balmau. It is built upon existing work on RocksDB developed by Facebook.
In our modern era of technology, the goal of most (if not all) online platforms is to provide a customized experience to their users, by suggesting relevant items to adjust to their preferences. These recommendation systems ingest enormous amounts of information at a very high rate, and require a datastore that can simultaneously provide:
- High write throughput to ingest the incoming user events.
- High read throughput and low read latency to efficiently merge new information with the old data.
- Persistent storage, since the amount of incoming data is large and is growing continuously.
It is achieving all three objectives simultaneously that is a challenge.
To address the challenge of designing an optimal recommendation system that satisfies the three requirements (high write throughput, high read throughput, persistent storage), this project will build upon Facebook's existing RocksDB, which uses log-structured merge trees. LSM trees are a popular data structure for write-optimized, persistent NoSQL datastores, but are not optimized for reads and provide little support for complex data structures like those in recommendation systems.
This project aims to augment the LSM tree that RocksDB uses, to support real-time recommendations on fresh data, and meet the three requirements of an effective modern recommendation system.
To simulate a recommendation system for a social media platform, there are four "maps" (distinguished using different column families in RocksDB), which represent the different key-value pairs that store the data generated during a benchmark:
Map | Key | Value* |
---|---|---|
1) Picture Annotations | ImageID | Set[Annotation] |
2) Picture User Time Series | UserID | Map{ImageID → Map{ActionID → (List[TimeStamp], Counter)}} |
3) User Annotation Time Series | UserID | Map{AnnotationID → Map{ActionID → (List[TimeStamp], Counter)}} |
4) User Annotation Scores | (UserID, AnnotationID) | Map{ActionID → Score} |
*In Phase 1, these maps are represented using RocksDB's Slice
structure.
The data of interest collected from the benchmark is:
- Average throughput of using all maps
- Average read/write latency of all maps
- 50th and 99th percentile read/write latency of all maps
The following is an example figure generated using the benchmark:
To augment the underlying LSM tree, RocksDB needs to be modified to support data structures other than just the Slice
. The benchmark will then be re-run using these new data structures to compare the read/write performance against that of using a Slice
.
Currently in progress!
The following table highlights the files that have been modified/created for this project and their purpose.
File | Status | Description |
---|---|---|
tools/db_bench_tool.cc |
Modified | Implemented custom benchmark to simulate social media workload. |
util/json_serializer.cc |
Created* | Created functions to stringify all four custom maps. |
util/zipf.cc ** |
Created* | Created functions to generate random number between specified range using Zipfian distribution. |
CMakeList.txt + src.mk |
Modified | Added created files to enable Makefile to target. |
* Corresponding header files have also been created for these files.
** Original author: Oana Balmau (link to code).