Blobkeeper is a high-performance, reliable distributed file storage system built on top of Netty and JGroups.
It's not yet production ready. Two components of the system are still unstable: the compaction service and Merkle tree replication. Both are being actively tested and, I believe, a stable 1.0 is not far off ;-)
User files are stored in blobs, which are pre-allocated as contiguous space on disk. Each blob file has an index stored in Cassandra. The index structure is pretty simple:
```sql
CREATE TABLE BlobIndex (
    id bigint,
    type int,
    created bigint,
    updated bigint,
    data blob,
    deleted boolean,
    disk int,
    part int,
    crc bigint,
    offset bigint,
    length bigint,
    PRIMARY KEY (id, type)
);
```
id - unique generated file id
type - additional field to group multiple files under a single id (can be used to store thumbnails)
data - metadata
disk and part - address a blob on a disk
offset and length - address a file in a blob
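To make the addressing scheme concrete, here is a minimal sketch of an index entry in Java. The class and field names mirror the table above, but the class itself and the on-disk path layout are illustrative assumptions, not Blobkeeper's actual API:

```java
// Sketch of one BlobIndex row: disk/part locate a blob file,
// offset/length locate the user file inside that blob.
public class IndexEntry {
    public final long id;         // unique generated file id
    public final int type;        // groups multiple files under one id
    public final int disk;        // which disk holds the blob
    public final int part;        // which blob file on that disk
    public final long offset;     // position of the file inside the blob
    public final long length;     // number of bytes to read
    public final boolean deleted;

    public IndexEntry(long id, int type, int disk, int part,
                      long offset, long length, boolean deleted) {
        this.id = id; this.type = type; this.disk = disk; this.part = part;
        this.offset = offset; this.length = length; this.deleted = deleted;
    }

    // Hypothetical on-disk layout: one directory per disk, one file per part.
    public String blobPath(String root) {
        return root + "/disk" + disk + "/" + part + ".blob";
    }
}
```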
The HTTP server handles requests. An upload request simply receives the file and puts it into a write queue; the file is written to disk later. The queue is polled by a few workers, and each disk has a single writer worker. A read request fetches the file index, reads the offset and length, and streams the bytes from a file channel to the network socket without copying.
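The single-writer-per-disk idea can be sketched with a blocking queue. This is a simplified model under assumed names, not the actual implementation; real code would append to the blob file and update the Cassandra index instead of collecting payloads in a list:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// One instance per disk: uploads are enqueued by request handlers and a
// single writer drains them, so writes to one blob are never interleaved.
public class DiskWriter {
    private final BlockingQueue<byte[]> queue = new LinkedBlockingQueue<>();
    public final List<byte[]> written = new ArrayList<>(); // stands in for the blob file

    public void enqueue(byte[] payload) { queue.add(payload); }

    // In real code this loop would run forever on a dedicated thread per disk;
    // here it drains a fixed number of items so the behavior is deterministic.
    public void drain(int n) throws InterruptedException {
        for (int i = 0; i < n; i++) {
            byte[] payload = queue.take(); // blocks until an upload arrives
            written.add(payload);          // real code: append to blob, update index
        }
    }
}
```

Because only one thread ever appends to a given blob, files land at strictly increasing offsets and no write-side locking is needed.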
The server supports replication. If the cluster has at least one slave, a file is copied to it after it has been written on the master. Replication is implemented on top of the JGroups framework.
Additionally, the server has a repair command. It is used to sync blobs between replicas when a new slave has been added or a disk has been replaced.
To reduce the traffic of the repair process, the server builds a Merkle tree on top of the index, then compares blobs and sends only the missing parts. You can find more information (in Russian) on my personal blog.
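A toy sketch of the Merkle comparison, assuming leaf hashes computed over ranges of the index (the hash function and tree shape here are illustrative; a real implementation would use a cryptographic hash):

```java
import java.util.ArrayList;
import java.util.List;

// Hash index ranges into a binary tree; if two replicas' subtree hashes
// match, the whole subtree is skipped, so only differing leaves are found.
public class MerkleDiff {
    // Combine child hashes into a parent hash (toy version).
    static long combine(long a, long b) { return 31L * a + b; }

    // Hash of the subtree covering leaves [lo, hi).
    static long root(long[] leaves, int lo, int hi) {
        if (hi - lo == 1) return leaves[lo];
        int mid = (lo + hi) / 2;
        return combine(root(leaves, lo, mid), root(leaves, mid, hi));
    }

    // Collect indices of leaves that differ between two replicas.
    static void diff(long[] a, long[] b, int lo, int hi, List<Integer> out) {
        if (root(a, lo, hi) == root(b, lo, hi)) return; // subtree matches, skip
        if (hi - lo == 1) { out.add(lo); return; }
        int mid = (lo + hi) / 2;
        diff(a, b, lo, mid, out);
        diff(a, b, mid, hi, out);
    }
}
```

Repair then only has to transfer the blob parts behind the differing leaves instead of streaming every blob across the network.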
The compaction algorithm is dead simple. It consists of a few independent steps with minimal cross-cluster synchronization:
- Find partitions with a significant percentage of deleted space.
- Move files one by one from the partition being compacted to a new one.
- Run a cluster-wide operation to physically delete the partition file from disk.
Surviving files are moved through the same writer queue as uploaded files, so there is still a single writer thread per active partition per disk.
Deleted files have a GC grace time, which is used to check for expiration. Once a deleted file has expired, it can never be restored, so its state will not change. Only expired files can be compacted.
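The eligibility rule boils down to one predicate. The 24-hour window and field names below are assumptions for illustration; the actual grace period is configurable:

```java
// Sketch of the gc-grace-time rule: a deleted file becomes eligible for
// compaction only after its grace period has elapsed, at which point it
// can never be restored.
public class GraceCheck {
    static final long GC_GRACE_MILLIS = 24L * 60 * 60 * 1000; // assumed 24h window

    // deleted: the index flag; deletedAt: when the file was marked deleted.
    static boolean compactable(boolean deleted, long deletedAt, long now) {
        return deleted && now - deletedAt >= GC_GRACE_MILLIS;
    }
}
```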
Rebalancing makes it possible to add disks to an existing cluster while keeping writes uniformly distributed.
The major steps of the balancing algorithm:
- Gather balancing information: how many partitions should be balanced on each disk.
- Choose the oldest partition from a disk being balanced and move its data; repeat across the cluster. When balancing of a partition starts, its state is set to REBALANCING; once the data has been moved, the state changes to DATA_MOVED.
- After step two, each node in the cluster has two copies of the data on different disks. The next step is the partition index update: the index entries of the moved partition's files are updated one by one. Once this completes, both the data and the index have been moved.
- The last step marks the old partition as DELETED so that the compaction process can finalize it.
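The steps above imply a small state machine per partition. The REBALANCING, DATA_MOVED, and DELETED states come from the text; the ACTIVE state and the exact transition rules are assumptions:

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;

// Partition lifecycle during rebalancing, sketched as allowed transitions.
public class PartitionStates {
    enum State { ACTIVE, REBALANCING, DATA_MOVED, DELETED }

    static final Map<State, EnumSet<State>> ALLOWED = new EnumMap<>(State.class);
    static {
        ALLOWED.put(State.ACTIVE, EnumSet.of(State.REBALANCING));      // balancing starts
        ALLOWED.put(State.REBALANCING, EnumSet.of(State.DATA_MOVED));  // data copied
        ALLOWED.put(State.DATA_MOVED, EnumSet.of(State.DELETED));      // index updated
        ALLOWED.put(State.DELETED, EnumSet.noneOf(State.class));       // compaction finalizes
    }

    static boolean canTransition(State from, State to) {
        return ALLOWED.get(from).contains(to);
    }
}
```

Encoding the lifecycle this way makes it cheap to reject out-of-order transitions if a node crashes mid-rebalance.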
It's possible to run multiple clusters to scale out writes. The file identifier format is inspired by Twitter's Snowflake and supports up to 1024 shards.
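A Snowflake-style id with a 10-bit shard field gives exactly 2^10 = 1024 shards. The bit layout below (41-bit timestamp, 10-bit shard, 12-bit per-millisecond sequence) mirrors Twitter's original split but is an assumption about Blobkeeper's actual format:

```java
// Sketch of a sortable 64-bit file id: high bits are the timestamp, so ids
// generated later compare greater; the shard field routes to a cluster.
public class FileId {
    static final long SHARD_BITS = 10, SEQ_BITS = 12;

    static long make(long timestampMillis, long shard, long seq) {
        return (timestampMillis << (SHARD_BITS + SEQ_BITS))
             | (shard << SEQ_BITS)
             | seq;
    }

    static long shardOf(long id)     { return (id >> SEQ_BITS) & ((1L << SHARD_BITS) - 1); }
    static long timestampOf(long id) { return id >>> (SHARD_BITS + SEQ_BITS); }
}
```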
An example nginx front-end that spreads GET requests across all nodes and sends everything else to a single writer node:

```nginx
upstream readers {
    server 172.16.41.1:7890;
    server 172.16.41.2:7890;
    server 172.16.41.3:7890;
}

upstream writers {
    server 172.16.41.1:7890;
}

server {
    listen *:80;
    server_name filez-storage.tld;

    location / {
        proxy_next_upstream http_502;

        if ($request_method ~* GET) {
            proxy_pass http://readers;
        }
        if ($request_method !~* GET) {
            proxy_pass http://writers;
        }
    }
}
```
A bunch of miscellaneous features:
- Authentication of individual files through auth tokens
- Delete/restore (restore is coming a bit later)
- Admin API: repair, refresh, balance, get master, get node list (TBD)
- Compaction (cleanup deleted files from disk) - experimental
- Rebalancing - experimental
- Index cache
- Configurable read-only disks
- Improve documentation and tests
- Dockerization
- Public benchmarks
- File expiration (can be useful for temp files)
- Smart data cache on SSD