- storage service that writes at disk speed and returns offsets to the stored values
- use it only if you are ok with losing some data (it does not fsync on write)
- supports: append, multiappend, modify, get, multiget, close, query, compact
- clients: go, java
run with docker: jackdoe/rochefort:2.5
docker run -e BIND=":8000" \
-e ROOT="/tmp/rochefort" \
-p 8000:8000 \
jackdoe/rochefort:2.5
- added 4 more bytes in the header
- the -buckets parameter is gone, so everything is appended in one file per namespace
you can migrate your data by doing:
oldServer.scan(namespace: ns) do |offset, v|
newServer.append(namespace:ns, data: v)
end
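the same migration sketched with the go client would look roughly like this (NewClient and the Scan callback signature are assumptions here, check the go client for the exact API):

// rough sketch only: NewClient, Scan and its callback signature are
// assumptions about the go client, not documented API
oldClient := NewClient("http://old-host:8000")
newClient := NewClient("http://new-host:8000")

err := oldClient.Scan(ns, func(offset uint64, data []byte) error {
	// re-append every scanned blob into the new server
	_, err := newClient.Set(&AppendInput{
		AppendPayload: []*Append{{Namespace: ns, Data: data}},
	})
	return err
})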
- moved get/multiget/append to protobuf
- moved delete/close to protobuf
- root: root directory; files will be created at root/{namespace or "default"}/append.raw
- bind: address to bind to (default :8000)
don't forget to mount the persisted root directory
$ go run main.go query.go input.pb.go -bind :8000 -root /tmp
2018/02/10 12:06:21 starting http server on :8000
....
res, err := r.Set(&AppendInput{
AppendPayload: []*Append{{
Namespace: ns,
Data: []byte("abc"),
AllocSize: 10, // so you can do inplace modification
Tags: []string{"a", "b", "c"}, // so you can search them
}, {
Namespace: ns,
Data: []byte("zxc"),
}},
})
you can always do in-place modifications to an object, and you can also reserve some space (AllocSize) so you can add more data to the same offset later
the searchable tags are sanitized: all non-alphanumeric characters (excluding _), i.e. [^a-zA-Z0-9_]+, are removed
passing the tags a,b,c will create postings lists in the namespace (a.postings, b.postings and c.postings); later you can query specific tags with /query
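for illustration, the sanitization boils down to something like this (a sketch that mirrors the documented behavior, not the actual server code):

import "regexp"

// strip everything that is not [a-zA-Z0-9_], as described above;
// this mirrors the documented behavior, it is not the server's code
var nonAlnum = regexp.MustCompile("[^a-zA-Z0-9_]+")

func sanitizeTag(tag string) string {
	return nonAlnum.ReplaceAllString(tag, "")
}

// sanitizeTag("hello-world!") == "helloworld"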
_, err = r.Set(&AppendInput{
ModifyPayload: []*Modify{{
Namespace: ns,
Offset: off,
Pos: 1,
Data: []byte("zxcv"),
}},
})
Modify changes the blob in place at the given position. For example, if we want to turn the 'abc' blob we appended at offset 0 into 'azz', we modify rochefort offset 0 with 'zz' from position 1. If you pass Pos: -1, it will append to the previous end of the blob.
in AppendInput you can mix modify and append commands
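for example, something like this (a sketch reusing the structs from the examples above):

// sketch: one AppendInput carrying both an append and a modify,
// using the AppendPayload/ModifyPayload fields shown above
_, err = r.Set(&AppendInput{
	AppendPayload: []*Append{{
		Namespace: ns,
		Data:      []byte("new blob"),
	}},
	ModifyPayload: []*Modify{{
		Namespace: ns,
		Offset:    off,
		Pos:       -1, // -1 appends to the previous end of the blob
		Data:      []byte("more"),
	}},
})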
fetched, err := r.Get(&GetInput{
GetPayload: []*Get{{
Namespace: "example",
Offset: offset1,
}, {
Namespace: "example,
Offset: offset12,
}},
})
the output is GetOutput, which is just an array of byte arrays: fetched[0] is the byte array holding the first blob and fetched[1] the second
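so you can just loop over the result, e.g.:

// print every fetched blob (uses the standard fmt package);
// assumes fetched is indexable like the array of byte arrays described above
for i, blob := range fetched {
	fmt.Printf("blob %d: %s\n", i, blob)
}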
you can also pass a "namespace" parameter, and this will create a different directory per namespace, for example
namespace: events_from_20171111
namespace: events_from_20171112
will create {root_directory}/events_from_20171111/... and {root_directory}/events_from_20171112/...
and then you simply delete the directories you don't need (after closing them)
Closes a namespace so it can be deleted (or you can directly delete it with DELETE)
header is 20 bytes
D: data length: 4 bytes
R: reserved: 8 bytes
A: allocSize: 4 bytes
C: crc32(length, time): 4 bytes
V: the stored value
DDDDRRRRRRRRAAAACCCCVVVVVVVVVVVVVVVVVVVV...DDDDRRRRRRRRAAAACCCCVVVVVV....
as you can see, the value is not included in the checksum; I am
checking only the header, because my usecase is quite ok with
missing/corrupted data, but it is not ok if a corrupted
header makes us allocate 10gb in output := make([]byte, dataLen)
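for reference, decoding one header from the layout above could look roughly like this (little endian is an assumption here, the byte order is only documented for the /scan output):

import "encoding/binary"

// sketch of one header as documented above (4+8+4+4 = 20 bytes);
// the little-endian byte order is an assumption, not documented
type header struct {
	DataLen   uint32 // D: data length
	Reserved  uint64 // R: reserved
	AllocSize uint32 // A: allocSize
	CRC       uint32 // C: crc32 over the header fields
}

func readHeader(buf []byte) header {
	return header{
		DataLen:   binary.LittleEndian.Uint32(buf[0:4]),
		Reserved:  binary.LittleEndian.Uint64(buf[4:12]),
		AllocSize: binary.LittleEndian.Uint32(buf[12:16]),
		CRC:       binary.LittleEndian.Uint32(buf[16:20]),
	}
}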
scans the file
$ curl http://localhost:8000/scan?namespace=someStoragePrefix > dump.txt
the format is [len: 4 bytes (little endian)][offset: 8 bytes (little endian)]data...[len][offset]data
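a minimal reader for that dump format could look like this (a sketch based only on the format described above):

import (
	"encoding/binary"
	"io"
)

// readDump reads the /scan output stream: [len][offset]data repeated,
// both header fields little endian as documented above
func readDump(r io.Reader, cb func(offset uint64, data []byte)) error {
	head := make([]byte, 12)
	for {
		if _, err := io.ReadFull(r, head); err != nil {
			if err == io.EOF {
				return nil // clean end of stream
			}
			return err
		}
		length := binary.LittleEndian.Uint32(head[0:4])
		offset := binary.LittleEndian.Uint64(head[4:12])
		data := make([]byte, length)
		if _, err := io.ReadFull(r, data); err != nil {
			return err
		}
		cb(offset, data)
	}
}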
you can search all tagged blobs, the dsl is fairly simple, post/get json blob to /query
- basic tag query
{"tag":"xyz"}
- basic OR query
{"or": [... subqueries ...]}
- basic AND query
{"and": [... subqueries ...]}
example:
curl -XGET -d '{"and":[{"tag":"c"},{"or":[{"tag":"b"},{"tag":"c"}]}]}' 'http://localhost:8000/query'
it spits out the output in the same format as /scan, so the result of a query can be very big, but it is streamed
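the same query issued from go (a sketch using net/http, bytes and log from the standard library; since the response streams in the /scan format, it can be fed to the same kind of reader):

// sketch: POST the query json to /query and stream the response
query := bytes.NewBufferString(`{"and":[{"tag":"c"},{"or":[{"tag":"b"},{"tag":"c"}]}]}`)
resp, err := http.Post("http://localhost:8000/query", "application/json", query)
if err != nil {
	log.Fatal(err)
}
defer resp.Body.Close()
// resp.Body is the [len][offset]data stream, same as a /scan dump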
MIT
Rochefort Trappistes 10 is my favorite beer, and I was drinking it while doing the initial implementation on a Sunday night
You can lose data on crash and there is no replication, so you have to orchestrate that yourself doing double writes or something.
The super simple architecture allows for all kinds of hacks to do backups/replication/sharding but you have to do those yourself.
My usecase is ok with losing some data, and we don't have money to pay for kafka+zk+monitoring(kafka, zk), nor time to learn how to optimize it for our quite big write and very big multi-read load.
Keep in mind that there is some not-invented-here syndrome involved in making it, but I use the service in production and it works very nicely :)
there is a race between reading and modifying from the client's perspective
- travis-ci
- perl client
- make c client that can be used from ruby/perl
- javadoc for the java client
- publish the java client on maven central