A light-weight server for querying immutable data stored in HFiles.
Quiver provides a low-latency server that can be run directly top of large, generated datasets, such as those commonly output by offline / "big data" pipelines, as an alternative to trying to bulk-load their output back into a mutable databases.
Quiver uses Thrift-RPC-over-HTTP to communicate - standard Thrift RPC calls are encoded and sent as HTTP request/response bodies. This allows any off-the-shelf http tools (eg HAProxy) to interact with this thrift-RPC traffic.
HFiles are designed to be written incrementally (metadata is in a "trailer" at the end rather than in a header, so you do not have to buffer the whole dataset while writing) -- and include an index, meaning they can be mapped into memory and used to answer queries quickly "as-is", without needing to build indexes at serving time.
The actual implementation of the HFile file interaction (finding keys, iterating, etc) is implemented in a standalone library.
Quiver can be started with collections to serve specified on the command-line, or can be configured to read a (json-encoded) list of collections to serve from a URL.
Collections are specified on the commandline as servedAs=path/to/file
pairs.
servedAs
is the name for the collection (clients include the name of the collection they are querying in requests).
Optionally, during generation, larger collections may be broken into smaller "shards" served by different quiver servers. In that case, servedAs
can be of the form collection/sharding-function/partition-count/partition
, in which case the "name" will be collection/partition
, while the other fields are used for service discovery registration.
./quiver demo=path/to/demo.hfile bigcol/mod_first_byte/40/4=path/to/bigcol/part4.hfile
Rather than individually specifying collection information and paths on the command line, the URL to a json document containing a list of collection configs can be provided. Each config should specify:
url
(string) the location of the hfile to serve.collection
(string)partition
(int)capacity
(int) total partitionsfunction
(string)ondemand
(bool) if the hfile does not need to be locked into memory.
The servedAs
name when loading configuration from json is always collection/partition
.
cmd/load
is a small utility to load test or compare two running quiver servers.
It asks the server for a random sample of keys and then generates traffic at the specified QPS, generating requests from those keys.
It can optionally send the same request to a second server and compare the results, printing a warning if they returned different responses and logging timing information, and can also write timings and diff counts to graphite.
You can go get github.com/foursquare/quiver/cmd/load
to install it in your $GOPATH/bin
, or can just go build
in the cmd/load
directory and run the resulting binary.
Quiver is tested and developed assuming Go 1.5 and $GOPATH/bin
is on PATH.
Foursquare engineers: add this to your bashrc
The c-snappy dependency uses -sdt=c++11
, but older gcc versions expect -std=c++0x
.
Override CXX
, CPP
and CC
to point to a newer GCC if your system version is too old.
Foursquare engineers: add this to your bashrc
Quiver uses goimports
formatting (which groups stdlib imports).
-
Install
goimports
:go get golang.org/x/tools/cmd/goimports
-
Fix up files:
goimports -w *.go
When making changes, keep an eye on performance changes, particularly memory allocation:
go test -bench . -benchtime 5s -test.benchmem
Obviously raw performance varies across machines, so run with and without your change (using git stash
or branches) and compare results. Include before-and-after output in commit messages for performance-related changes.
For reference the results on a relatively high-spec mid-2014 Macbook Pro:
BenchmarkUncompressed-8 500000 13806 ns/op 544 B/op 13 allocs/op
BenchmarkCompressed-8 300000 25106 ns/op 623 B/op 18 allocs/op
Use ./regen.sh
to re-generate the thrift code after making changes to quiver.thrift
.
- David Taylor (maintainer)