GumshoeDB

GumshoeDB is a high-performance in-memory OLAP database for quickly scouring hundreds of millions of analytics events and finding answers. Its design is similar to the Dremel and Druid data stores.

It's a work-in-progress and not yet documented for public consumption. Details about typical use cases and key design choices are coming soon.

To run:

  • Clone this repo to the appropriate location in your $GOPATH
  • go build github.com/philc/gumshoedb/server
  • ./server

This starts a GumshoeDB daemon at localhost:9000.

All interaction with GumshoeDB happens over HTTP. Test data can be imported into the database with a PUT request:

curl -iX PUT 'localhost:9000/insert' -d '
[{"clicks": 1, "age": 21, "name": "Starbuck", "country": "USA"},
 {"clicks": 2, "age": 22, "name": "Boomer", "country": "USA"},
 {"clicks": 3, "age": 23, "name": "Helo", "country": "CAN"},
 {"clicks": 4, "age": 24, "name": "Apollo", "country": "DEU"}
 ]'

Here's a representative query, assuming the columns "country", "age", and "clicks".

curl -iX POST localhost:9000/query -d '
{
  "aggregates":[
      {"type": "sum", "name": "clicks", "column": "clicks"},
      {"type": "average", "name": "avgAge", "column": "age"}],
  "filters": [{"type": ">", "column": "age", "value": 20},
              {"type": "in", "column": "country", "value": ["USA", "CAN"]}],
  "groupings": [{"column": "country", "name":"country"}]
}
'

Results:
{
  "duration": 0,
  "results":
    [{"avgAge": 21.5, "clicks": 3, "country": "USA", "rowCount": 2},
     {"avgAge": 23, "clicks": 3, "country": "CAN", "rowCount": 1}]
}

See DEVELOPING.md for how to navigate the code and make changes.

GumshoeDB is licensed under the MIT license.

Data Model

A GumshoeDB database is logically similar to a single table in a relational database: there is a schema, which specifies fixed columns, and there are many rows which follow that schema. There are two kinds of columns: dimensions and metrics. Dimensions are attributes of the data, and their values may be strings or numeric types. Metrics are numeric counts (floating-point or integer types).
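As an illustration of this model, here is a rough Go sketch; the type and field names are hypothetical, not GumshoeDB's own.

// Illustrative only: a logical view of a GumshoeDB schema. These are
// hypothetical types, not the ones used in the codebase.
type Column struct {
    Name string
    Type string // e.g. "uint8", "uint16", "float32"
}

type Schema struct {
    Dimensions []Column // attributes; string- or numeric-valued
    Metrics    []Column // numeric counts (floating-point or integer)
}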

When new data is inserted into GumshoeDB, each row must be associated with a timestamp. The data in a GumshoeDB database is grouped into sequential, non-overlapping time intervals (right now, one hour -- this will be configurable in the future). Queries will return data at this granularity.
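As a sketch of the grouping rule, assuming the current one-hour granularity, each row's timestamp maps to the start of its interval:

import "time"

// intervalStart truncates a timestamp to the start of its one-hour
// interval. A minimal sketch, not GumshoeDB's actual code; Truncate
// buckets against absolute time, so boundaries fall on UTC hour marks.
func intervalStart(t time.Time) time.Time {
    return t.Truncate(time.Hour)
}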

Implementation

Internally, each interval in a GumshoeDB database is divided into size-bounded segments. Each segment is a flat block of bytes. This is a []byte in memory which may be backed by a file using mmap(2).
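As a minimal sketch of that mechanism (Unix-only, and not GumshoeDB's actual segment code), a file can be mapped into memory as a []byte like so:

import (
    "os"
    "syscall"
)

// mapSegment maps a segment file into memory read-only using mmap(2).
func mapSegment(path string) ([]byte, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer f.Close()

    fi, err := f.Stat()
    if err != nil {
        return nil, err
    }
    // The returned []byte aliases the file contents; the mapping remains
    // valid after f is closed and must later be released with
    // syscall.Munmap.
    return syscall.Mmap(int(f.Fd()), 0, int(fi.Size()),
        syscall.PROT_READ, syscall.MAP_SHARED)
}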

As data is inserted, each input row is inserted into the appropriate interval based on its timestamp. The timestamp is not stored with the data. Within the interval, rows are collapsed together if possible. This means that if two rows in the same interval have the same value for each dimension, then they are combined into a single row by summing the values of the metrics.
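A sketch of this collapsing rule, using a map keyed by a canonical encoding of the row's dimension values (the types and key encoding here are hypothetical):

// row is a collapsed row: a count of the input rows merged into it,
// plus the summed metric values.
type row struct {
    count   uint8 // one byte, matching the row format described below
    metrics []int64
}

// collapse merges an input row into its interval: if a row with the
// same dimension values already exists, sum the metrics and bump the
// count; otherwise, start a new row with count 1.
func collapse(interval map[string]*row, dimKey string, metrics []int64) {
    if r, ok := interval[dimKey]; ok {
        r.count++
        for i, m := range metrics {
            r.metrics[i] += m
        }
        return
    }
    interval[dimKey] = &row{count: 1, metrics: append([]int64(nil), metrics...)}
}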

A segment is composed of many sequential rows. Each row is laid out using 8-, 16-, 32-, or 64-bit slots according to the type of each column. The first few bytes of each row contain a small amount of metadata:

  • The first byte is the count column: it indicates how many input rows have been collapsed together to form that row.
  • After that come a few nil bytes: a bit vector over the dimension columns, indicating for each column whether its value in this row is nil. The number of nil bytes can be computed from the number of dimension columns in the schema, as sketched below.
  • Next come the dimension columns. String-valued columns use a separate dictionary of string to int and so all values in the row are numeric.
  • Finally, the metric columns are laid out.

Here is a picture of an example row:

[00000011][00000010][...d0...][.......d1.......][...d2...][...m0...][...m1...][...m2...]
   count     nil    <-------- dimension columns ---------><------ metric columns ------>
<--------------------------------- table.RowSize --------------------------------------->

In this example, the schema has 3 dimension columns and 3 metric columns. All the column types are 8 bits except for d1, a 16-bit type. The count is 3, so three input rows with the same values for the dimension columns were collapsed together to form this row. There is only one nil byte, and the only set bit is at position 1, so dimension column 1 (d1) is the only nil column.
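The arithmetic behind the nil bytes, as a small sketch (the function names are illustrative). For the example row, nilBytes(3) returns 1, and nilBit on the byte 00000010 at position 1 returns true:

// nilBytes returns the size of the per-row nil bit vector: one bit per
// dimension column, rounded up to whole bytes.
func nilBytes(numDimensions int) int {
    return (numDimensions + 7) / 8
}

// nilBit reports whether dimension column i is nil, assuming
// least-significant-bit-first order within each byte, as in the
// example above.
func nilBit(nilVector []byte, i int) bool {
    return nilVector[i/8]&(1<<(uint(i)%8)) != 0
}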

Schema Changes

Use gumtool migrate to modify the schema of a GumshoeDB database without losing data. Sample usage:

go build github.com/philc/gumshoedb/gumtool
./gumtool migrate -old-db-path=db -new-db-config=new_config.toml

gumtool migrate can add columns, delete columns, or increase column sizes. The behavior when decreasing a column size (e.g. int32 -> int16) is currently undefined.

Distribution

There is a simple distribution mechanism: the router. Build it with

go build github.com/philc/gumshoedb/router

and run ./router -h for usage info. It needs a copy of the schema and a list of all the shards to which to route inserts and queries.

There is a tool, gumtool balance, which reads the databases on many shards over SSH, partitions them into a new set of smaller databases, and copies each one via scp to its destination shard. This is useful for rebalancing unevenly distributed shards or for consolidating data onto fewer shards. Build gumtool as above, then run gumtool balance -h for usage information.

Note that this requires gumtool to be present on every shard (see the -bindir arg of gumtool balance) because it uses another gumtool subcommand, gumtool merge, to do the final merge of multiple partial DBs into complete shard DBs.

Notes

There is a subtle bug where GumshoeDB cannot run in Vagrant using a database located in a shared directory on the host machine. See this Stack Overflow question for details.
