/nebula

A distributed block-based data storage and compute engine

Primary LanguageC++Apache License 2.0Apache-2.0

Nebula

Extremely-fast Interactive Real-Time Analytics

logo
Nebula is an extremely-fast end-to-end interactive big data analytics solution. Nebula is designed as a high-performance columnar data storage and tabular OLAP engine.

What is Nebula?

  • Extreme Fast Data Analytics System with Access Control.
  • Distributed Cache Tier for Tabular Data.
  • Build Unified Service API for any Sources (files, streaming, services, etc.)

Nebula can run on

  • Local box
  • VM cluster
  • Kubenettes

Documents of design, internals and stories will be shared at project docs (under-construction).

A Simple Story

To cut it short, check this story and see if it's interesting to you:

  1. You have some data, they are files on cloud storage, or streaming (eg. kafka), or even just a bunch of CSV files on Github, pretty much any source...
  2. You deploy a Nebula cluster, it is either single box, a cluster of a few EC2 machines on AWS, or just a Kubenettes cluster. Nebula doesn't have external dependencies, just a couple binaries (or docker images), so it's easy to maintain.
  3. Now, you add a table defintion in the cluster config file. Right away, you have these available:
    • A web UI where you can slice/dice your data for interactive visualization. You can also write script to transform your data in server side.
    • A REST API that you can build your own application with.

Highlight - visualize your real-time streaming from Kafka

demo

Sounds interesting? Continue to read...

Contents

Introduction

With Nebula, you could easily:

pretty chart 1

Transform column, aggregate by it with filters

  • To learn more, check out these resources:
  1. 10 minutes quick tutorial video

  2. Nebula presentation slides

Get Started

Run example instance with sample data on local

  1. clone the repo: git clone https://github.com/varchar-io/nebula.git
  2. build latest code: cd nebula && ./build.sh
  3. launch services: ./run.sh (the script uses test config file build/configs/test.yml which you can modify to connect your own data)
  4. explore nebula UI in browser if all up running: http://localhost:8088

Run example instance with sample data on Kubernetes

Deploy a single node k8s cluster on your local box. Assume your current kubectl points to the cluster, just run:

  • apply: kubectl apply -f deploy/k8s/nebula.yaml.
  • forward: kubectl port-forward nebula/server 8088:8088
  • explore: http://localhost:8088

Build Source & Test

The whole repo can be built on either MacOS or Linux. Just run ./build.sh.

After built the source successfully, the binaries can be found in ./build directory. Now you can launch a simple cluster of "server" + "one worker" + "web server" like this:

  • launch node: ~/nebula/build%./NodeServer
  • launch server: ~/nebula/build% ./NebulaServer --CLS_CONF configs/test.yml
  • launch web server: ~/nebula/src/service/http/nebula% NS_ADDR=localhost:9190 NODE_PORT=8081 node node.js`

If everything goes as expected, now you should be able to explore and query the sample data from its UI at http://localhost:8081

Birdeye View

Overview

Common Scenarios

As you may see in the previous section where we talk about running the sample locally. All of Nebula data tables are defined by a yaml section in the cluster config file, it's configs/test.yml in the example. Each of the use case demonstrated here is a table defintion, which you can copy to configs/test.yml and run it in that test. (Just replace the real values of your own data, such as schema and file location)

CASE-1: Static Data Analytics

Configure your data source from a permanent storage (file system) and run analytics on it. AWS S3, Azure Blob Storage are often used storage system with support of file formats like CSV, Parquet, ORC. These file formats and storage system are frequently used in modern big data ecosystems.

For example, this simple config will let you analyze a S3 data on Nebula

seattle.calls:
  retention:
    max-mb: 40000
    max-hr: 0
  schema: "ROW<cad:long, clearence:string, type:string, priority:int, init_type:string, final_type:string, queue_time:string, arrive_time:string, precinct:string, sector:string, beat:string>"
  data: s3
  loader: Swap
  source: s3://nebula/seattle_calls.10k.tsv
  backup: s3://nebula/n202/
  format: csv
  csv:
    hasHeader: true
    delimiter: ","
  time:
    type: column
    column: queue_time
    pattern: "%m/%d/%Y %H:%M:%S"

CASE-2: Realtime Data Analytics

Connect Nebula to real-time data source such as Kafka with data formats in thrift or JSON, and do real-time data analytics.

For example, this config section will ask Nebula to connect one Kafka topic for real time code profiling.

  k.pinterest-code:
    retention:
      max-mb: 200000
      max-hr: 48
    schema: "ROW<service:string, host:string, tag:string, lang:string, stack:string>"
    data: kafka
    loader: Streaming
    source: <brokers>
    backup: s3://nebula/n116/
    format: json
    kafka:
      topic: <topic>
    columns:
      service:
        dict: true
      host:
        dict: true
      tag:
        dict: true
      lang:
        dict: true
    time:
      # kafka will inject a time column when specified provided
      type: provided
    settings:
      batch: 500

CASE-3: Ephemeral Data Analytics

Define a template in Nebula, and load data through Nebula API to allow data live for specific period. Run analytics on Nebula to serve queries in this ephemeral data's life time.

CASE-4: Sparse Storage

Highly break down input data into huge small data cubes living in Nebula nodes, usually a simple predicate (filter) will massively prune dowm data to scan for super low latency in your analytics.

For exmaple, config internal partition leveraging sparse storage for super fast pruning for queries targeting specific dimension: (It also demonstrates how to set up column level access control: access group and access action for specific columns)

  nebula.test:
    retention:
      # max 10G RAM assigment
      max-mb: 10000
      # max 10 days assignment
      max-hr: 240
    schema: "ROW<id:int, event:string, tag:string, items:list<string>, flag:bool, value:tinyint>"
    data: custom
    loader: NebulaTest
    source: ""
    backup: s3://nebula/n100/
    format: none
    # NOTE: refernece only, column properties defined here will not take effect
    # because they are overwritten/decided by definition of TestTable.h
    columns:
      id:
        bloom_filter: true
      event:
        access:
          read:
            groups: ["nebula-users"]
            action: mask
      tag:
        partition:
          values: ["a", "b", "c"]
          chunk: 1
    time:
      type: static
      # get it from linux by "date +%s"
      value: 1565994194

SDK: Nebula Is Programmable

Through the great projecct QuickJS, Nebula is able to support full ES6 programing through its simple UI code editor. Below is an snippet code that generates a pie charts for your SQL-like query code in JS.

On the page top, the demo video shows how nebula client SDK is used and tables and charts are generated in milliseconds!

    // define an customized column
    const colx = () => nebula.column("value") % 20;
    nebula.apply("colx", nebula.Type.INT, colx);

    // get a data set from data set stored in HTTPS or S3
    nebula
        .source("nebula.test")
        .time("2020-08-16", "2020-08-26")
        .select("colx", count("id"))
        .where(and(gt("id", 5), eq("flag", true)))
        .sortby(nebula.Sort.DESC)
        .limit(10)
        .run();

Open source

Open source is wonderful - that is the reason we can build software and make innovations on top of others. Without these great open source projects, Nebula won't be possible:

Many others are used by Nebula:

  • common tools (glog/gflags/gtest/yaml-cpp/fmt/leveldb)
  • serde (msgpack/rapidjson/rdkafka)
  • algos(xxhash, roaring bitmap, zstd, lz4)
  • ...

Adoptions

Pinterest