A Modern Real-Time Data Processing & Analytics DBMS with Cloud-Native Architecture, written in Rust

Datafuse

Modern Real-Time Data Processing & Analytics DBMS with Cloud-Native Architecture

Datafuse is a real-time data processing and analytics DBMS with a cloud-native architecture, written in Rust. It is inspired by ClickHouse, powered by arrow-rs, and built to make it easy to power the Data Cloud.

Principles

  • Fearless

    • No data races, No unsafe, Minimize unhandled errors
  • High Performance

    • Everything is parallel
  • High Scalability

    • Everything is Distributed
  • High Reliability

    • Datafuse's primary design goal is reliability

Architecture

[Figure: Datafuse architecture diagram]

Performance

  • In-memory SIMD vector processing performance only
  • Dataset: 100,000,000,000 (100 billion) rows
  • Hardware: AMD Ryzen 7 PRO 4750U, 8 CPU cores, 16 threads
  • Rust: rustc 1.53.0-nightly (673d0db5e 2021-03-23)
  • Built with link-time optimization and CPU-specific instructions
  • ClickHouse server version 21.4.6 revision 54447

| Query | FuseQuery (v0.4.1) | ClickHouse (v21.4.6) |
|---|---|---|
| SELECT avg(number) FROM numbers_mt(100000000000) | 3.87 s (25.83 billion rows/s, 206.79 GB/s) | ×1.6 slow, 6.04 s (16.57 billion rows/s, 132.52 GB/s) |
| SELECT sum(number) FROM numbers_mt(100000000000) | 4.86 s (20.57 billion rows/s, 164.70 GB/s) | ×1.2 slow, 5.90 s (16.95 billion rows/s, 135.62 GB/s) |
| SELECT min(number) FROM numbers_mt(100000000000) | 5.61 s (17.82 billion rows/s, 142.65 GB/s) | ×2.3 slow, 13.05 s (7.66 billion rows/s, 61.26 GB/s) |
| SELECT max(number) FROM numbers_mt(100000000000) | 5.61 s (17.82 billion rows/s, 142.67 GB/s) | ×2.5 slow, 14.07 s (7.11 billion rows/s, 56.86 GB/s) |
| SELECT count(number) FROM numbers_mt(100000000000) | 3.12 s (32.03 billion rows/s, 256.48 GB/s) | ×1.2 slow, 3.71 s (26.93 billion rows/s, 215.43 GB/s) |
| SELECT sum(number+number+number) FROM numbers_mt(100000000000) | 17.85 s (5.60 billion rows/s, 44.85 GB/s) | ×16.9 slow, 233.71 s (427.87 million rows/s, 3.42 GB/s) |
| SELECT sum(number) / count(number) FROM numbers_mt(100000000000) | 4.02 s (24.86 billion rows/s, 199.10 GB/s) | ×2.4 slow, 9.70 s (10.31 billion rows/s, 82.52 GB/s) |
| SELECT sum(number) / count(number), max(number), min(number) FROM numbers_mt(100000000000) | 9.60 s (10.41 billion rows/s, 83.38 GB/s) | ×3.4 slow, 32.87 s (3.04 billion rows/s, 24.34 GB/s) |
| SELECT number FROM numbers_mt(10000000000) ORDER BY number DESC LIMIT 1000 | 5.34 s (1.87 billion rows/s, 14.99 GB/s) | ×2.6 slow, 13.95 s (716.62 million rows/s, 5.73 GB/s) |
| SELECT max(number),sum(number) FROM numbers_mt(1000000000) GROUP BY number % 3, number % 4, number % 5 | 9.03 s (110.71 million rows/s, 886.50 MB/s) | ×3.5 fast, 2.60 s (385.28 million rows/s, 3.08 GB/s) |
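
The throughput figures in the benchmark follow from the row count and the elapsed time. The sketch below shows that arithmetic, assuming each `number` value is an unsigned 64-bit integer (8 bytes) and that 1 GB means 10^9 bytes. The function names are hypothetical and are not part of Datafuse. Because the published timings are rounded, the results only approximately match the reported values.

```rust
/// Assumption: each `number` value occupies 8 bytes (an unsigned 64-bit integer).
const BYTES_PER_ROW: f64 = 8.0;

/// Rows processed per second, expressed in billions.
fn billion_rows_per_sec(rows: u64, elapsed_secs: f64) -> f64 {
    rows as f64 / elapsed_secs / 1e9
}

/// Scan bandwidth in GB/s, with 1 GB taken as 10^9 bytes.
fn gb_per_sec(rows: u64, elapsed_secs: f64) -> f64 {
    rows as f64 * BYTES_PER_ROW / elapsed_secs / 1e9
}

/// How many times longer the slower run took than the faster one.
fn slowdown(slower_secs: f64, faster_secs: f64) -> f64 {
    slower_secs / faster_secs
}
```

For the `avg(number)` query, `billion_rows_per_sec(100_000_000_000, 3.87)` gives roughly 25.84 and `slowdown(6.04, 3.87)` gives roughly 1.56, in line with the reported 25.83 billion rows/s and ×1.6.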

Note:

  • ClickHouse system.numbers_mt uses 16-way parallel processing (gist)
  • FuseQuery system.numbers_mt uses 16-way parallel processing (gist)
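
To illustrate what N-way parallel processing over a generated number range can look like, here is a minimal sketch using scoped threads from the Rust standard library. It is not FuseQuery's implementation: the function `parallel_sum` and its partitioning scheme are assumptions made for illustration only.

```rust
use std::thread;

/// Sum the integers in `0..n` using `ways` worker threads.
/// Hypothetical helper; not taken from the Datafuse code base.
fn parallel_sum(n: u64, ways: u64) -> u128 {
    assert!(ways > 0, "need at least one worker");
    let chunk = n / ways;
    thread::scope(|s| {
        // Give each worker a contiguous, non-overlapping slice of the range.
        let handles: Vec<_> = (0..ways)
            .map(|i| {
                let start = i * chunk;
                // The last worker also takes the remainder of the division.
                let end = if i == ways - 1 { n } else { start + chunk };
                s.spawn(move || (start..end).map(u128::from).sum::<u128>())
            })
            .collect();
        // Merge the partial sums once every worker has finished.
        handles
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .sum::<u128>()
    })
}
```

Calling `parallel_sum(1_000_000, 16)` splits the range into 16 partitions, sums them independently, and merges the partial results. The accumulator is `u128` so that large ranges do not overflow.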

Status

General

  • SQL Parser
  • Query Planner
  • Query Optimizer
  • Predicate Push Down
  • Limit Push Down
  • Projection Push Down
  • Type coercion
  • Parallel Query Execution
  • Distributed Query Execution
  • Shuffle Hash GroupBy
  • Merge-Sort OrderBy
  • Joins (WIP)
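
As a rough illustration of one item in the list above, limit push down, the sketch below rewrites a toy logical plan so that a `Limit` moves toward the scan wherever that cannot change the result. The `Plan` enum and both functions are hypothetical and much simpler than a real planner; they do not reflect Datafuse's actual optimizer types.

```rust
/// A toy logical plan; these node names are illustrative, not Datafuse types.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan { table: String, limit: Option<u64> },
    Filter { predicate: String, input: Box<Plan> },
    Projection { columns: Vec<String>, input: Box<Plan> },
    Limit { n: u64, input: Box<Plan> },
}

/// Walk the plan top-down and push every `Limit` as far down as is safe.
fn push_down_limit(plan: Plan) -> Plan {
    match plan {
        Plan::Limit { n, input } => push_limit_into(n, *input),
        Plan::Filter { predicate, input } => Plan::Filter {
            predicate,
            input: Box::new(push_down_limit(*input)),
        },
        Plan::Projection { columns, input } => Plan::Projection {
            columns,
            input: Box::new(push_down_limit(*input)),
        },
        scan @ Plan::Scan { .. } => scan,
    }
}

/// Push a limit of `n` rows into `plan`.
fn push_limit_into(n: u64, plan: Plan) -> Plan {
    match plan {
        // A projection keeps the row count, so the limit can move below it.
        Plan::Projection { columns, input } => Plan::Projection {
            columns,
            input: Box::new(push_limit_into(n, *input)),
        },
        // The scan absorbs the limit, keeping the smaller bound if it already has one.
        Plan::Scan { table, limit } => Plan::Scan {
            table,
            limit: Some(limit.map_or(n, |l| l.min(n))),
        },
        // Two stacked limits collapse into the smaller one.
        Plan::Limit { n: inner, input } => push_limit_into(n.min(inner), *input),
        // A filter drops rows, so the limit has to stay above it.
        Plan::Filter { predicate, input } => Plan::Limit {
            n,
            input: Box::new(Plan::Filter {
                predicate,
                input: Box::new(push_down_limit(*input)),
            }),
        },
    }
}
```

With this sketch, a `Limit` over a `Projection` over a `Scan` becomes a `Projection` over a `Scan` that carries the limit, while a `Limit` directly above a `Filter` is left in place.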

SQL Support

  • Projection
  • Filter (WHERE)
  • Limit
  • Aggregate Functions
  • Scalar Functions
  • UDF Functions
  • SubQueries
  • Sorting
  • Joins (WIP)
  • Window (TODO)
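
To make the aggregate and grouping support more concrete, here is a minimal single-threaded sketch of a hash aggregation that computes `max(number)` and `sum(number)` grouped by `number % 3, number % 4, number % 5`, the same shape as the GROUP BY query in the Performance section. The `MaxSum` struct and `group_max_sum` function are hypothetical names chosen for illustration, and the sketch says nothing about how Datafuse executes such a query internally.

```rust
use std::collections::HashMap;

/// Running aggregate state for one group (illustrative, not a Datafuse type).
#[derive(Debug, Clone, Copy, PartialEq)]
struct MaxSum {
    max: u64,
    sum: u64,
}

/// Compute max(number) and sum(number) over `0..n`,
/// grouped by (number % 3, number % 4, number % 5).
fn group_max_sum(n: u64) -> HashMap<(u64, u64, u64), MaxSum> {
    let mut groups: HashMap<(u64, u64, u64), MaxSum> = HashMap::new();
    for number in 0..n {
        let key = (number % 3, number % 4, number % 5);
        // Create the group's state on first sight, then fold the row into it.
        let state = groups.entry(key).or_insert(MaxSum { max: number, sum: 0 });
        state.max = state.max.max(number);
        state.sum += number;
    }
    groups
}
```

Since 3, 4, and 5 are pairwise coprime, there are at most 60 distinct groups. For example, `group_max_sum(120)` places 0 and 60 in group `(0, 0, 0)`, giving a max of 60 and a sum of 60.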

Getting Started

Roadmap

Datafuse is currently in alpha and is not ready for production use. See Roadmap 2021.

Contributing

License

Datafuse is licensed under Apache 2.0.