A high-performance, production-ready SQL query engine built in Rust with Apache Arrow for vectorized execution.
- High Performance: Vectorized execution using Apache Arrow for maximum throughput
- SQL Support: Comprehensive SQL syntax including SELECT, WHERE, GROUP BY, ORDER BY, LIMIT, and JOINs
- JOIN Operations: Full support for INNER, LEFT, RIGHT, FULL OUTER, and CROSS JOINs with table aliases
- Aggregate Functions: COUNT, SUM, AVG, MIN, MAX with GROUP BY support
- Multiple Data Sources: CSV, Parquet, and in-memory tables
- Query Optimization: Predicate pushdown and logical plan optimization
- Interactive CLI: Full-featured REPL with syntax highlighting and history
- Modular Architecture: Clean workspace structure with separated concerns
- Production Ready: Optimized compilation, comprehensive error handling
- Installation
- Quick Start
- Architecture
- CLI Usage
- SQL Features
- Examples
- Performance
- Development
- Contributing
- License
- Rust 1.70 or higher
- Cargo (comes with Rust)
# Clone the repository
git clone https://github.com/AarambhDevHub/query-engine.git
cd query-engine
# Build release version
cargo build --release
# Install CLI globally
cargo install --path crates/query-cli
# Run the CLI
qe
# Development build
cargo build
# Optimized release build
cargo build --release
# Run tests
cargo test
# Run benchmarks
cargo bench
# Start the REPL
qe
# Or with database path
qe repl --db-path ./mydata
# Query a CSV file
qe query \
--sql "SELECT name, age FROM users WHERE age > 25" \
--table users \
--file data/users.csv
# Query with JOIN
qe query \
--sql "SELECT e.name, d.dept_name FROM employees e JOIN departments d ON e.dept_id = d.id" \
--table employees \
--file data/employees.csv
# Inside REPL
qe> .load csv data/employees.csv employees
qe> .load csv data/departments.csv departments
qe> SELECT e.name, d.dept_name FROM employees e JOIN departments d ON e.dept_id = d.dept_id;
qe> .describe employees
Query Engine follows a modular, layered architecture:
┌──────────────────────────────────────────────┐
│                  CLI / REPL                  │
│               (User Interface)               │
└──────────────────────────────────────────────┘
                       │
┌──────────────────────────────────────────────┐
│                Query Executor                │
│          (Physical Plan Execution)           │
└──────────────────────────────────────────────┘
                       │
┌──────────────────────────────────────────────┐
│                Query Planner                 │
│        (Logical Plan + Optimization)         │
└──────────────────────────────────────────────┘
                       │
┌──────────────────────────────────────────────┐
│                 Query Parser                 │
│                 (SQL → AST)                  │
└──────────────────────────────────────────────┘
                       │
┌──────────────────────────────────────────────┐
│                 Data Sources                 │
│            (CSV, Parquet, Memory)            │
└──────────────────────────────────────────────┘
query-engine/
├── crates/
│   ├── query-core/       # Core types and errors
│   ├── query-parser/     # SQL lexer and parser
│   ├── query-planner/    # Logical planning and optimization
│   ├── query-executor/   # Physical execution engine
│   ├── query-storage/    # Data source implementations
│   └── query-cli/        # Command-line interface
├── examples-package/     # Usage examples
└── Cargo.toml            # Workspace configuration
qe> # Interactive prompt
# Data Management
.load csv <path> [name] # Load CSV file
.load parquet <path> [name] # Load Parquet file
.tables # List all tables
.describe <table> # Show table schema
.schema <table> # Show CREATE TABLE statement
.drop <table> # Remove table
# Configuration
.timing # Toggle query timing
.plan # Toggle query plan display
.format <type> # Set output format (table|json|csv)
# Utilities
.help # Show help
.clear # Clear screen
.quit # Exit REPL
# Query execution
qe query -s "SELECT * FROM users" -t users -f data.csv
# JOIN query
qe query -s "SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id" -t orders -f orders.csv
# Table registration
qe register -n users -f data/users.csv -t csv
# Benchmarking
qe bench -q queries/complex.sql -i 1000
# Export results
qe export \
-s "SELECT * FROM sales WHERE amount > 1000" \
-t sales \
-i data/sales.csv \
-o results.parquet \
-f parquet
# Show help
qe --help
-- Basic SELECT
SELECT column1, column2 FROM table_name;
-- WHERE clause
SELECT * FROM users WHERE age > 25 AND status = 'active';
-- JOIN operations (NEW!)
SELECT e.name, e.salary, d.dept_name
FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id;
-- LEFT JOIN
SELECT u.name, o.order_id
FROM users u
LEFT JOIN orders o ON u.id = o.user_id;
-- CROSS JOIN
SELECT p.name, c.category
FROM products p
CROSS JOIN categories c;
-- Multiple JOINs
SELECT e.name, d.dept_name, l.location
FROM employees e
JOIN departments d ON e.dept_id = d.id
JOIN locations l ON d.location_id = l.id;
-- Aggregate functions
SELECT department, COUNT(*), AVG(salary)
FROM employees
GROUP BY department
HAVING AVG(salary) > 50000;
-- ORDER BY and LIMIT
SELECT name, salary
FROM employees
ORDER BY salary DESC
LIMIT 10;
-- Complex query with JOINs and aggregates
SELECT
d.dept_name,
COUNT(e.id) as employee_count,
AVG(e.salary) as avg_salary
FROM departments d
LEFT JOIN employees e ON d.dept_id = e.dept_id
GROUP BY d.dept_name
ORDER BY avg_salary DESC;
- INNER JOIN: Returns only rows with matching keys in both tables
- LEFT JOIN (LEFT OUTER JOIN): Returns all rows from the left table, with matching rows from the right
- RIGHT JOIN (RIGHT OUTER JOIN): Returns all rows from the right table, with matching rows from the left
- FULL JOIN (FULL OUTER JOIN): Returns all rows from both tables, padded with NULLs where no match exists
- CROSS JOIN: Returns the Cartesian product of both tables
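The join semantics above can be sketched with plain Rust collections. This is a simplified, row-at-a-time illustration of FULL OUTER JOIN (not the engine's vectorized implementation); the function name and sample rows are made up:

```rust
use std::collections::{HashMap, HashSet};

// Full outer join of two (key, value) tables: every row from both sides
// appears in the output, padded with None where the other side has no match.
fn full_outer_join(
    left: &[(i64, &'static str)],
    right: &[(i64, &'static str)],
) -> Vec<(Option<&'static str>, Option<&'static str>)> {
    let mut right_by_key: HashMap<i64, Vec<&'static str>> = HashMap::new();
    for (k, v) in right {
        right_by_key.entry(*k).or_default().push(*v);
    }
    let mut matched_keys = HashSet::new();
    let mut out = Vec::new();
    for (k, lv) in left {
        match right_by_key.get(k) {
            Some(rvs) => {
                matched_keys.insert(*k);
                for rv in rvs {
                    out.push((Some(*lv), Some(*rv)));
                }
            }
            // Left-side row with no match: right side becomes NULL (None)
            None => out.push((Some(*lv), None)),
        }
    }
    // Right-side rows that never matched a left key
    for (k, rv) in right {
        if !matched_keys.contains(k) {
            out.push((None, Some(*rv)));
        }
    }
    out
}

fn main() {
    let employees = [(1, "Alice"), (2, "Bob"), (4, "Eve")];
    let departments = [(1, "Engineering"), (2, "Sales"), (3, "HR")];
    for row in full_outer_join(&employees, &departments) {
        println!("{:?}", row);
    }
}
```

LEFT and RIGHT JOIN are the same idea with only one of the two padding passes; INNER JOIN drops both.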
- Arithmetic: +, -, *, /, %
- Comparison: =, !=, <>, <, <=, >, >=
- Logical: AND, OR, NOT
- Functions: COUNT, SUM, AVG, MIN, MAX
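The aggregate functions can be sketched over a single numeric column as follows. This is a scalar illustration, assuming COUNT(column) semantics (NULLs, modeled as None, are skipped); the engine itself evaluates these over Arrow arrays:

```rust
// COUNT, SUM, AVG, MIN, MAX over one column, ignoring NULLs (None).
fn aggregates(column: &[Option<f64>]) -> (usize, f64, Option<f64>, Option<f64>, Option<f64>) {
    let values: Vec<f64> = column.iter().filter_map(|v| *v).collect();
    let count = values.len();
    let sum: f64 = values.iter().sum();
    // AVG of zero rows is NULL, not a division by zero
    let avg = if count > 0 { Some(sum / count as f64) } else { None };
    let min = values
        .iter()
        .fold(None, |m: Option<f64>, &v| Some(m.map_or(v, |m| m.min(v))));
    let max = values
        .iter()
        .fold(None, |m: Option<f64>, &v| Some(m.map_or(v, |m| m.max(v))));
    (count, sum, avg, min, max)
}

fn main() {
    let salaries = [Some(50_000.0), Some(70_000.0), None, Some(60_000.0)];
    let (count, sum, avg, min, max) = aggregates(&salaries);
    println!("COUNT={count} SUM={sum} AVG={avg:?} MIN={min:?} MAX={max:?}");
}
```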
- Integer: INT8, INT16, INT32, INT64
- Unsigned: UINT8, UINT16, UINT32, UINT64
- Float: FLOAT32, FLOAT64
- String: UTF8, BINARY
- Date/Time: DATE32, DATE64, TIMESTAMP
- Boolean: BOOLEAN
- Null: NULL
use query_engine::*;

#[tokio::main]
async fn main() -> Result<()> {
    // Create schema
    let schema = Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("name", DataType::Utf8, false),
        Field::new("age", DataType::Int64, false),
    ]);

    // Create sample data (age values here are illustrative)
    let batch = RecordBatch::try_new(
        Arc::new(schema.to_arrow()),
        vec![
            Arc::new(Int64Array::from(vec![1, 2, 3, 4, 5])),
            Arc::new(StringArray::from(vec!["Alice", "Bob", "Charlie", "Diana", "Eve"])),
            Arc::new(Int64Array::from(vec![30, 25, 35, 28, 32])),
        ],
    )?;

    // Parse and execute query
    let sql = "SELECT name, age FROM users WHERE age > 28";
    let mut parser = Parser::new(sql)?;
    let statement = parser.parse()?;

    // Create logical plan
    let mut planner = Planner::new();
    planner.register_table("users", schema);
    let plan = planner.create_logical_plan(&statement)?;
    println!("Query Plan: {:#?}", plan);

    Ok(())
}
-- Load employee and department data
.load csv data/employees.csv employees
.load csv data/departments.csv departments
-- Simple INNER JOIN
SELECT e.name, e.salary, d.dept_name
FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id;
-- LEFT JOIN to include all employees
SELECT e.name, e.salary, d.dept_name
FROM employees e
LEFT JOIN departments d ON e.dept_id = d.dept_id;
-- JOIN with WHERE clause
SELECT e.name, d.dept_name, d.location
FROM employees e
JOIN departments d ON e.dept_id = d.dept_id
WHERE d.location = 'Building A' AND e.salary > 70000;
-- JOIN with aggregates
SELECT d.dept_name,
COUNT(e.id) as employee_count,
AVG(e.salary) as avg_salary
FROM departments d
LEFT JOIN employees e ON d.dept_id = e.dept_id
GROUP BY d.dept_name
ORDER BY avg_salary DESC;
-- Load data
.load csv data/sales.csv sales
-- Total sales by region
SELECT region, SUM(amount) as total
FROM sales
GROUP BY region
ORDER BY total DESC;
-- Monthly statistics
SELECT
month,
COUNT(*) as orders,
SUM(amount) as revenue,
AVG(amount) as avg_order
FROM sales
GROUP BY month;
-- Top performing products
SELECT
product_name,
COUNT(DISTINCT customer_id) as unique_customers,
SUM(quantity) as total_units,
SUM(amount) as total_revenue,
AVG(amount) as avg_price
FROM sales
GROUP BY product_name
HAVING total_revenue > 10000
ORDER BY total_revenue DESC
LIMIT 10;
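The GROUP BY / ORDER BY pattern in the queries above can be sketched with a hash map. This is a simplified scalar version with made-up region data, not the engine's vectorized aggregation:

```rust
use std::collections::HashMap;

// GROUP BY region, SUM(amount): bucket rows by key, then reduce each bucket.
fn group_sum(rows: &[(&'static str, f64)]) -> Vec<(&'static str, f64)> {
    let mut totals: HashMap<&'static str, f64> = HashMap::new();
    for (region, amount) in rows {
        *totals.entry(*region).or_insert(0.0) += *amount;
    }
    // ORDER BY total DESC
    let mut out: Vec<_> = totals.into_iter().collect();
    out.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    out
}

fn main() {
    let sales = [("north", 120.0), ("south", 80.0), ("north", 30.0)];
    for (region, total) in group_sum(&sales) {
        println!("{region}: {total}");
    }
}
```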
# Start REPL and load data
$ qe
qe> .load csv data/employees.csv employees
✓ Loaded table 'employees' from data/employees.csv (1000 rows, 5 columns)
qe> .load csv data/departments.csv departments
✓ Loaded table 'departments' from data/departments.csv (10 rows, 4 columns)
qe> .describe employees
Table: employees
Source: CSV file: data/employees.csv
Rows: 1000
┌───────────────┬─────────┬──────────┐
│ Column        │ Type    │ Nullable │
├───────────────┼─────────┼──────────┤
│ employee_id   │ Int64   │ NO       │
│ name          │ Utf8    │ NO       │
│ department_id │ Int64   │ NO       │
│ salary        │ Float64 │ NO       │
│ hire_date     │ Date32  │ YES      │
└───────────────┴─────────┴──────────┘
qe> SELECT e.name, e.salary, d.dept_name
FROM employees e
JOIN departments d ON e.department_id = d.dept_id
WHERE e.salary > 80000;
✓ Query parsed and planned successfully!
Planning time: 0.05ms
- Vectorized Execution: SIMD operations via Apache Arrow
- Predicate Pushdown: Filter data early in the pipeline
- Projection Pushdown: Read only required columns
- JOIN Optimization: Efficient hash-based JOIN implementation
- LTO Compilation: Link-time optimization for release builds
- Zero-Copy Operations: Minimize memory allocations
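The hash-based JOIN mentioned above follows the classic build/probe pattern, sketched here in simplified scalar form (the engine itself operates on Arrow record batches; function and sample names are illustrative):

```rust
use std::collections::HashMap;

// INNER hash join: build a hash table on one input (ideally the smaller),
// then probe it with each row of the other input.
fn hash_inner_join<'a>(
    build: &'a [(i64, &'a str)],
    probe: &'a [(i64, &'a str)],
) -> Vec<(&'a str, &'a str)> {
    // Build phase: key -> all build-side payloads with that key
    let mut table: HashMap<i64, Vec<&'a str>> = HashMap::new();
    for (k, v) in build {
        table.entry(*k).or_default().push(*v);
    }
    // Probe phase: emit one output row per matching pair
    let mut out = Vec::new();
    for (k, pv) in probe {
        if let Some(bvs) = table.get(k) {
            for bv in bvs {
                out.push((*bv, *pv));
            }
        }
    }
    out
}

fn main() {
    let departments = [(1, "Engineering"), (2, "Sales")];
    let employees = [(1, "Alice"), (2, "Bob"), (1, "Carol"), (3, "Dave")];
    for (dept, emp) in hash_inner_join(&departments, &employees) {
        println!("{emp} works in {dept}");
    }
}
```

The build phase is O(n) and each probe is an O(1) expected-time lookup, which is why hash joins beat nested loops on equality predicates.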
# Run benchmarks
cargo bench
# Custom benchmark
qe bench -q queries/complex.sql -i 1000
# JOIN benchmark
cargo run --example join_query
Sample Results (your results may vary):
Benchmark Results:
──────────────────────────────────────────────────
Query Type: INNER JOIN
Iterations: 1000
Total time: 1.23s
Average: 1.23ms
Median: 1.18ms
Min: 0.98ms
Max: 3.45ms
95th percentile: 1.67ms
99th percentile: 2.34ms
QPS: 813.01
──────────────────────────────────────────────────
[profile.release]
opt-level = 3 # Maximum optimization
lto = "fat" # Full link-time optimization
codegen-units = 1 # Better optimization
panic = "abort" # Smaller binary
strip = true # Remove debug symbols
# Debug build
cargo build
# Release build with optimizations
cargo build --release
# Build specific crate
cargo build -p query-parser
# Build CLI
cargo build --release -p query-cli
# Run all tests
cargo test
# Run tests with output
cargo test -- --nocapture
# Run specific test
cargo test test_join_queries
# Run tests in specific crate
cargo test -p query-parser
# Simple query example
cargo run --example simple_query
# Aggregate query example
cargo run --example aggregate_query
# JOIN query example (NEW!)
cargo run --example join_query
# Full demo
cargo run --example full_query_demo
# Format code
cargo fmt
# Run clippy lints
cargo clippy
# Check without building
cargo check
Issue: File not found error
# Solution: Use absolute path or relative to current directory
qe> .load csv ./data/users.csv users
Issue: Column not found error
-- Solution: Check table schema
qe> .describe users
Issue: Failed to create logical plan for CROSS JOIN
-- Solution: Make sure table aliases are used correctly
-- Good:
SELECT e.name, d.dept_name FROM employees e CROSS JOIN departments d;
-- Bad:
SELECT name, dept_name FROM employees CROSS JOIN departments;
Issue: Slow query performance
# Solution: Enable query plan to see optimization
qe> .plan
qe> SELECT * FROM large_table WHERE condition;
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request

Development guidelines:
- Write tests for new features
- Follow Rust naming conventions
- Update documentation
- Run cargo fmt and cargo clippy
- Ensure all tests pass
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- Apache Arrow - Columnar in-memory data format
- Apache Parquet - Columnar storage format
- DataFusion - Inspiration for query engine design
- Rust Community - For excellent tooling and libraries
- Author: Darshan Vichhi (Aarambh)
- GitHub: @AarambhDevHub
- Issues: GitHub Issues
- JOIN operations (INNER, LEFT, RIGHT, FULL, CROSS) ✓ Completed!
- Subqueries and CTEs
- Window functions
- User-defined functions (UDFs)
- Index support
- Distributed execution
- Query caching
- Real-time streaming queries
- PostgreSQL protocol compatibility
- Web UI dashboard
Version: 0.1.0 Status: Active Development Stability: Alpha
- ✓ Full JOIN support (INNER, LEFT, RIGHT, FULL OUTER, CROSS)
- ✓ Table aliases and qualified column names
- ✓ Multiple JOIN operations in single query
- ✓ JOIN with WHERE, GROUP BY, ORDER BY clauses
Built with ❤️ in Rust