A tutorial of Apache Calcite for the BOSS'21 VLDB workshop.
In this tutorial, we demonstrate the main components of Calcite and how they interact with each other. To do this we build, step-by-step, a fully fledged query processor for data residing in Lucene indexes, and gradually introduce various extensions covering some common use-cases appearing in practice.
The project has three modules:
indexer
, containing the necessary code to populate some sample dataset(s) into Lucene to demonstrate the capabilities of the query processor;solution
, containing the material of the tutorial fully implemented along with a few unit tests ensuring the correctness of the code;template
, containing only the skeleton and documentation of selected classes, which the attendees can use to follow the real-time implementation of the Lucene query processor.
- JDK version >= 8
To compile the project, run:
./mvnw package -DskipTests
To load/index the TPC-H dataset in Lucene, run:
java -jar indexer/target/indexer-1.0-SNAPSHOT-jar-with-dependencies.jar
The indexer creates the data under target/tpch
directory. The TPC-H dataset was generated using
the dbgen command line utility (dbgen -s 0.001
) provided in the original
TPC-H tools bundle.
To execute SQL queries over the data in Lucene, and get a feeling of how the finished query processor looks like, run:
java -jar solution/target/solution-1.0-SNAPSHOT-jar-with-dependencies.jar SIMPLE queries/tpch/Q0.sql
java -jar solution/target/solution-1.0-SNAPSHOT-jar-with-dependencies.jar ADVANCED queries/tpch/Q0.sql
java -jar solution/target/solution-1.0-SNAPSHOT-jar-with-dependencies.jar PUSHDOWN queries/tpch/Q0.sql
The finished query processor provides three execution modes, representing the three main sections which are covered in this tutorial.
You can use one of the predefined queries under queries/tpch
directory or create a new file
and write your own.
In SIMPLE
mode, the query processor does not do any advanced optimization and shows how easy it
is to build an adapter from scratch with very few lines of customized code by relying on the
built-in operators of the EnumerableConvention
and the ScannableTable
interface.
In ADVANCED
mode, the query processor is able to combine operators with different characteristics
demonstrating the most common implementation pattern of an adapter and sets the bases for building
federation query engines using Calcite. In this mode, we combine two kinds of operators using the
built-in EnumerableConvention
and the custom LuceneRel#LUCENE
convention along with some basic
optimization rules.
In PUSHDOWN
mode, the query processor combines operators with different characteristics and is
also capable of pushing simple filtering conditions to the underlying engine by introducing
custom rules, expression transformations, and additional operators.