flight-sql-server-example

An example Flight SQL Server implementation - with DuckDB and SQLite back-ends.


Arrow Flight SQL server - DuckDB / SQLite

Description

This repo demonstrates how to build an Apache Arrow Flight SQL server implementation using DuckDB or SQLite as a backend database.

It enables authentication via middleware and encrypts client connections to the server via TLS.

For more information about Apache Arrow Flight SQL - please see this article.

Option 1 - Running from the published Docker image

Open a terminal, then pull and run the published Docker image, which has everything set up (change "--detach" to "--interactive" if you wish to see stdout on your screen):

docker run --name flight-sql \
           --detach \
           --rm \
           --tty \
           --init \
           --publish 31337:31337 \
           --env FLIGHT_PASSWORD="flight_password" \
           --env PRINT_QUERIES="1" \
           --pull missing \
           voltrondata/flight-sql:latest

The above command will automatically mount a very small TPC-H DuckDB database file.

Optional - open a different database file

When running the Docker image, you can have it open your own DuckDB database file (the database must have been created with DuckDB version 0.7.0).

Prerequisite: DuckDB CLI
Install DuckDB CLI version 0.7.0 - and make sure the executable is on your PATH.

Platform Downloads:
Linux x86-64
Linux arm64 (aarch64)
macOS Universal

In this example, we'll generate a new TPC-H Scale Factor 1 (1GB) database file, and then run the Docker image to mount it:

# Generate a TPC-H database in the host's /tmp directory
duckdb /tmp/tpch_sf1.duckdb << EOF
.bail on
.echo on
SELECT VERSION();
INSTALL tpch;
LOAD tpch;
CALL dbgen(sf=1);
EOF

# Run the flight-sql docker container image - and mount the host's DuckDB database file created above inside the container
docker run --name flight-sql \
           --detach \
           --rm \
           --tty \
           --init \
           --publish 31337:31337 \
           --env FLIGHT_PASSWORD="flight_password" \
           --pull missing \
           --mount type=bind,source=/tmp,target=/opt/flight_sql/data \
           --env DATABASE_FILE_NAME="tpch_sf1.duckdb" \
           voltrondata/flight-sql:latest

Connecting to the server via JDBC

Download the Apache Arrow Flight SQL JDBC driver

You can then use the JDBC driver to connect from your host computer to the locally running Docker Flight SQL server with this JDBC string (change the password value to match the FLIGHT_PASSWORD environment variable if you changed it from the example above):

jdbc:arrow-flight-sql://localhost:31337?useEncryption=true&user=flight_username&password=flight_password&disableCertificateVerification=true
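
If you build this URL in scripts, the query-string portion can be assembled from a dict instead of hand-concatenated. A minimal Python sketch, using only the parameter names shown in the URL above:

```python
from urllib.parse import urlencode

# Connection parameters taken from the example JDBC URL above.
params = {
    "useEncryption": "true",
    "user": "flight_username",
    "password": "flight_password",
    "disableCertificateVerification": "true",  # only for self-signed dev certs
}

# Dicts preserve insertion order (Python 3.7+), so the query string
# comes out in the same order as the example URL.
jdbc_url = "jdbc:arrow-flight-sql://localhost:31337?" + urlencode(params)
print(jdbc_url)
```

This reproduces the example URL exactly, and keeps the password out of copy-pasted strings if you source it from an environment variable instead.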

For instructions on setting up the JDBC driver in DBeaver Community Edition (a popular database IDE) - see this repo.

Note - if you stop and restart the Flight SQL Docker container, then attempt to connect via JDBC with the same password, you may get the error: "Invalid bearer token provided. Detail: Unauthenticated". This happens because the JDBC driver caches the bearer token, which was signed with the previous instance's RSA private key. To work around it, set a different password in the new container via the "FLIGHT_PASSWORD" env var - then use that password to connect via JDBC.

Connecting to the server via the new ADBC Python Flight SQL driver

You can now use the new Apache Arrow Python ADBC Flight SQL driver to query the Flight SQL server. ADBC offers performance advantages over JDBC because it minimizes serialization/deserialization and keeps data in columnar format at every stage.

You can learn more about ADBC and Flight SQL here.

Ensure you have Python 3.9+ installed, then open a terminal, then run:

# Create a Python virtual environment
python3 -m venv ./venv

# Activate the virtual environment
. ./venv/bin/activate

# Install the requirements including the new Arrow ADBC Flight SQL driver
pip install --upgrade pip
pip install pandas pyarrow adbc_driver_flightsql

# Start the python interactive shell
python

In the Python shell - you can then run:

import adbc_driver_flightsql.dbapi as flight_sql

flight_password = "flight_password" # Use an env var in production code!

with flight_sql.connect(uri="grpc+tls://localhost:31337",
                        db_kwargs={"username": "flight_username",
                                   "password": flight_password,
                                   # Not needed if you use a trusted CA-signed TLS cert
                                   "adbc.flight.sql.client_option.tls_skip_verify": "true"
                                   }
                        ) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT n_nationkey, n_name FROM nation WHERE n_nationkey = ?",
                    parameters=[24]
                    )
        x = cur.fetch_arrow_table()
        print(x)

You should see results:

pyarrow.Table
n_nationkey: int32
n_name: string
----
n_nationkey: [[24]]
n_name: [["UNITED STATES"]]

Tear-down

Stop the docker image with:

docker stop flight-sql

Option 2 - Steps to build the solution manually

To build the solution manually and run the SQLite/DuckDB Flight SQL server, you need to set up a new Python 3.8+ virtual environment on your machine. Follow these steps to do so (thanks to David Li!).

  1. Clone the repo, ensure you have Python 3.8+ installed, then create a virtual environment from the root of this repo and install requirements...
git clone https://github.com/voltrondata/flight-sql-server-example --recurse-submodules
cd flight-sql-server-example
python3 -m venv ./venv
. ./venv/bin/activate
pip install --upgrade pip
pip install --requirement ./requirements.txt
  2. Build and install Arrow
scripts/build_arrow.sh
  3. Build and install DuckDB. This is sometimes necessary because the conda compilers seem to link an incompatible glibc into the compiled DuckDB binaries.
scripts/build_duckdb.sh
  4. Get some SQLite3 sample data.
wget https://github.com/lovasoa/TPCH-sqlite/releases/download/v1.0/TPC-H-small.db -O ./data/TPC-H-small.db
  5. Create a DuckDB database.
python "scripts/create_duckdb_database_file.py" \
       --file-name="TPC-H-small.duckdb" \
       --file-path="data" \
       --overwrite-file=true \
       --scale-factor=0.01
  6. Build the Flight SQL Server executable.
. ~/.bashrc
mkdir -p build
pushd build
cmake .. -GNinja -DCMAKE_PREFIX_PATH=$ARROW_HOME/lib/cmake
ninja
popd
  7. Generate TLS certificates for encrypting traffic to/from the Flight SQL server
pushd tls
./gen-certs.sh
popd
  8. Start the Flight SQL server (and print client SQL commands as they run using the --print_queries option)
FLIGHT_PASSWORD="flight_password" ./flight_sql --database_file_name "TPC-H-small.duckdb" --print_queries

Selecting different backends

This option allows choosing from two backends: SQLite and DuckDB. It defaults to DuckDB.

$ FLIGHT_PASSWORD="flight_password" ./flight_sql --database_file_name "TPC-H-small.duckdb"
> Using database file: ../data/TPC-H-small.duckdb
> duckdb server listening on grpc+tls://0.0.0.0:31337

The above call is equivalent to running ./flight_sql -B duckdb or ./flight_sql --backend duckdb. To select SQLite, run:

FLIGHT_PASSWORD="flight_password" ./flight_sql -B sqlite -D "TPC-H-small.db" 

or

FLIGHT_PASSWORD="flight_password" ./flight_sql --backend sqlite --database_file_name "TPC-H-small.db"

The above will produce the following:

> Using database file: ../data/TPC-H-small.db
> sqlite server listening on grpc+tls://0.0.0.0:31337

Print help

To see all the available options run ./flight_sql --help.

./flight_sql --help
Allowed options:
  --help                                produce this help message
  -B [ --backend ] arg (=duckdb)        Specify the database backend. Allowed 
                                        options: duckdb, sqlite.
  -P [ --database_file_path ] arg (=../data)
                                        Specify the search path for the 
                                        database file.
  -D [ --database_file_name ] arg       Specify the database filename (the file
                                        must be in search path)
  -Q [ --print_queries ]                Print queries run by clients to stdout
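
For scripting around the server, the option surface above can be mirrored with Python's argparse. This is a hypothetical sketch for illustration only - the actual server is a C++ binary and this parser is not part of the repo:

```python
import argparse

# Mirror of the flight_sql CLI options shown in the help text above
# (illustrative only; defaults match the help output).
parser = argparse.ArgumentParser(prog="flight_sql")
parser.add_argument("-B", "--backend", default="duckdb",
                    choices=["duckdb", "sqlite"],
                    help="Specify the database backend.")
parser.add_argument("-P", "--database_file_path", default="../data",
                    help="Specify the search path for the database file.")
parser.add_argument("-D", "--database_file_name",
                    help="Specify the database filename (must be in search path).")
parser.add_argument("-Q", "--print_queries", action="store_true",
                    help="Print queries run by clients to stdout.")

# Example: the SQLite invocation shown earlier in this section.
args = parser.parse_args(["-B", "sqlite", "-D", "TPC-H-small.db"])
print(args.backend, args.database_file_name)
```

A wrapper like this makes it easy to validate flag combinations (for example, rejecting an unknown backend) before launching the real server binary.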