- Source code - Github
- Author - Gavin Noronha - gavinln@hotmail.com
This project provides an Ubuntu (20.04) Vagrant Virtual Machine (VM) with Clickhouse. Clickhouse is an open-source column-oriented DBMS (columnar database management system) for online analytical processing (OLAP).
There are Ansible scripts that automatically install the software when the VM is started.
The installed software exceeds the standard 10GB size of the virtual machine disk. Install the following plugin to resize the disk.
- Check the Vagrantfile
vagrant validate
- List the vagrant plugins
vagrant plugin list
- Install the Vagrant vbguest plugin for virtualbox guest
vagrant plugin install vagrant-vbguest
- Update vbguest
vagrant plugin update vagrant-vbguest
- Start the virtual machine
vagrant up
- Login to the virtual machine
vagrant ssh
- Change to the clickhouse directory
cd /vagrant/clickhouse
- Upgrade clickhouse
sudo apt update
sudo apt install clickhouse-client clickhouse-server
- Start the Clickhouse database
sudo service clickhouse-server restart
- Start the clickhouse client. It displays the client and server versions.
clickhouse-client -m
- List databases
show databases;
- Use a database
use default;
- List tables in database
show tables;
- Create a database
CREATE DATABASE test_example;
- Use the new database
USE test_example;
- Create a table
CREATE TABLE visits (
id UInt64,
duration Float64,
url String,
created DateTime
) ENGINE = MergeTree()
PRIMARY KEY id
ORDER BY id;
- Insert data into tables
INSERT INTO visits VALUES (1, 10.5, 'http://example.com',
'2019-01-01 00:01:01');
INSERT INTO visits VALUES (2, 40.2, 'http://example1.com',
'2019-01-03 10:01:01');
INSERT INTO visits VALUES (3, 13, 'http://example2.com',
'2019-01-03 12:01:01');
INSERT INTO visits VALUES (4, 2, 'http://example3.com',
'2019-01-04 02:01:01');
- Query the table
SELECT url, duration
FROM visits
WHERE url = 'http://example2.com' LIMIT 2;
- Run an aggregation query
SELECT SUM(duration) FROM visits;
- Run a query that returns an array
SELECT topK(2) (url) FROM visits;
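- The same queries can be run from Python. This is a minimal sketch assuming the clickhouse-driver package is installed (pip install clickhouse-driver); it is not necessarily part of the provisioned VM.
from clickhouse_driver import Client

# connect to the local Clickhouse server started above
client = Client("localhost")

# aggregate query, equivalent to SELECT SUM(duration) FROM visits
print(client.execute("SELECT SUM(duration) FROM test_example.visits"))

# topK returns an array column
print(client.execute("SELECT topK(2)(url) FROM test_example.visits"))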
- Drop the table
DROP table visits;
- Drop the database
DROP database test_example;
- Use database
use default;
- Create the table
CREATE TABLE stock (
plant Int16,
code Int16,
service_level Float32,
qty Int8
) ENGINE = MergeTree()
ORDER by plant;
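- If the sample files used below are not available, a small stock-example.csv and stock-example.parq can be generated with pandas. This is a sketch with made-up values matching the stock table schema; it assumes pandas and pyarrow are installed.
import pandas as pd

# made-up example rows matching the stock table columns
df = pd.DataFrame(
    {
        "plant": [1, 1, 2],
        "code": [100, 101, 100],
        "service_level": [0.95, 0.90, 0.99],
        "qty": [5, 3, 7],
    }
).astype({"plant": "int16", "code": "int16", "service_level": "float32", "qty": "int8"})

df.to_csv("stock-example.csv", header=False, index=False)  # CSV without a header row
df.to_parquet("stock-example.parq", index=False)           # needs pyarrow or fastparquet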
- Load csv data into the table
cat stock-example.csv | clickhouse-client --query="INSERT INTO stock FORMAT CSV";
- Display the data
select * from stock;
- Get number of rows
select count(*) from stock;
- Load parquet data into the table
cat stock-example.parq | clickhouse-client --query="INSERT INTO stock FORMAT Parquet";
- Display the data
select * from stock;
- Schedule a merge of parts of a table
optimize table stock;
- List databases
show databases;
- Use the system database;
use system;
- Get list of tables
select database, name from tables where database = 'default';
- Get list of columns
select database, table, name, type from columns where database = 'default';
- Get table and column names
select database, table, name, type
from columns
where database = 'default'
and table like 'flight%';
- Add Clickhouse GPG key
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E0C56BD4
- Add repository to APT repository list
echo "deb https://repo.clickhouse.tech/deb/stable/ main/" | sudo tee /etc/apt/sources.list.d/clickhouse.list
- Update packages
sudo apt update
- Install the clickhouse client
sudo apt install clickhouse-client
- Start the clickhouse client
export CH_SERVER=
clickhouse-client -h $CH_SERVER -d default -m # multiline mode
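- The same connection can be made from Python. This is a sketch using the clickhouse-driver package (assumed installed); CH_SERVER must be set as above.
import os
from clickhouse_driver import Client

# CH_SERVER is the server address exported above
client = Client(host=os.environ["CH_SERVER"], database="default")
print(client.execute("show databases"))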
- Install Duck DB cli
wget https://github.com/cwida/duckdb/releases/download/v0.2.1/duckdb_cli-linux-amd64.zip
- Unzip Duck DB cli
unzip duckdb_cli-linux-amd64.zip
- Start Duck DB cli saving database in a file
./duckdb tmp3-duck.db
- Create a table by reading from a Parquet file
create table flight as select * from parquet_scan('test.parquet');
- Exit cli
.quit
- Setup AWS configuration
source ./do_not_checkin/s3-user-setup.sh
- Setup AWS completion
complete -C aws_completer aws
- Describe spot price history
aws ec2 describe-spot-price-history --instance-types t3.2xlarge
- Copy file from S3 - 11 seconds
aws s3 cp s3://airline-parq/2008_cleaned.gzip.parq .
- Query parquet file metadata with DuckDB from Python
import duckdb

con = duckdb.connect(database=":memory:")
# parquet_metadata returns one row per column chunk in the file
sql = "select count(*) from parquet_metadata('scripts/ontime-100.pq')"
df = con.execute(sql).fetchdf()
# parquet_schema returns one row per element in the file schema
sql = "select count(*) from parquet_schema('scripts/ontime-100.pq')"
df = con.execute(sql).fetchdf()
- Download prepared Clickhouse database partitions for airlines ontime data
curl -O https://datasets.clickhouse.tech/ontime/partitions/ontime.tar
sudo tar xvf ontime.tar -C /var/lib/clickhouse # path to ClickHouse data directory
# check permissions of unpacked data, fix if required
sudo service clickhouse-server restart
clickhouse-client --query "select count(*) from datasets.ontime"
- Extract data to parquet format
# select * from datasets.ontime limit 1;
SQL="select * from datasets.ontime limit 100 FORMAT Parquet"
clickhouse-client --query="$SQL" > ontime-100.pq
SQL="select Year, FlightDate, Carrier, FlightNum from datasets.ontime order by Year limit 50000000 FORMAT Parquet"
clickhouse-client --query="$SQL" > scripts/ontime-50m.pq
SQL="select Year, FlightDate, Carrier, FlightNum from datasets.ontime order by Year limit 100000000 FORMAT Parquet"
clickhouse-client --query="$SQL" > scripts/ontime-100m.pq
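- Inspect an exported file with pyarrow. This is a sketch; pyarrow is assumed to be available in the Python environment.
import pyarrow.parquet as pq

pf = pq.ParquetFile("ontime-100.pq")
print(pf.schema_arrow)       # column names and types
print(pf.metadata.num_rows)  # number of rows in the file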
- Clickhouse local query
SQL="select Year, count(*) ct from file('scripts/ontime-10m.pq', Parquet, 'Year UInt16, FlightDate Date, Carrier String, FlightNum String') group by Year"
clickhouse-local --query "$SQL"
clickhouse-client -m
clickhouse-local -q \
"select count(*) from file('/vagrant/scripts/ontime-100m.parquet', Parquet, 'Year UInt16, FlightDate Date, Carrier String, FlightNum String') group by Year"
Clickhouse is installed directly in the Vagrant virtual machine without using Docker containers. Copy the parquet files to the home directory (~) for the best performance.
- Login to the VM
vagrant ssh
- Setup the Python virtual environment
cd /vagrant
pipenv install --skip-lock
pipenv shell
- Measure the performance using a 10 million row dataset
cp scripts/ontime-10m.parquet ~/
cd python
python parq-cli.py pandas ~/ontime-10m.parquet # 6.7s
python parq-cli.py duck-pandas ~/ontime-10m.parquet # 0.7s
python parq-cli.py duck-arrow ~/ontime-10m.parquet # 0.8s
python parq-cli.py arrow-parquet ~/ontime-10m.parquet # 0.5s
python parq-cli.py arrow-dataset-parquet ~/ontime-10m.parquet # 0.4s
python parq-cli.py polars-parquet ~/ontime-10m.parquet # 0.5s
- Measure the performance using a 50 million row dataset
python parq-cli.py duck ~/ontime-50m.parquet # 4.9s
python parq-cli.py ch_local ~/ontime-50m.parquet # 4.3s
- Measure the performance using a 100 million row dataset
cp scripts/ontime-100m.parquet ~/
cd python
python parq-cli.py pandas ~/ontime-100m.parquet # 62s
python parq-cli.py duck-pandas ~/ontime-100m.parquet # 7s
python parq-cli.py duck-arrow ~/ontime-100m.parquet # 8s
python parq-cli.py arrow-parquet ~/ontime-100m.parquet # 5s
python parq-cli.py arrow-dataset-parquet ~/ontime-100m.parquet # 5s
python parq-cli.py polars-parquet ~/ontime-100m.parquet # 5s
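- parq-cli.py is in the repository and is not reproduced here; the sketch below only illustrates the kind of pandas vs duckdb timing reported above (hypothetical code, not the actual script; assumes pandas and duckdb are installed).
import os
import sys
import time

import duckdb
import pandas as pd

def time_it(label, fn):
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.1f}s")

path = os.path.expanduser(sys.argv[1])  # e.g. ~/ontime-10m.parquet

# pandas: read the whole file into a DataFrame before aggregating
time_it("pandas", lambda: pd.read_parquet(path).groupby("Year").size())

# duck-pandas: DuckDB scans and aggregates, returning a pandas DataFrame
time_it(
    "duck-pandas",
    lambda: duckdb.connect()
    .execute(f"select Year, count(*) from parquet_scan('{path}') group by Year")
    .fetchdf(),
)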
- Polars vs Pandas dataframe
- Introduction to polars
- Hamilton - https://github.com/stitchfix/hamilton
- Daft - https://github.com/Eventual-Inc/Daft
- Setup Clickhouse on Debian
- Clickhouse ORM for Python
- Clickhouse to pandas
- Clickhouse Python driver with native support
- [Aggregating merge tree]
- Python child processes buffering
- Python pipes
- Clickhouse machine learning
- Clickhouse at Infovista
- Clickhouse at Messagebird
- Clickhouse funnel analysis
- Clickhouse discussion
- Clickhouse administration
- Introduction to Clickhouse
[Clickhouse arrays - Part 2][1230]
[1230] https://altinity.com/blog/harnessing-the-power-of-clickhouse-arrays-part-2
- Prepared partitions of airline ontime data
- Druid for funnel analysis (video)