- Source code - Github
- Author - Gavin Noronha - gavinln@hotmail.com
This project provides an Ubuntu (20.04) Vagrant Virtual Machine (VM) with Clickhouse. Clickhouse is an open-source column-oriented DBMS (columnar database management system) for online analytical processing (OLAP).
There are Ansible scripts that automatically install the software when the VM is started.
The installed software exceeds the standard 10GB size of the virtual machine disk. Install the following plugin to resize the disk.
- Check the Vagrantfile
vagrant validate
- List the vagrant plugins
vagrant plugin list
- Install the Vagrant vbguest plugin for virtualbox guest
vagrant plugin install vagrant-vbguest
- Update vbguest
vagrant plugin update vagrant-vbguest
- Start the virtual machine
vagrant up
- Login to the virtual machine
vagrant ssh
- Change to the clickhouse directory
cd /vagrant/clickhouse
- Upgrade clickhouse
sudo apt update
sudo apt install clickhouse-client clickhouse-server
- Start the Clickhouse database
sudo service clickhouse-server restart
- Start the clickhouse client. It displays the client and server versions.
clickhouse-client -m
- List databases
show databases;
- Use a database
use default;
- List tables in database
show tables;
- Create a database
CREATE DATABASE test_example;
- Use the new database
USE test_example;
- Create a table
CREATE TABLE visits (
id UInt64,
duration Float64,
url String,
created DateTime
) ENGINE = MergeTree()
PRIMARY KEY id
ORDER BY id;
- Insert data into tables
INSERT INTO visits VALUES (1, 10.5, 'http://example.com',
'2019-01-01 00:01:01');
INSERT INTO visits VALUES (2, 40.2, 'http://example1.com',
'2019-01-03 10:01:01');
INSERT INTO visits VALUES (3, 13, 'http://example2.com',
'2019-01-03 12:01:01');
INSERT INTO visits VALUES (4, 2, 'http://example3.com',
'2019-01-04 02:01:01');
- Query the table
SELECT url, duration
FROM visits
WHERE url = 'http://example2.com' LIMIT 2;
- Run an aggregation query
SELECT SUM(duration) FROM visits;
- Run a query that returns an array
SELECT topK(2) (url) FROM visits;
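- The same queries can be run from Python. This is a minimal sketch assuming the clickhouse-driver package is installed (pip install clickhouse-driver); it is not necessarily part of the provisioned VM.
from clickhouse_driver import Client

# connect to the local Clickhouse server started above
client = Client("localhost")

# aggregate query, equivalent to SELECT SUM(duration) FROM visits
print(client.execute("SELECT SUM(duration) FROM test_example.visits"))

# topK returns an array column
print(client.execute("SELECT topK(2)(url) FROM test_example.visits"))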
- Drop the table
DROP table visits;
- Drop the database
DROP database test_example;
- Use database
use default;
- Create the table
CREATE TABLE stock (
plant Int16,
code Int16,
service_level Float32,
qty Int8
) ENGINE = MergeTree()
ORDER by plant;
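- If the sample files used below are not available, a small stock-example.csv and stock-example.parq can be generated with pandas. This is a sketch with made-up values matching the stock table schema; it assumes pandas and pyarrow are installed.
import pandas as pd

# made-up example rows matching the stock table columns
df = pd.DataFrame(
    {
        "plant": [1, 1, 2],
        "code": [100, 101, 100],
        "service_level": [0.95, 0.90, 0.99],
        "qty": [5, 3, 7],
    }
).astype({"plant": "int16", "code": "int16", "service_level": "float32", "qty": "int8"})

df.to_csv("stock-example.csv", header=False, index=False)  # CSV without a header row
df.to_parquet("stock-example.parq", index=False)           # needs pyarrow or fastparquet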
- Load csv data into the table
cat stock-example.csv | clickhouse-client --query="INSERT INTO stock FORMAT CSV";
- Display the data
select * from stock;
- Get number of rows
select count(*) from stock;
- Load parquet data into the table
cat stock-example.parq | clickhouse-client --query="INSERT INTO stock FORMAT Parquet";
- Display the data
select * from stock;
- Schedule a merge of parts of a table
optimize table stock;
- List databases
show databases;
- Use the system database;
use system;
- Get list of tables
select database, name from tables where database = 'default';
- Get list of columns
select database, table, name, type from columns where database = 'default';
- Get table and column names
select database, table, name, type
from columns
where database = 'default'
and table like 'flight%';
- Add Clickhouse GPG key
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E0C56BD4
- Add repository to APT repository list
echo "deb https://repo.clickhouse.tech/deb/stable/ main/" | sudo tee /etc/apt/sources.list.d/clickhouse.list
- Update packages
sudo apt update
- Install the clickhouse client
sudo apt install clickhouse-client
- Start the clickhouse client
export CH_SERVER=
clickhouse-client -h $CH_SERVER -d default -m # multiline mode
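- The same connection can be made from Python. This is a sketch using the clickhouse-driver package (assumed installed); CH_SERVER must be set as above.
import os
from clickhouse_driver import Client

# CH_SERVER is the server address exported above
client = Client(host=os.environ["CH_SERVER"], database="default")
print(client.execute("show databases"))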
- Install Duck DB cli
wget https://github.com/cwida/duckdb/releases/download/v0.2.1/duckdb_cli-linux-amd64.zip
- Unzip Duck DB cli
unzip duckdb_cli-linux-amd64.zip
- Start Duck DB cli saving database in a file
./duckdb tmp3-duck.db
- Create a table by reading from a Parquet file
create table flight as select * from parquet_scan('test.parquet');
- Exit cli
.quit
- Setup AWS configuration
source ./do_not_checkin/s3-user-setup.sh
- Setup AWS completion
complete -C aws_completer aws
- Describe spot price history
aws ec2 describe-spot-price-history --instance-types t3.2xlarge
- Copy file from S3 - 11 seconds
aws s3 cp s3://airline-parq/2008_cleaned.gzip.parq .
- Query parquet file metadata with DuckDB from Python
import duckdb

con = duckdb.connect(database=":memory:")
# parquet_metadata returns one row per column chunk in the file
sql = "select count(*) from parquet_metadata('scripts/ontime-100.pq')"
df = con.execute(sql).fetchdf()
# parquet_schema returns one row per element in the file schema
sql = "select count(*) from parquet_schema('scripts/ontime-100.pq')"
df = con.execute(sql).fetchdf()
- Download prepared Clickhouse database partitions for airlines ontime data
curl -O https://datasets.clickhouse.tech/ontime/partitions/ontime.tar
sudo tar xvf ontime.tar -C /var/lib/clickhouse # path to ClickHouse data directory
# check permissions of unpacked data, fix if required
sudo service clickhouse-server restart
clickhouse-client --query "select count(*) from datasets.ontime"
- Extract data to parquet format
# select * from datasets.ontime limit 1;
SQL="select * from datasets.ontime limit 100 FORMAT Parquet"
clickhouse-client --query="$SQL" > ontime-100.pq
SQL="select Year, FlightDate, Carrier, FlightNum from datasets.ontime order by Year limit 50000000 FORMAT Parquet"
clickhouse-client --query="$SQL" > scripts/ontime-50m.pq
SQL="select Year, FlightDate, Carrier, FlightNum from datasets.ontime order by Year limit 100000000 FORMAT Parquet"
clickhouse-client --query="$SQL" > scripts/ontime-100m.pq
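- Inspect an exported file with pyarrow. This is a sketch; pyarrow is assumed to be available in the Python environment.
import pyarrow.parquet as pq

pf = pq.ParquetFile("ontime-100.pq")
print(pf.schema_arrow)       # column names and types
print(pf.metadata.num_rows)  # number of rows in the file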
- Clickhouse local query
SQL="select Year, count(*) ct from file('scripts/ontime-10m.pq', Parquet, 'Year UInt16, FlightDate Date, Carrier String, FlightNum String') group by Year"
clickhouse-local --query "$SQL"
clickhouse-client -m
clickhouse-local -q \
"select count(*) from file('/vagrant/scripts/ontime-100m.parquet', Parquet, 'Year UInt16, FlightDate Date, Carrier String, FlightNum String') group by Year"
Clickhouse is installed directly in the Vagrant virtual machine without using Docker containers. Copy the parquet files to the home directory (~) for the best performance.
- Login to the VM
vagrant ssh
- Setup the Python virtual environment
cd /vagrant
pipenv install --skip-lock
pipenv shell
- Measure the performance using a 10 million row dataset
cp scripts/ontime-10m.parquet ~/
cd python
python parq-cli.py pandas ~/ontime-10m.parquet # 6.7s
python parq-cli.py duck-pandas ~/ontime-10m.parquet # 0.7s
python parq-cli.py duck-arrow ~/ontime-10m.parquet # 0.8s
python parq-cli.py arrow-parquet ~/ontime-10m.parquet # 0.5s
python parq-cli.py arrow-dataset-parquet ~/ontime-10m.parquet # 0.4s
python parq-cli.py polars-parquet ~/ontime-10m.parquet # 0.5s
- Measure the performance using a 50 million row dataset
python parq-cli.py duck ~/ontime-50m.parquet # 4.9s
python parq-cli.py ch_local ~/ontime-50m.parquet # 4.3s
- Measure the performance using a 100 million row dataset
cp scripts/ontime-100m.parquet ~/
cd python
python parq-cli.py pandas ~/ontime-100m.parquet # 62s
python parq-cli.py duck-pandas ~/ontime-100m.parquet # 7s
python parq-cli.py duck-arrow ~/ontime-100m.parquet # 8s
python parq-cli.py arrow-parquet ~/ontime-100m.parquet # 5s
python parq-cli.py arrow-dataset-parquet ~/ontime-100m.parquet # 5s
python parq-cli.py polars-parquet ~/ontime-100m.parquet # 5s
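- parq-cli.py is in the repository and is not reproduced here; the sketch below only illustrates the kind of pandas vs duckdb timing reported above (hypothetical code, not the actual script; assumes pandas and duckdb are installed).
import os
import sys
import time

import duckdb
import pandas as pd

def time_it(label, fn):
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.1f}s")

path = os.path.expanduser(sys.argv[1])  # e.g. ~/ontime-10m.parquet

# pandas: read the whole file into a DataFrame before aggregating
time_it("pandas", lambda: pd.read_parquet(path).groupby("Year").size())

# duck-pandas: DuckDB scans and aggregates, returning a pandas DataFrame
time_it(
    "duck-pandas",
    lambda: duckdb.connect()
    .execute(f"select Year, count(*) from parquet_scan('{path}') group by Year")
    .fetchdf(),
)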
- Polars vs Pandas dataframe
- Introduction to polars
- Hamilton - https://github.com/stitchfix/hamilton
- Daft - https://github.com/Eventual-Inc/Daft
- Setup Clickhouse on Debian
- Clickhouse ORM for Python
- Clickhouse to pandas
- Clickhouse Python driver with native support
- [Aggregating merge tree]
- Python child processes buffering
- Python pipes
- Clickhouse machine learning
- Clickhouse at Infovista
- Clickhouse at Messagebird
- Clickhouse funnel analysis
- Clickhouse discussion
- Clickhouse administration
- Introduction to Clickhouse
[Clickhouse arrays - Part 2][1230]
[1230] https://altinity.com/blog/harnessing-the-power-of-clickhouse-arrays-part-2
- Prepared partitions of airline ontime data
- Druid for funnel analysis (video)