RClickhouse
Overview
Yandex Clickhouse is a high-performance relational column-store database to enable big data exploration and 'analytics' scaling to petabytes of data. Methods are provided that enable working with 'Yandex Clickhouse' databases via 'DBI' methods and using 'dplyr'/'dbplyr' idioms.
This R package is a DBI interface for the Yandex Clickhouse database. It provides basic dplyr support by auto-generating SQL-commands using dbplyr and is based on the official C++ Clickhouse Client.
To cite this library, please use the BibTeX entry provided in inst/CITATION. test
Installation
This package is available on CRAN, and thus installable by running:
install.packages("RClickhouse")
You can also install the latest development version directly from github using devtools:
devtools::install_github("IMSMWU/RClickhouse")
Usage
Create a DBI Connection:
Note: please be aware that {RClickhouse} doesn't use a HTTP interface in order to communicate with Clickhouse. Thus, You may use the native interface port (by default 9000) instead of the HTTP interface (8123).
con <- DBI::dbConnect(RClickhouse::clickhouse(), host="example-db.com")
Write data to the database:
DBI::dbWriteTable(con, "mtcars", mtcars)
dbListTables(con)
dbListFields(con, "mtcars")
dplyr:
Query a database usinglibrary(dplyr)
tbl(con, "mtcars") %>%
group_by(cyl) %>%
summarise(smpg=sum(mpg))
tbl(con, "mtcars") %>%
filter(cyl == 8, vs == 0) %>%
group_by(am) %>%
summarise(mean(qsec))
# Close the connection
dbDisconnect(con)
SQL-style commands with DBI::dbGetQuery
:
Query a database using DBI::dbGetQuery(con, "SELECT
vs
,COUNT(*) AS 'number of cases'
,AVG(qsec) AS 'average qsec'
FROM mtcars
GROUP BY vs")
# Save results of querying:
res <- DBI::dbGetQuery(con, "SELECT (*)
FROM mtcars
WHERE am = 1")
# Or save the whole set of data (only useful for smaller datasets, for better performance and for larger datasets always use remote servers):
mtcars <- dbReadTable(con, mtcars)
# Close the connection
dbDisconnect(con)
ClickHouse functions
Query a database using# Get the names of all the avaliable databases
DBI::dbGetQuery(con, "SHOW DATABASES")
# Get information about the variable names and types
DBI::dbGetQuery(con, "DESCRIBE TABLE mtcars")
# Compact CASE - WHEN - THEN conditionals
DBI::dbGetQuery(con, "SELECT multiIf(am='1', 'automatic', 'manual') AS 'transmission'
,multiIf(vs='1', 'straight', 'V-shaped') AS 'engine'
FROM mtcars")
# Close the connection
dbDisconnect(con)
Config File
You may use a config file that is looked up for automatic initialization of the dbConnect parameters.
To do so, create a yaml file (default RClickhouse.yaml
), in at least one directory (default lookup paths of parameter config_paths: ./RClickhouse.yaml, ~/.R/RClickhouse.yaml, /etc/RClickhouse.yaml
), e.g. ~/.R/configs/RClickhouse.yaml
and pass a vector of the corresponding file paths to dbConnect
as config_paths
parameter.
In RClickhouse.yaml
, you may specify a variable number of parameters (host, port, db, user, password, compression
) to be initialized using the following format (example):
host: example-db.com
port: 1111
The actual initialization of the parameters of dbConnect
follows a hierarchical structure with varying priorities (1 to 3, where 1 is highest):
- Specified input parameters when calling
dbConnect
. If parameters are unspecified, fall back to (2) - Parameters specified in
RClickhouse.yaml
, where the level of priority depends on the position of the path in the config_path input vector (first position, highest priority etc.). If parameters are unspecified, fall back to (3). - Default parameters (
host="localhost", port = 9000, db = "default", user = "default", password = "", compression = "lz4"
).
Acknowledgements
Big thanks to Kirill Müller, Maxwell Peterson, Artemkin Pavel and Hannes Mühleisen.