Study of dplyr SQL generation
Opened this issue ยท 7 comments
The purpose of this issue is to provide a description of the internals of dplyr's SQL generation. I intend it to serve as reference/inspiration for the development of such functionality in this package.
We first create a local SQLite3 database according to the vignette :
> my_db <- src_sqlite("my_db.sqlite3", create = TRUE)
> library(nycflights13)
> flights_sqlite <- copy_to(my_db, flights, temporary = FALSE, indexes = c("year", "month", "day"), "carrier", "tailnum"))
(We are not concerned with the generation of indices at this point.)
It is worth examining the structure of flights_sqlite
object:
> attributes(flights_sqlite)
$names
[1] "src" "ops"
$class
[1] "tbl_sqlite" "tbl_sql" "tbl_lazy" "tbl"
> str(flights_sqlite)
List of 2
$ src:List of 2
..$ con :Formal class 'SQLiteConnection' [package "RSQLite"] with 5 slots
.. .. ..@ Id :<externalptr>
.. .. ..@ dbname : chr "my_db.sqlite3"
.. .. ..@ loadable.extensions: logi TRUE
.. .. ..@ flags : int 6
.. .. ..@ vfs : chr ""
..$ path: chr "my_db.sqlite3"
..- attr(*, "class")= chr [1:3] "src_sqlite" "src_sql" "src"
$ ops:List of 3
..$ src :List of 2
.. ..$ con :Formal class 'SQLiteConnection' [package "RSQLite"] with 5 slots
.. .. .. ..@ Id :<externalptr>
.. .. .. ..@ dbname : chr "my_db.sqlite3"
.. .. .. ..@ loadable.extensions: logi TRUE
.. .. .. ..@ flags : int 6
.. .. .. ..@ vfs : chr ""
.. ..$ path: chr "my_db.sqlite3"
.. ..- attr(*, "class")= chr [1:3] "src_sqlite" "src_sql" "src"
..$ x :Classes 'ident', 'sql', 'character' chr "flights"
..$ vars: chr [1:19] "year" "month" "day" "dep_time" ...
..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
- attr(*, "class")= chr [1:4] "tbl_sqlite" "tbl_sql" "tbl_lazy" "tbl"
We will see that the data manipulation commands provided by dplyr dispatch on the tbl_lazy
class in flights_sqlite
's class structure and are accumulated in its ops
field.
For instance, let's trace the cascade of functions resulting from a call to
c1 <- filter(flights_sqlite, year == 2013, month == 1, day == 1)
The adventure starts in R/manip.R:
filter <- function(.data, ...) {
filter_(.data, .dots = lazyeval::lazy_dots(...))
}
The filtering expressions (in this case year == 2013
, month == 1
, day == 1
) are passed to lazyeval
, which produces the following lazy_dots
object
> .dots <- lazyeval::lazy_dots(year == 2013, month == 1, day == 1)
> str(.dots)
List of 3
$ :List of 2
..$ expr: language year == 2013
..$ env :<environment: R_GlobalEnv>
..- attr(*, "class")= chr "lazy"
$ :List of 2
..$ expr: language month == 1
..$ env :<environment: R_GlobalEnv>
..- attr(*, "class")= chr "lazy"
$ :List of 2
..$ expr: language day == 1
..$ env :<environment: R_GlobalEnv>
..- attr(*, "class")= chr "lazy"
- attr(*, "class")= chr "lazy_dots"
which encodes the information (i.e. the un-evaluated filtering expressions) passed to the original filter
command. The original data source, i.e. flights_sqlite
and .dots
are then passed to the eagerly-evaluating filter_
generic.
To find the specific method called on (flights_sqlite, .dots)
, we need to identify the class structure of flights_sqlite
:
> class(flights_sqlite)
[1] "tbl_sqlite" "tbl_sql" "tbl_lazy" "tbl"
(TODO: understand how flights_sqlite
comes to have this class structure.)
The relevant class, i.e. the one on which filter_
dispatches, turns out to be tbl_lazy
, and so we examine the relevant method given in R/tbl_lazy.R:
filter_.tbl_lazy <- function(.data, ..., .dots) {
dots <- lazyeval::all_dots(.dots, ...)
add_op_single("filter", .data, dots = dots)
}
Since the ...
argument is empty, the dots
object returned by the lazyeval::all_dots
call is "equivalent" (i.e., has the same structure and components) to the original .dots
argument.
The purpose of filter_.tbl_lazy
is ultimately to collect the information about the filtering expressions and store it in an op
object, which is in turn stored in the ops
field of the original tbl_lazy
data source, i.e. flights_sqlite
. This takes us to the implementation of add_op_single
in the land of R/lazy_ops.R
:
add_op_single <- function(name, .data, dots = list(), args = list()) {
.data$ops <- op_single(name, x = .data$ops, dots = dots, args = args)
.data
}
(We note that the ops
field of flights_sqlite
is initiated with the op_base_remote
method).
op_single
in turn returns an op
object whose most specific class is determined by the manipulation command from which the call to add_op_single
originated:
op_single <- function(name, x, dots = list(), args = list()) {
structure(
list(
name = name,
x = x,
dots = dots,
args = args
),
class = c(paste0("op_", name), "op_single", "op")
)
}
Thus we find that add_op_single
(the "single" refers to the fact that filter
is a single-table verb) modifies the ops
field of the .data
argument, i.e. flights_sqlite
. In particular, since flights_sqlite$ops
is passed as the x
argument to op_single
, the resultant op_filter
object "points to" the original op_base_remote
object initialized at creation of flights_sqlite
. In this way dplyr lazily accumulates manipulation commands applied to remote data sources in a graph structure that can then be analyzed, flattened, and translated into SQL.
Indeed, if we examine the ops
field of c1
, we see that its value is an op_filter
object that contains both the original op_base_remote
object as well as all the requisite information about the filtering expressions passed to filter
above:
> c1$ops
Source: sqlite 3.8.6 [my_db.sqlite3]
From: flights
<Table: flights>
-> filter()
- year == 2013
- month == 1
- day == 1
> str(c1$ops)
List of 4
$ name: chr "filter"
$ x :List of 3
..$ src :List of 2
.. ..$ con :Formal class 'SQLiteConnection' [package "RSQLite"] with 5 slots
.. .. .. ..@ Id :<externalptr>
.. .. .. ..@ dbname : chr "my_db.sqlite3"
.. .. .. ..@ loadable.extensions: logi TRUE
.. .. .. ..@ flags : int 6
.. .. .. ..@ vfs : chr ""
.. ..$ path: chr "my_db.sqlite3"
.. ..- attr(*, "class")= chr [1:3] "src_sqlite" "src_sql" "src"
..$ x :Classes 'ident', 'sql', 'character' chr "flights"
..$ vars: chr [1:19] "year" "month" "day" "dep_time" ...
..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
$ dots:List of 3
..$ :List of 2
.. ..$ expr: language year == 2013
.. ..$ env :<environment: R_GlobalEnv>
.. ..- attr(*, "class")= chr "lazy"
..$ :List of 2
.. ..$ expr: language month == 1
.. ..$ env :<environment: R_GlobalEnv>
.. ..- attr(*, "class")= chr "lazy"
..$ :List of 2
.. ..$ expr: language day == 1
.. ..$ env :<environment: R_GlobalEnv>
.. ..- attr(*, "class")= chr "lazy"
..- attr(*, "class")= chr "lazy_dots"
$ args: list()
- attr(*, "class")= chr [1:3] "op_filter" "op_single" "op"
To be continued.
What about this going into the abstract tables interface instead? Then Jplyr can just call those generic methods. That would be more consistent with dataframes and DBs with specific SQL dialects can just implement the interface themselves.
Though it seems to require a dag/lazy eval, which brings us back to this package.
What about this going into the abstract tables interface instead?
Let's worry about that stuff later. There's so much work to be done here on just representing things with a DAG. Once that's done, it will be easier to address extensions.
Though I do think it is worth keeping the organization of AbstractTables vis a vis this package in the back of our minds. We may want to represent the notion of a tbl_lazy
in the type hierarchy, and that will require some figuring out.
@datnamer I'm not quite sure I understand the suggestion, and I'd like to understand it even though I agree it probably won't be relevant until some time later.
@davidagold - You pretty much got it there.
Following up on Part 2, to answer:
class(flights_sqlite)
[1] "tbl_sqlite" "tbl_sql" "tbl_lazy" "tbl"(TODO: understand how flights_sqlite comes to have this class structure.)
by tracing the sequence of calls from
> my_db <- src_sqlite("my_db.sqlite3", create = TRUE)
> library(nycflights13)
> flights_sqlite <- copy_to(my_db, flights, temporary = FALSE, indexes = c("year", "month", "day"), "carrier", "tailnum"))
src_sqlite()
Examining src_sqlite
:
src_sqlite <- function(path, create = FALSE) {
if (!requireNamespace("RSQLite", quietly = TRUE)) {
stop("RSQLite package required to connect to sqlite db", call. = FALSE)
}
if (!create && !file.exists(path)) {
stop("Path does not exist and create = FALSE", call. = FALSE)
}
con <- DBI::dbConnect(RSQLite::SQLite(), path)
RSQLite::initExtension(con)
src_sql("sqlite", con, path = path)
}
We see that my_db <- src_sqlite("my_db.sqlite3", create=TRUE)
initiates a connection to a database, before creating a src_sql
object:
src_sql <- function(subclass, con, ...) {
subclass <- paste0("src_", subclass)
structure(list(con = con, ...), class = c(subclass, "src_sql", "src"))
}
under the subclass src_sqlite
:
> str(my_db)
List of 2
$ con :Formal class 'SQLiteConnection' [package "RSQLite"] with 5 slots
.. ..@ Id :<externalptr>
.. ..@ dbname : chr "my_db.sqlite3"
.. ..@ loadable.extensions: logi TRUE
.. ..@ flags : int 6
.. ..@ vfs : chr ""
$ path: chr "my_db.sqlite3"
- attr(*, "class")= chr [1:3] "src_sqlite" "src_sql" "src"
copy_to()
We now trace the sequence of calls in
flights_sqlite <- copy_to(my_db, flights, temporary = FALSE, indexes = c("year", "month", "day"), "carrier", "tailnum"))
We begin with
copy_to.src_sql <- function(dest, df, name = deparse(substitute(df)),
types = NULL, temporary = TRUE,
unique_indexes = NULL, indexes = NULL,
analyze = TRUE, ...) {
assert_that(is.data.frame(df), is.string(name), is.flag(temporary))
class(df) <- "data.frame" # avoid S4 dispatch problem in dbSendPreparedQuery
if (isTRUE(db_has_table(dest$con, name))) {
stop("Table ", name, " already exists.", call. = FALSE)
}
types <- types %||% db_data_type(dest$con, df)
names(types) <- names(df)
con <- dest$con
db_begin(con)
on.exit(db_rollback(con))
db_create_table(con, name, types, temporary = temporary)
db_insert_into(con, name, df)
db_create_indexes(con, name, unique_indexes, unique = TRUE)
db_create_indexes(con, name, indexes, unique = FALSE)
if (analyze) db_analyze(con, name)
db_commit(con)
on.exit(NULL)
tbl(dest, name)
}
Brief aside:
deparse(substitute(โ ))
does the expected:> name <- deparse(substitute(flights)) > name [1] "flights"
Here's the cascade of calls from tbl(dest, name)
:
tbl.src_sqlite <- function(src, from, ...) {
tbl_sql("sqlite", src = src, from = from, ...)
}
tbl_sql <- function(subclass, src, from, ..., vars = attr(from, "vars")) {
make_tbl(
c(subclass, "sql", "lazy"),
src = src,
ops = op_base_remote(src, from, vars)
)
}
make_tbl <- function(subclass, ...) {
subclass <- paste0("tbl_", subclass)
structure(list(...), class = c(subclass, "tbl"))
}
which explains the hierarchy of classes "tbl_sqlite" "tbl_sql" "tbl_lazy" "tbl"
.
op_base_remote()
Let's wrap up the exploration with an examination of op_base_remote
:
op_base_remote <- function(src, x, vars = NULL) {
# If not literal sql, must be a table identifier
if (!is.sql(x)) {
x <- ident(x)
}
if (is.null(vars)) {
vars <- db_query_fields(src$con, x)
}
op_base("remote", src, x, vars)
}
where
db_query_fields()
should return a character vector giving the field names of a query,- db_query_fields.SQLiteConnection uses build_sql rather than paste0 (see dplyr/issue 926), and
- source:
op_base <- function(name, src, x, vars) {
stopifnot(is.character(vars))
structure(
list(
src = src,
x = x,
vars = vars
),
class = c(paste0("op_base_", name), "op_base", "op")
)
}
Going forward, I think we should maybe do one/two more case studies on
sql_build
(which is generic over the lazy operations, lazy_ops, and generates an S3 object that represents the query.) andsql_render
(which takes a query object and then calls a function that is generic over the database.)
For example, sql_build.op_mutate
generates a select_query
, and sql_render.select_query
calls sql_select
, which has different methods for different databases. I have been accumulating some notes, and might be able to free up some time to help work on a basic DAG -> SQL query string.
But before then, I think the final point of this exercise might be to decide whether to build up an (intermediate) "query object", that goes like
DAG ---
sql_build()
---> "query object" ---sql_render()
---> SQL string
and the factors that goes into such a decision.
Relevant reading(s):
Maybe it would be easiest to contact hadley himself and have a chat? I bet he would be willing to give some advice. If that ever happens, please record, because that would be a fascinating discussion to watch.