davidagold/StructuredQueries.jl

Study of dplyr SQL generation

Opened this issue · 7 comments

The purpose of this issue is to provide a description of the internals of dplyr's SQL generation. I intend it to serve as reference/inspiration for the development of such functionality in this package.

We first create a local SQLite3 database according to the vignette:

> my_db <- src_sqlite("my_db.sqlite3", create = TRUE)
> library(nycflights13)
> flights_sqlite <- copy_to(my_db, flights, temporary = FALSE, indexes = list(c("year", "month", "day"), "carrier", "tailnum"))

(We are not concerned with the generation of indices at this point.)

It is worth examining the structure of the flights_sqlite object:

> attributes(flights_sqlite)
$names
[1] "src" "ops"

$class
[1] "tbl_sqlite" "tbl_sql"    "tbl_lazy"   "tbl"

> str(flights_sqlite)
List of 2
 $ src:List of 2
  ..$ con :Formal class 'SQLiteConnection' [package "RSQLite"] with 5 slots
  .. .. ..@ Id                 :<externalptr> 
  .. .. ..@ dbname             : chr "my_db.sqlite3"
  .. .. ..@ loadable.extensions: logi TRUE
  .. .. ..@ flags              : int 6
  .. .. ..@ vfs                : chr ""
  ..$ path: chr "my_db.sqlite3"
  ..- attr(*, "class")= chr [1:3] "src_sqlite" "src_sql" "src"
 $ ops:List of 3
  ..$ src :List of 2
  .. ..$ con :Formal class 'SQLiteConnection' [package "RSQLite"] with 5 slots
  .. .. .. ..@ Id                 :<externalptr> 
  .. .. .. ..@ dbname             : chr "my_db.sqlite3"
  .. .. .. ..@ loadable.extensions: logi TRUE
  .. .. .. ..@ flags              : int 6
  .. .. .. ..@ vfs                : chr ""
  .. ..$ path: chr "my_db.sqlite3"
  .. ..- attr(*, "class")= chr [1:3] "src_sqlite" "src_sql" "src"
  ..$ x   :Classes 'ident', 'sql', 'character'  chr "flights"
  ..$ vars: chr [1:19] "year" "month" "day" "dep_time" ...
  ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
 - attr(*, "class")= chr [1:4] "tbl_sqlite" "tbl_sql" "tbl_lazy" "tbl"

We will see that the data manipulation commands provided by dplyr dispatch on the tbl_lazy class in flights_sqlite's class structure and are accumulated in its ops field.

For instance, let's trace the cascade of functions resulting from a call to

c1 <- filter(flights_sqlite, year == 2013, month == 1, day == 1)

The adventure starts in R/manip.R:

filter <- function(.data, ...) {
  filter_(.data, .dots = lazyeval::lazy_dots(...))
}

The filtering expressions (in this case year == 2013, month == 1, day == 1) are passed to lazyeval, which produces the following lazy_dots object

> .dots <- lazyeval::lazy_dots(year == 2013, month == 1, day == 1)
> str(.dots)
List of 3
 $ :List of 2
  ..$ expr: language year == 2013
  ..$ env :<environment: R_GlobalEnv> 
  ..- attr(*, "class")= chr "lazy"
 $ :List of 2
  ..$ expr: language month == 1
  ..$ env :<environment: R_GlobalEnv> 
  ..- attr(*, "class")= chr "lazy"
 $ :List of 2
  ..$ expr: language day == 1
  ..$ env :<environment: R_GlobalEnv> 
  ..- attr(*, "class")= chr "lazy"
 - attr(*, "class")= chr "lazy_dots"

which encodes the information (i.e. the unevaluated filtering expressions) passed to the original filter command. The original data source, i.e. flights_sqlite, and .dots are then passed to the standard-evaluation filter_ generic.
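To make the laziness concrete, here is a toy, base-R analogue of lazyeval::lazy_dots (lazy_ish is a made-up name and this is only a sketch of the idea, not lazyeval's implementation): capture each unevaluated expression together with the environment it should eventually be evaluated in.

lazy_ish <- function(...) {
  exprs <- as.list(substitute(list(...)))[-1]  # the unevaluated expressions
  env <- parent.frame()                        # where they should eventually be evaluated
  lapply(exprs, function(e) structure(list(expr = e, env = env), class = "lazy"))
}

dots <- lazy_ish(year == 2013, month == 1)
dots[[1]]$expr                            # the unevaluated call `year == 2013`
eval(dots[[1]]$expr, list(year = 2013))   # TRUE, once a binding for `year` is supplied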

To find the specific method called on (flights_sqlite, .dots), we need to identify the class structure of flights_sqlite:

> class(flights_sqlite)
[1] "tbl_sqlite" "tbl_sql"    "tbl_lazy"   "tbl"  

(TODO: understand how flights_sqlite comes to have this class structure.)
The relevant class, i.e. the one on which filter_ dispatches, turns out to be tbl_lazy, and so we examine the relevant method given in R/tbl_lazy.R:

filter_.tbl_lazy <- function(.data, ..., .dots) {
  dots <- lazyeval::all_dots(.dots, ...)
  add_op_single("filter", .data, dots = dots)
}
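A brief aside on how dispatch lands here: since the method that actually runs is the tbl_lazy one, there is evidently no more specific filter_ method defined for tbl_sqlite or tbl_sql, so S3 dispatch walks the class vector left to right and falls through to filter_.tbl_lazy. A minimal illustration of that fall-through (stand-in names, not dplyr code):

f <- function(x) UseMethod("f")
f.tbl_lazy <- function(x) "dispatched to f.tbl_lazy"

obj <- structure(list(), class = c("tbl_sqlite", "tbl_sql", "tbl_lazy", "tbl"))
f(obj)
# [1] "dispatched to f.tbl_lazy"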

Since the ... argument is empty, the dots object returned by the lazyeval::all_dots call is equivalent to the original .dots argument (i.e., it has the same structure and components).

The purpose of filter_.tbl_lazy is ultimately to collect the information about the filtering expressions and store it in an op object, which is in turn stored in the ops field of the original tbl_lazy data source, i.e. flights_sqlite. This takes us to the implementation of add_op_single in the land of R/lazy_ops.R:

add_op_single <- function(name, .data, dots = list(), args = list()) {
  .data$ops <- op_single(name, x = .data$ops, dots = dots, args = args)
  .data
}

(We note that the ops field of flights_sqlite is initialized with an op_base_remote object.)

op_single in turn returns an op object whose most specific class is determined by the manipulation command from which the call to add_op_single originated:

op_single <- function(name, x, dots = list(), args = list()) {
  structure(
    list(
      name = name,
      x = x,
      dots = dots,
      args = args
    ),
    class = c(paste0("op_", name), "op_single", "op")
  )
}

Thus we find that add_op_single (the "single" refers to the fact that filter is a single-table verb) modifies the ops field of the .data argument, i.e. flights_sqlite. In particular, since flights_sqlite$ops is passed as the x argument to op_single, the resultant op_filter object "points to" the original op_base_remote object initialized at creation of flights_sqlite. In this way dplyr lazily accumulates manipulation commands applied to remote data sources in a graph structure that can then be analyzed, flattened, and translated into SQL.
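To picture the accumulation, here is a schematic (with made-up literal values rather than actual dplyr objects) of how two successive verbs nest their op objects around the base op:

base  <- structure(list(x = "flights"),
                   class = c("op_base_remote", "op_base", "op"))
step1 <- structure(list(name = "filter", x = base),
                   class = c("op_filter", "op_single", "op"))
step2 <- structure(list(name = "select", x = step1),
                   class = c("op_select", "op_single", "op"))

class(step2$x$x)
# [1] "op_base_remote" "op_base"        "op"

Each op points to the op produced by the previous verb via its x field, so walking x repeatedly recovers the whole chain down to the base table; this is the structure that later gets flattened and translated into SQL.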

Indeed, if we examine the ops field of c1, we see that its value is an op_filter object that contains both the original op_base_remote object and all the requisite information about the filtering expressions passed to filter above:

> c1$ops
Source: sqlite 3.8.6 [my_db.sqlite3]
From: flights
<Table: flights>
-> filter()
   - year == 2013
   - month == 1
   - day == 1
> str(c1$ops)
List of 4
 $ name: chr "filter"
 $ x   :List of 3
  ..$ src :List of 2
  .. ..$ con :Formal class 'SQLiteConnection' [package "RSQLite"] with 5 slots
  .. .. .. ..@ Id                 :<externalptr> 
  .. .. .. ..@ dbname             : chr "my_db.sqlite3"
  .. .. .. ..@ loadable.extensions: logi TRUE
  .. .. .. ..@ flags              : int 6
  .. .. .. ..@ vfs                : chr ""
  .. ..$ path: chr "my_db.sqlite3"
  .. ..- attr(*, "class")= chr [1:3] "src_sqlite" "src_sql" "src"
  ..$ x   :Classes 'ident', 'sql', 'character'  chr "flights"
  ..$ vars: chr [1:19] "year" "month" "day" "dep_time" ...
  ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
 $ dots:List of 3
  ..$ :List of 2
  .. ..$ expr: language year == 2013
  .. ..$ env :<environment: R_GlobalEnv> 
  .. ..- attr(*, "class")= chr "lazy"
  ..$ :List of 2
  .. ..$ expr: language month == 1
  .. ..$ env :<environment: R_GlobalEnv> 
  .. ..- attr(*, "class")= chr "lazy"
  ..$ :List of 2
  .. ..$ expr: language day == 1
  .. ..$ env :<environment: R_GlobalEnv> 
  .. ..- attr(*, "class")= chr "lazy"
  ..- attr(*, "class")= chr "lazy_dots"
 $ args: list()
 - attr(*, "class")= chr [1:3] "op_filter" "op_single" "op"

To be continued.

What about this going into the abstract tables interface instead? Then Jplyr could just call those generic methods. That would be more consistent with data frames, and databases with specific SQL dialects could just implement the interface themselves.

Though it seems to require a DAG/lazy evaluation, which brings us back to this package.

What about this going into the abstract tables interface instead?

Let's worry about that stuff later. There's so much work to be done here on just representing things with a DAG. Once that's done, it will be easier to address extensions.

Though I do think it is worth keeping the organization of AbstractTables vis-à-vis this package in the back of our minds. We may want to represent the notion of a tbl_lazy in the type hierarchy, and that will require some figuring out.

@datnamer I'm not quite sure I understand the suggestion, and I'd like to understand it even though I agree it probably won't be relevant until some time later.

@davidagold - You pretty much got it there.

Following up on Part 2, to answer:

class(flights_sqlite)
[1] "tbl_sqlite" "tbl_sql" "tbl_lazy" "tbl"

(TODO: understand how flights_sqlite comes to have this class structure.)

by tracing the sequence of calls from

> my_db <- src_sqlite("my_db.sqlite3", create = TRUE)
> library(nycflights13)
> flights_sqlite <- copy_to(my_db, flights, temporary = FALSE, indexes = list(c("year", "month", "day"), "carrier", "tailnum"))

src_sqlite()

Examining src_sqlite:

src_sqlite <- function(path, create = FALSE) {
  if (!requireNamespace("RSQLite", quietly = TRUE)) {
    stop("RSQLite package required to connect to sqlite db", call. = FALSE)
  }
  if (!create && !file.exists(path)) {
    stop("Path does not exist and create = FALSE", call. = FALSE)
  }
  con <- DBI::dbConnect(RSQLite::SQLite(), path)
  RSQLite::initExtension(con)
  src_sql("sqlite", con, path = path)
}

We see that my_db <- src_sqlite("my_db.sqlite3", create = TRUE) opens a connection to the database before creating a src_sql object:

src_sql <- function(subclass, con, ...) {
  subclass <- paste0("src_", subclass)
  structure(list(con = con, ...), class = c(subclass, "src_sql", "src"))
}

under the subclass src_sqlite:

> str(my_db)
List of 2
 $ con :Formal class 'SQLiteConnection' [package "RSQLite"] with 5 slots
  .. ..@ Id                 :<externalptr>
  .. ..@ dbname             : chr "my_db.sqlite3"
  .. ..@ loadable.extensions: logi TRUE
  .. ..@ flags              : int 6
  .. ..@ vfs                : chr ""
 $ path: chr "my_db.sqlite3"
 - attr(*, "class")= chr [1:3] "src_sqlite" "src_sql" "src"

copy_to()

We now trace the sequence of calls in

flights_sqlite <- copy_to(my_db, flights, temporary = FALSE, indexes = list(c("year", "month", "day"), "carrier", "tailnum"))

We begin with

copy_to.src_sql <- function(dest, df, name = deparse(substitute(df)),
                            types = NULL, temporary = TRUE,
                            unique_indexes = NULL, indexes = NULL,
                            analyze = TRUE, ...) {
  assert_that(is.data.frame(df), is.string(name), is.flag(temporary))
  class(df) <- "data.frame" # avoid S4 dispatch problem in dbSendPreparedQuery

  if (isTRUE(db_has_table(dest$con, name))) {
    stop("Table ", name, " already exists.", call. = FALSE)
  }

  types <- types %||% db_data_type(dest$con, df)
  names(types) <- names(df)

  con <- dest$con
  db_begin(con)
  on.exit(db_rollback(con))

  db_create_table(con, name, types, temporary = temporary)
  db_insert_into(con, name, df)
  db_create_indexes(con, name, unique_indexes, unique = TRUE)
  db_create_indexes(con, name, indexes, unique = FALSE)
  if (analyze) db_analyze(con, name)

  db_commit(con)
  on.exit(NULL)

  tbl(dest, name)
}
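One idiom in copy_to.src_sql worth calling out is the on.exit(db_rollback(con)) / on.exit(NULL) pair: if any of the db_* calls fails, the registered exit handler rolls the transaction back; once db_commit succeeds, on.exit(NULL) clears the handler so no rollback happens. In isolation (with a message standing in for the database calls):

do_transaction <- function() {
  on.exit(message("rolling back"))   # runs if anything below throws an error
  # ... create table, insert rows, build indexes, commit ...
  on.exit(NULL)                      # success: cancel the rollback handler
}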

Brief aside: deparse(substitute(·)) does the expected:

> name <- deparse(substitute(flights))
> name
[1] "flights"

Here's the cascade of calls from tbl(dest, name):

source:

tbl.src_sqlite <- function(src, from, ...) {
  tbl_sql("sqlite", src = src, from = from, ...)
}

source:

tbl_sql <- function(subclass, src, from, ..., vars = attr(from, "vars")) {
  make_tbl(
    c(subclass, "sql", "lazy"),
    src = src,
    ops = op_base_remote(src, from, vars)
  )
}

source:

make_tbl <- function(subclass, ...) {
  subclass <- paste0("tbl_", subclass)
  structure(list(...), class = c(subclass, "tbl"))
}

which explains the hierarchy of classes "tbl_sqlite" "tbl_sql" "tbl_lazy" "tbl".
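Concretely, the class vector is assembled by the paste0 and c calls in tbl_sql and make_tbl:

subclass <- paste0("tbl_", c("sqlite", "sql", "lazy"))
subclass
# [1] "tbl_sqlite" "tbl_sql"    "tbl_lazy"
c(subclass, "tbl")
# [1] "tbl_sqlite" "tbl_sql"    "tbl_lazy"   "tbl"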

op_base_remote()

Let's wrap up the exploration with an examination of op_base_remote:

op_base_remote <- function(src, x, vars = NULL) {
  # If not literal sql, must be a table identifier
  if (!is.sql(x)) {
    x <- ident(x)
  }
  if (is.null(vars)) {
    vars <- db_query_fields(src$con, x)
  }
  op_base("remote", src, x, vars)
}

where

op_base <- function(name, src, x, vars) {
  stopifnot(is.character(vars))
  structure(
    list(
      src = src,
      x = x,
      vars = vars
    ),
    class = c(paste0("op_base_", name), "op_base", "op")
  )
}
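One small note on the x slot: per the str output earlier, the plain table name "flights" gets wrapped by ident() into an object with classes "ident", "sql", "character", presumably so that table identifiers can be escaped differently from literal SQL when a query is eventually rendered. For instance (assuming the dplyr version under study exports ident):

x <- dplyr::ident("flights")
class(x)
# [1] "ident"     "sql"       "character"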

Going forward, I think we should maybe do one or two more case studies on

  • sql_build (which is generic over the lazy operations, lazy_ops, and generates an S3 object that represents the query), and
  • sql_render (which takes a query object and then calls a function that is generic over the database).

For example, sql_build.op_mutate generates a select_query, and sql_render.select_query calls sql_select, which has different methods for different databases. I have been accumulating some notes, and might be able to free up some time to help work on a basic DAG -> SQL query-string translation.

But before then, I think the final point of this exercise might be to decide whether to build up an intermediate "query object", i.e. a pipeline that goes like

DAG ---sql_build()---> "query object" ---sql_render()---> SQL string

and to weigh the factors that go into such a decision.
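To make the trade-off concrete, here is a deliberately naive sketch, with hypothetical build_query/render_query helpers (this is not dplyr's sql_build/sql_render), of what the two stages could look like for the op_filter chain built above:

# Stage 1: flatten the op chain into a plain "query object".
build_query <- function(op) {
  list(
    from  = as.character(op$x$x),   # the base table identifier
    where = vapply(op$dots, function(d) paste(deparse(d$expr), collapse = " "),
                   character(1))
  )
}

# Stage 2: render the query object as a SQL string. (A real renderer would also
# translate R operators, e.g. `==` to `=`, as dplyr's translate_sql does.)
render_query <- function(query) {
  sql <- paste0("SELECT * FROM ", query$from)
  if (length(query$where) > 0) {
    sql <- paste0(sql, " WHERE ", paste(query$where, collapse = " AND "))
  }
  sql
}

render_query(build_query(c1$ops))
# "SELECT * FROM flights WHERE year == 2013 AND month == 1 AND day == 1"

The question is whether such an intermediate query object (dplyr's select_query and friends) earns its keep by letting the rendering step be specialized per database dialect without touching the DAG-flattening step.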

Relevant reading(s):

Maybe it would be easiest to contact Hadley himself and have a chat? I bet he would be willing to give some advice. If that ever happens, please record it, because that would be a fascinating discussion to watch.