duckplyr with remote tables in a duckdb connection
Opened this issue · 6 comments
Hello,
Trying out duckplyr with a remote table results in the following error. Is duckplyr
compatible with remote tables in a duckdb connection? Not sure if the idea is to use dbplyr
when working with remote tables, and duckplyr
when working with data in memory. Feedback much appreciated.
Thank you.
-Ed
Error:
con = DBI::dbConnect(duckdb::duckdb(), "foo.db") #db on disk
DBI::dbWriteTable(con, name = "iris", value = iris)
dplyr::tbl(con, "iris") |> duckplyr::as_duckplyr_df()
Error in `duckplyr::as_duckplyr_df()`:
! Must pass a plain data frame or a tibble to `as_duckplyr_df()`.
Traceback:
1. duckplyr::as_duckplyr_df(dplyr::tbl(con, "iris"))
2. abort("Must pass a plain data frame or a tibble to `as_duckplyr_df()`.")
3. signal_abort(cnd, .file)
Thanks, good question. I see two components:
- Use in a new project
- Use in an existing project
New project
No need to bother with connections. Start with data frames, use as_duckplyr_df()
, from that point on the queries will be processed by duckdb where possible.
Internally, duckplyr uses a DBI connection to duckdb, but this is not meant to be accessed by the user. There is currently no way to specify the location of the database file for this internal connection. Do you think we need an option for this to avoid keeping everything in memory?
Existing project
Because of the internal DBI connection, it is difficult to mix dbplyr code and duckplyr code. I wonder how to make this more seamless. Ideally, as_duckplyr_df()
would work for dbplyr tbl objects.
Sketch (with a dummy relational object and unexported functions):
con <- DBI::dbConnect(duckdb::duckdb(), "foo.db") # db on disk
DBI::dbWriteTable(con, name = "iris", value = iris)
tbl <-
dplyr::tbl(con, "iris") |>
dplyr::filter(Petal.Length <= 1.2)
tbl |> dplyr::show_query()
#> <SQL>
#> SELECT iris.*
#> FROM iris
#> WHERE ("Petal.Length" <= 1.2)
# Dummy rel object
rel <- duckdb:::rel_from_df(con, data.frame(a = integer()))
duckdb:::rel_sql(rel, dbplyr::sql_render(tbl))
#> # A tibble: 4 × 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 4.3 3 1.1 0.1 setosa
#> 2 5.8 4 1.2 0.2 setosa
#> 3 4.6 3.6 1 0.2 setosa
#> 4 5 3.2 1.2 0.2 setosa
Created on 2023-11-20 with reprex v2.0.2
Hello @krlmlr,
Do you think we need an option for this to avoid keeping everything in memory?
Yes I think this would be helpful.
Perhaps an optional con
param in as_duckplyr_df()
could do the trick? In the event that a user wants to use a duckplyr_df
on disk, a con
of type duckdb_connection
could be passed. Also, to allow multiple duckplyr_df
in the same con
, it may also make sense to have a name
param to name each in-memory object for their remote table counterpart. I believe name
, con
follow the conventions in DBI
.
We allow multiple duckplyr_df
objects in the same con already.
The opposite may be a bit trickier. @Tmonster: are there any obstacles combining multiple relational objects that were created from different connections, e.g., with joins?
I was thinking about an option to configure the default connection too, but passing the connection object may be the easiest.
I'm a bit confused. For my clarity, remote means a duckdb table in a different duckdb database file right? Or connection I guess? Since duckplyr maintains it's own connection to a duckdb database?
I'll have to look into it, but combining relational objects from two different duckdb connections might be difficult.
I think it might be easier to integrate the attach/detach functionality that duckdb has.
If a user has other existing duckdb database files and they want to use duckplyr functionality without calling as.data.frame()
, then we could expose some kind of duckplyr_attach_db()
function.
Would this work?
If we can't mix and match relational objects from different connections, we should check that they are the same for joins and other operations.
We'll also take a look into connections and database storage modes.
For joins we already check if they are the same. See join_relation.cpp
in src/main/relation/join_relation.cpp
in the duckdb main code