duckdblabs/duckplyr

Allow duckplyr to query tables in duckdb databases without intermediate materialization


Potentially related to #86. Feel free to close if it's a duplicate.

If a user has a duckdb database with tables that can be as large as 20 GB, it would be useful to query those tables in duckplyr without any intermediate materialization. I've been trying to get this working with some slick workarounds but keep encountering errors. I think the errors are due to there being multiple connections: one in the relational object and another in duckplyr.

Some easy steps to reproduce:

library(duckdb)
library(duckplyr)
library(conflicted)
conflict_prefer("filter", "duckplyr")

# Create an on-disk database with a 5000-row table
con <- DBI::dbConnect(duckdb("test.db"))
dbExecute(con, "create table foo as select range a from range(5000)")
# Wrap the table in a relation, then in a lazy (ALTREP) data frame
rel_foo <- duckdb:::rel_from_table(con, "foo")
altrep_df_foo <- duckdb:::rel_to_altrep(rel_foo)
duckdb:::df_is_materialized(altrep_df_foo)
# FALSE
duckplyr_df_foo <- as_duckplyr_df(altrep_df_foo)
duckplyr_df_foo %>% explain()
# The filter step is where the error is raised
filtered <- duckplyr_df_foo %>% filter(a > 4999)

The error I then get is:

  {"version":"0.3.2","message":"{\"exception_type\":\"Catalog\",\"exception_message\":\"Scalar Function with name >
  does not exist!\\nDid you mean \\\"@>\\\"?\",\"name\":\">\",\"candidates\":\"@>\",\"type\":\"Scalar
  Function\",\"error_subtype\":\"MISSING_ENTRY\"}","name":"filter","x":{"...1":"numeric"},"args":{"dots":{"1":"...1
  > 3990"},"by":"NULL","preserve":false}}

Let me know if there is more I can do from the duckdb side.

One possible solution might be to pass duckplyr the connection you want it to operate on. This could also serve as a way to prevent joins between relations that live in two different connections. The macros could then be added to the passed connection as temporary macros, so they are discarded when the connection is closed. If a user later passes a connection to duckplyr again, duckplyr can simply add the macros again.

Would this work?
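
To make the temporary-macro idea a bit more concrete, here is a rough sketch of what I have in mind. It is only an illustration under assumptions: `register_temp_macros()` is a hypothetical helper (not an existing duckplyr function), and the single `>` macro stands in for whatever set of macros duckplyr actually registers on its own connection.

```r
library(DBI)
library(duckdb)

# Hypothetical helper (not part of duckplyr): registers a macro on the
# connection supplied by the user. TEMPORARY means the macro is not
# persisted in the database file and goes away with the connection.
register_temp_macros <- function(con) {
  # Assumption: one stand-in macro; duckplyr's real macro list is larger
  DBI::dbExecute(con, 'CREATE OR REPLACE TEMPORARY MACRO ">"(x, y) AS x > y')
  invisible(con)
}

# In-memory database, used here only to keep the sketch self-contained
con <- DBI::dbConnect(duckdb::duckdb())
register_temp_macros(con)

# List macros named ">" that are visible on this connection
found <- DBI::dbGetQuery(
  con,
  "SELECT function_name, function_type
   FROM duckdb_functions()
   WHERE function_type = 'macro' AND function_name = '>'"
)

DBI::dbDisconnect(con, shutdown = TRUE)
```

Because `CREATE OR REPLACE` is used, calling the helper again on the same connection should be harmless, which would fit the "pass a connection again and the macros are re-added" flow described above.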