[Question] Best Practices for changing DuckDB Settings
Closed this issue · 5 comments
Hi all, thanks for developing such a fantastic package!
I frequently have to change configuration settings for DuckDB in order to get my queries to run, and am writing to ask: is there an established best practice for configuring DuckDB using duckplyr
? So far, I have been changing the configuration settings using the DBI interface as follows:
con <- duckplyr:::get_default_duckdb_connection()
dbExecute(con, "SET preserve_insertion_order = false;")
dbExecute(con, "SET memory_limit ='100GB';")
I'm sure that this isn't ideal as it relied on using an unexported function, get_default_duckdb_connection
. I have also tried using the .duckdbrc
file, but I was not confident that the settings were being properly applied (and it seemed less ergonomic to me). Is there a recommended way to change DuckDB settings using duckplyr
?
Best, and thanks for your help,
Ben
Thanks, Ben. I'm curious, what does preserve_insertion_order
achieve for your use case?
Do we only need to set options? Should duckplyr support an R option that set these duckdb options at connect?
Setting preserve_insertion_order
to false allows me to import and export larger datasets to parquet, as is described at the bottom of this page. Sometimes, depending on the hardware I am using, I also need to set a memory limit or change the number of cores that can be used.
I am a novice DuckDB / duckplyr user so I am not sure that I can provide a very informed recommendation on the second point. I personally only have a need to be able to set DuckDB options, and it would be very helpful for me to be able to set through R so that they are applied at connect, as you suggest.
Thanks, Ben. I'd say that for now, it's fine to use the unexported function while we're collecting use cases. I see there should be a better API for this at some point.
Fantastic, thank you for your help!
@beniaminogreen thanks for your tip.
@krlmlr my use case is related to setting threads for benchmarking (related #61), here is how I do it now:
duckdb_set_threads <- \(threads) {
conn <- duckplyr:::get_default_duckdb_connection()
dbExecute(conn = conn, paste0("PRAGMA threads='", threads, "'"))
invisible()
}
duckdb_set_threads(8)
df <- duckplyr::duckplyr_df_from_csv(file_name) |>
summarize(
.by = state,
state_min = min(measurement),
state_max = max(measurement),
state_sum = sum(measurement),
state_n = n()
) |>
mutate(state_mean = state_sum / state_n) |>
select(state, state_min, state_mean, state_max) |>
collect()