tidyverse/duckplyr

[Question] Best Practices for changing DuckDB Settings

Closed this issue · 5 comments

Hi all, thanks for developing such a fantastic package!

I frequently have to change configuration settings for DuckDB in order to get my queries to run, and am writing to ask: is there an established best practice for configuring DuckDB using duckplyr? So far, I have been changing the configuration settings using the DBI interface as follows:

con <- duckplyr:::get_default_duckdb_connection()
DBI::dbExecute(con, "SET preserve_insertion_order = false;")
DBI::dbExecute(con, "SET memory_limit = '100GB';")

I'm sure that this isn't ideal, as it relies on an unexported function, get_default_duckdb_connection(). I have also tried using a .duckdbrc file, but I was not confident that the settings were actually being applied (and it seemed less ergonomic to me). Is there a recommended way to change DuckDB settings with duckplyr?
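
For what it's worth, here is how I have been checking whether a setting actually takes effect. This is only a sketch; it reuses the same unexported connection together with DuckDB's current_setting() function:

con <- duckplyr:::get_default_duckdb_connection()
DBI::dbExecute(con, "SET memory_limit = '100GB';")
# current_setting() reports the active value of a DuckDB configuration option
DBI::dbGetQuery(con, "SELECT current_setting('memory_limit') AS memory_limit;")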

Best, and thanks for your help,
Ben

Thanks, Ben. I'm curious, what does preserve_insertion_order achieve for your use case?

Do we only need to set options? Should duckplyr support an R option that sets these DuckDB options at connect time?

Setting preserve_insertion_order to false allows me to import and export larger datasets to Parquet, as described at the bottom of this page. Sometimes, depending on the hardware I am using, I also need to set a memory limit or change the number of cores that can be used.

I am a novice DuckDB / duckplyr user, so I am not sure I can provide a very informed recommendation on the second point. I personally only need to set DuckDB options, and it would be very helpful to be able to set them through R so that they are applied at connect time, as you suggest.
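
For reference, I believe the duckdb R package itself accepts a config list at connect time, so something along these lines works outside duckplyr. This is just a sketch of what "set at connect" could look like, not duckplyr's API:

con <- DBI::dbConnect(
  duckdb::duckdb(),
  # Configuration flags are applied when the connection is created
  config = list(
    preserve_insertion_order = "false",
    memory_limit = "100GB"
  )
)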

Thanks, Ben. I'd say that for now, it's fine to use the unexported function while we're collecting use cases. I can see that there should be a better API for this at some point.
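
In the meantime, a small wrapper keeps the unexported call in one place. A rough sketch (the helper name and the example settings are only illustrations):

duckdb_set <- function(name, value) {
  con <- duckplyr:::get_default_duckdb_connection()
  # Builds and runs e.g. SET memory_limit = '100GB' on duckplyr's default connection
  DBI::dbExecute(con, paste0("SET ", name, " = ", DBI::dbQuoteLiteral(con, value)))
  invisible(con)
}
duckdb_set("memory_limit", "100GB")
duckdb_set("preserve_insertion_order", FALSE)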

Fantastic, thank you for your help!

@beniaminogreen thanks for your tip.
@krlmlr my use case is setting the number of threads for benchmarking (related: #61); here is how I do it now:

# Set the number of threads on duckplyr's default DuckDB connection
duckdb_set_threads <- \(threads) {
  conn <- duckplyr:::get_default_duckdb_connection()
  DBI::dbExecute(conn, paste0("PRAGMA threads='", threads, "'"))
  invisible()
}
duckdb_set_threads(8)
library(dplyr)

df <- duckplyr::duckplyr_df_from_csv(file_name) |>
  summarize(
    .by = state,
    state_min = min(measurement),
    state_max = max(measurement),
    state_sum = sum(measurement),
    state_n = n()
  ) |>
  mutate(state_mean = state_sum / state_n) |>
  select(state, state_min, state_mean, state_max) |>
  collect()