duckdblabs/duckplyr

Copying the data frame to disk

Closed this issue · 8 comments

My impression at the moment is that if you use as_duckplyr_df(), then duckdb is actually processing the data frame in R's allocated memory. This may or may not be correct.

What I want to do is first save the data frame to disk in an efficient format (hopefully the DuckDB binary format, but parquet would work as well), and then re-open the data frame for querying on disk. Is there currently a way to do this, or a plan for how to do this with duckplyr?

Do we need to add more to our README? Happy to review a PR!


See ?duckplyr_df_from_file; this keeps the file on disk. The object returned is a data frame, but the data will be read only if you request column values or the number of rows. You can try with a large Parquet file: duckplyr_df_from_file() is instantaneous, but querying the data for the first time will take time. (In the RStudio IDE, you may need to upgrade to the latest version or set "Manual refresh only" in the Environment pane.)
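For example (a minimal sketch: flights.parquet is just an illustrative file name, and "read_parquet" is the DuckDB table function assumed to scan it; see ?duckplyr_df_from_file for the exact arguments):

library(duckplyr)

# Returns immediately: the Parquet file stays on disk and is not read yet
flights <- duckplyr_df_from_file("flights.parquet", "read_parquet")

# Requesting column values or the number of rows triggers the actual read
nrow(flights)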


When such a data frame is processed with dplyr, the processing happens in DuckDB and uses DuckDB memory. Only when a result is collected (either explicitly, or because the code requests an operation that can't be run in DuckDB yet) is the data materialized as a data frame.

Materialization also emits a message by default, see ?config.
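Continuing the sketch above, a pipeline might look like this (the column names are made up, and collect() is used here to force the materialization described above):

library(dplyr)

# These verbs are translated to DuckDB; no data is pulled into R yet
result <- flights |>
  filter(dep_delay > 60) |>
  select(origin, dest, dep_delay)

# Materialization happens here and emits a message by default (see ?config)
collect(result)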

Thanks for the reply. I understand that duckplyr_df_from_file() reads from disk, and I think the docs for that are fine. My use case was taking a standard in-memory R data frame, persisting it to disk, and then re-opening it as a duckplyr data frame. I guess it would be a bit rare to do this, though, because if the data frame already fits in memory, there's no reason to put it on disk. Generally speaking, I'm looking for a save implementation such as #87, but supporting the DuckDB binary format as well as Parquet.

Parquet is the safest option for now. The DuckDB database that duckplyr uses is currently an opaque implementation detail, and supporting export to a different connection/database file is not straightforward.
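Until something like that lands in duckplyr, one way to persist an in-memory data frame to Parquet today is plain DBI + duckdb (this is not a duckplyr API; the view and file names below are only illustrative):

library(duckdb)

con <- dbConnect(duckdb())
# Expose the data frame to DuckDB without copying it into the database
duckdb_register(con, "my_df_view", my_df)
# Write it out as Parquet; the file can later be reopened lazily with
# duckplyr_df_from_file()
dbExecute(con, "COPY (SELECT * FROM my_df_view) TO 'my_df.parquet' (FORMAT PARQUET)")
dbDisconnect(con, shutdown = TRUE)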

@Tmonster: please correct me if I'm misrepresenting.

@krlmlr you are right, although creating a duckdb database for storing the data frame contents should be fairly straightforward.

library(duckdb)
persistent_con <- dbConnect(duckdb(), dbdir = "persisted_table.db")
dbWriteTable(persistent_con, "my_persistent_table", my_df)

I haven't tested this though, so unsure if it really is this easy.
The file persisted_table.db can then be opened with DuckDB in any other client (CLI, Python, etc.).
You can find more details in the duckdb R API docs.
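Reopening the file from R should then look roughly like this (same untested caveat as above):

library(duckdb)

# read_only avoids lock conflicts if another process also has the file open
persistent_con <- dbConnect(duckdb(), dbdir = "persisted_table.db", read_only = TRUE)
dbReadTable(persistent_con, "my_persistent_table")
dbDisconnect(persistent_con, shutdown = TRUE)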

I'm also working on a to_parquet() function for duckplyr.

How can we pour the results of a query specified by a relational object into a new table (possibly in a different database connection)? Do we need rel_to_table()?

I think in this case we would need a rel_to_table() function. I imagine this is for the case where a df cannot fit in memory?

If not, we could also convert the relation object to a normal data frame and then use dbWriteTable(). If we do that, we can write to different databases in the following (slightly hacky) way.

library(duckdb)

con <- dbConnect(duckdb(), dbdir = "db1.db")
# dbExecute() runs statements that return no result set
dbExecute(con, "ATTACH 'db2.db'")
dbExecute(con, "USE db2;")
dbWriteTable(con, "cars", mtcars)
dbExecute(con, "USE db1;")
dbWriteTable(con, "cars", mtcars)

You can then verify that both databases have a copy of cars by opening them individually. Unfortunately, I don't think the relational API has a way to switch databases yet.
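For example, after disconnecting the connection above (sketch only):

library(duckdb)

con1 <- dbConnect(duckdb(), dbdir = "db1.db", read_only = TRUE)
con2 <- dbConnect(duckdb(), dbdir = "db2.db", read_only = TRUE)
dbListTables(con1)  # contains "cars"
dbListTables(con2)  # contains "cars"
dbDisconnect(con1, shutdown = TRUE)
dbDisconnect(con2, shutdown = TRUE)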

A relational way to switch databases is potentially also something I can add to the duckdb-r client.

Most importantly, this would save the roundtrip through R memory. A rel_to_table() would be very much appreciated.

For writing to other databases, I suspect we can always attach them to our main database and write to a schema? Would that work?

> Most importantly, this would save the roundtrip through R memory. A rel_to_table() would be very much appreciated.

Ah, I agree that would be helpful. I can look into this.

> For writing to other databases, I suspect we can always attach them to our main database and write to a schema? Would that work?

Yes, this will work. I can probably write a rel_to_table() function that writes a relation to a table qualified by a user-provided string, which is either table, schema.table, or db.schema.table.
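Until rel_to_table() exists, the fully qualified form is already expressible in plain SQL over DBI (a rough sketch; db2 and the table names below are illustrative, and the data frame is registered first so DuckDB can see it):

library(duckdb)

con <- dbConnect(duckdb(), dbdir = "db1.db")
dbExecute(con, "ATTACH 'db2.db'")
duckdb_register(con, "mtcars_view", mtcars)
# db.schema.table qualification writes into the attached database
# without switching the default database via USE
dbExecute(con, "CREATE TABLE db2.main.cars AS SELECT * FROM mtcars_view")
dbDisconnect(con, shutdown = TRUE)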