Add a `duckdb_to_parquet` using low level arrow functions
ddotta opened this issue · 3 comments
Just to mention that I wrote an enhanced interface to SQL, available at https://github.com/jllipatz/SQL. It's still a work in progress...
_Originally posted by @jllipatz in #27 (comment)_
Hello,
The `dbSendQuery` / `dbFetch` pair doesn't fit well with duckdb: the query is fully resolved before `dbFetch` is reached, which can overfill the RAM. Here is a solution that works without consuming a lot of RAM. It also runs much faster than simply including a `COPY ... TO` parquet statement in the SQL query. Perhaps it could be the beginning of a new function in {parquetize}, if somebody adds the partitioning options that exist for the other functions.
```r
library(DBI)
library(arrow)

SQL2parquet <- function(query, path, chunk_size = 1e6) {
  con <- dbConnect(duckdb::duckdb())
  on.exit(dbDisconnect(con, shutdown = TRUE), add = TRUE)
  # Stream the query result as Arrow record batches instead of
  # materializing everything at once with dbFetch()
  reader <- duckdb::duckdb_fetch_record_batch(
    dbSendQuery(con, query, arrow = TRUE),
    chunk_size = chunk_size
  )
  file <- FileOutputStream$create(path)
  batch <- reader$read_next_batch()
  if (!is.null(batch)) {
    s <- batch$schema
    writer <- ParquetFileWriter$create(
      s, file,
      properties = ParquetWriterProperties$create(names(s))
    )
    i <- 0
    while (!is.null(batch)) {
      i <- i + 1
      message(sprintf("%d, %d rows", i, nrow(batch)))
      # Write the current batch, then release it before fetching the next one
      writer$WriteTable(arrow_table(batch), chunk_size = chunk_size)
      batch <- NULL; gc()
      batch <- reader$read_next_batch()
    }
    writer$Close()
  }
  file$close()
}
```
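For context, a call might look like this (a minimal sketch; the file names are just placeholders, and it assumes {DBI}, {duckdb} and {arrow} are installed):

```r
# Hypothetical example: export the result of a query over a CSV file
# that duckdb can read directly. File names are placeholders.
SQL2parquet(
  "SELECT * FROM read_csv_auto('big_file.csv')",
  "big_file.parquet",
  chunk_size = 1e6
)
```

The single-statement alternative mentioned above would be something like `COPY (SELECT ...) TO 'big_file.parquet' (FORMAT PARQUET)` run directly in duckdb.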
_Originally posted by @nbc in #27 (comment)_
Hi @jllipatz, thanks, I'm very interested. I think preparing a parquet file in duckdb could be a good use case, but I don't feel comfortable enough in arrow's guts to start working on this for the moment. I need to explore it more.
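One possible direction for the partitioning part, sketched here as an untested idea rather than a finished implementation: recent {arrow} releases let `write_dataset()` consume a RecordBatchReader, so the duckdb reader could be handed over to arrow, which would take care of the partitioned file layout. The function name, column name and paths below are placeholders.

```r
library(DBI)
library(arrow)

# Sketch only: stream a duckdb query into a partitioned parquet dataset.
# Assumes arrow::write_dataset() accepts a RecordBatchReader (recent {arrow}).
sql_to_dataset <- function(query, path, partitioning, chunk_size = 1e6) {
  con <- dbConnect(duckdb::duckdb())
  on.exit(dbDisconnect(con, shutdown = TRUE), add = TRUE)
  reader <- duckdb::duckdb_fetch_record_batch(
    dbSendQuery(con, query, arrow = TRUE),
    chunk_size = chunk_size
  )
  # arrow handles the hive-style partitioning and the file layout
  write_dataset(reader, path, format = "parquet", partitioning = partitioning)
}

# Hypothetical call; "year" must be a column returned by the query
# sql_to_dataset("SELECT * FROM read_csv_auto('big_file.csv')",
#                "dataset_dir", partitioning = "year")
```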