Add a `duckdb_to_parquet` function using low-level arrow functions

ddotta opened this issue · 3 comments

Just to note that I wrote an enhanced SQL interface, available at https://github.com/jllipatz/SQL. It's still a work in progress...

Originally posted by @jllipatz in #27 (comment)

Hello,

The dbSendQuery/dbFetch pair doesn't fit duckdb: the whole query result is computed before dbFetch is even reached, which overfills the RAM. Here is a solution that works without consuming a lot of RAM. It also runs much faster than simply putting a COPY ... TO ... (FORMAT 'parquet') in the SQL query. Perhaps it could become the starting point of a new function in {parquetize}, if somebody adds the partitioning options that exist for the other functions.
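
For reference, before the function itself, here is roughly what the two approaches being compared look like; the `read_csv_auto()` query and the file paths are hypothetical placeholders, not from the original issue:

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb())

# dbSendQuery()/dbFetch(): as described above, the whole result set ends up
# in R's memory at once before it can be written to parquet.
res <- dbSendQuery(con, "SELECT * FROM read_csv_auto('big_input.csv')")
df  <- dbFetch(res)
dbClearResult(res)
arrow::write_parquet(df, "big_output.parquet")

# COPY ... TO: done entirely inside duckdb, but reported above as much slower
# than the streaming approach below.
dbExecute(con, "
  COPY (SELECT * FROM read_csv_auto('big_input.csv'))
  TO 'big_output.parquet' (FORMAT 'parquet');
")

dbDisconnect(con, shutdown = TRUE)
```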

```r
library(DBI)
library(duckdb)
library(arrow)

SQL2parquet <- function(query, path, chunk_size = 1e6) {
  con <- dbConnect(duckdb::duckdb())
  # Make sure the duckdb connection is released when the function exits.
  on.exit(dbDisconnect(con, shutdown = TRUE), add = TRUE)

  # Stream the query result as Arrow record batches instead of materializing
  # the whole result set with dbFetch().
  reader <- duckdb_fetch_record_batch(
    dbSendQuery(con, query, arrow = TRUE),
    chunk_size = chunk_size)

  file  <- FileOutputStream$create(path)
  batch <- reader$read_next_batch()

  if (!is.null(batch)) {
    # Build the low-level parquet writer from the first batch's schema.
    s <- batch$schema
    writer <- ParquetFileWriter$create(s, file,
      properties = ParquetWriterProperties$create(names(s)))

    i <- 0
    while (!is.null(batch)) {
      i <- i + 1
      message(sprintf("%d, %d rows", i, nrow(batch)))
      # Write the current batch, then release it before fetching the next one.
      writer$WriteTable(arrow_table(batch), chunk_size = chunk_size)
      batch <- NULL; gc()
      batch <- reader$read_next_batch()
    }

    writer$Close()
  }
  file$close()
}
```
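
A hypothetical call might look like the sketch below; the query, input CSV and output path are placeholders, not from the original issue. Since `SQL2parquet()` opens its own in-memory duckdb connection, the query has to reference its data sources directly (here a CSV file read with `read_csv_auto()`):

```r
# Hypothetical example: stream a large query result straight to a parquet file.
SQL2parquet(
  query      = "SELECT * FROM read_csv_auto('big_input.csv')",
  path       = "big_output.parquet",
  chunk_size = 1e6
)
```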

_Originally posted by @nbc in #27 (comment)_

Hi @jllipatz, thanks, I'm very interested. I think preparing a parquet file in duckdb could be a good use case, but I don't feel comfortable enough with arrow's internals to start working on this for the moment. I need to explore more.

I agree with @nbc. I find your idea really interesting and promising, @jllipatz 🚀
The only catch is that mastering all these low-level arrow features represents a fairly high entry cost 😢