duckdblabs/duckplyr

Add new data by columns to a DuckDB database without loading it into memory


I have incoming data that I want to store on disk in a database or something similar. The data looks something like this:

incoming_data <- function(ncol = 5) {
  # 100 random values reshaped into `ncol` columns
  dat <- sample(1:10, 100, replace = TRUE) |>
    matrix(ncol = ncol) |>
    as.data.frame()
  # random column names, e.g. "q42"
  random_names <- sapply(seq_len(ncol(dat)), \(x) paste0(sample(letters, 1), sample(1:100, 1)))
  colnames(dat) <- random_names
  dat
}
incoming_data()

This incoming_data is just an example. In reality, each incoming_data set will have around 5k rows and about 50k columns, and the entire final file will be about 200-400 gigabytes.

My question is: how can I add new data as columns to the database without loading the file into RAM?

# path to the on-disk database (adjust for your system)
path <- "D:\\R_scripts\\new\\duckdb\\data\\DB.duckdb"
library(duckdb)
library(duckplyr)
con <- dbConnect(duckdb(), dbdir = path, read_only = FALSE)
# write one batch of data into the DB
dbWriteTable(con, "my_dat", incoming_data())


#### how to make something like this, but on disk? ####
# in-memory equivalent of what I want: append new columns to the stored table
my_dat <- cbind(my_dat, incoming_data())
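
For reference, one way to approximate this cbind inside DuckDB, without pulling the existing table into R, is a POSITIONAL JOIN, which pairs the rows of two relations by position. The following is a minimal sketch, not an official duckplyr recipe; it assumes the con and my_dat from above, and the names new_cols and my_dat_wide are hypothetical:

# register the new batch as a virtual table; this does not copy it into the DB file
new_cols <- incoming_data()
duckdb::duckdb_register(con, "new_cols", new_cols)

# build a widened table by pairing rows positionally, then swap it in
dbExecute(con, "CREATE TABLE my_dat_wide AS SELECT * FROM my_dat POSITIONAL JOIN new_cols")
dbExecute(con, "DROP TABLE my_dat")
dbExecute(con, "ALTER TABLE my_dat_wide RENAME TO my_dat")
duckdb::duckdb_unregister(con, "new_cols")

Note that this rewrites an ever-wider table on every batch, which runs into exactly the concern raised in the reply below.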

Thanks. This is a very broad question, and not a good fit for this issue tracker. Either way, 50k columns sounds like way too many. Any chance you can make the data "longer"?
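
For illustration, a minimal sketch of what "longer" could look like here, using tidyr::pivot_longer() and DBI::dbAppendTable(); the row_id, batch_id, variable, and value columns are illustrative names, not anything prescribed by duckplyr:

library(tidyr)

to_long <- function(dat, batch_id) {
  dat |>
    dplyr::mutate(row_id = dplyr::row_number(), batch_id = batch_id) |>
    pivot_longer(
      cols = -c(row_id, batch_id),
      names_to = "variable",
      values_to = "value"
    )
}

# the first batch creates the table; later batches append rows,
# so RAM only ever holds one batch at a time
dbWriteTable(con, "my_dat_long", to_long(incoming_data(), batch_id = 1))
dbAppendTable(con, "my_dat_long", to_long(incoming_data(), batch_id = 2))

In this layout, adding a new batch of columns becomes appending rows, which DuckDB can do without reading the existing data back into RAM.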

Thanks for your lightning-fast response!
Yes, I can make the data "longer".

I understand that my question doesn't really fit the format, and I apologize for that, but I would be very grateful for your help.