duckdblabs/duckplyr

Read in multiple csvs when file paths aren't amenable to glob syntax

Closed this issue · 4 comments

I routinely work with multiple large CSVs with a mess of file paths that aren't amenable to glob syntax. When working with DuckDB directly I can supply these as, say, SELECT * FROM read_csv(['file_1.csv', 'file_2.csv']), and that works. I can't figure out how to do the equivalent in duckplyr.
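For reference, a minimal sketch of that DuckDB-side pattern via the duckdb R package, assuming two hypothetical files file_1.csv and file_2.csv in the working directory:

library(DBI)

con <- dbConnect(duckdb::duckdb())
# read_csv() accepts a list of paths, so no glob pattern is needed
res <- dbGetQuery(con, "SELECT * FROM read_csv(['file_1.csv', 'file_2.csv'])")
dbDisconnect(con, shutdown = TRUE)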

I've tried:

file_paths <- c("file_1.csv", "file_2.csv")
or
file_paths <- list("file_1.csv", "file_2.csv")

duckplyr_df_from_csv(file_paths) %>% do_something()

It doesn't error, but it only reads in the first file.

Is this possible? If so, how? If not, I think there should at least be a warning when a list or vector of multiple file paths is passed.

Thanks. Code like file_paths %>% map(duckplyr_df_from_csv) %>% bind_rows() has worked for me in practice, but I agree that this should be streamlined. Would you like to contribute a PR?
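As a sketch, the suggested workaround could look like this, assuming purrr and dplyr are attached and file_paths holds the hypothetical paths from above:

library(duckplyr)
library(purrr)
library(dplyr)

file_paths <- c("file_1.csv", "file_2.csv")  # hypothetical paths

combined <- file_paths %>%
  map(duckplyr_df_from_csv) %>%  # one duckplyr frame per file
  bind_rows()                    # combine them, materializing in memory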

I hadn't thought to use map, thanks for the tip.

I'm sorry, I don't have the experience or knowledge to do a PR :(

bind_rows() reads the result into memory. %>% reduce(union_all) is better, but it will also read into memory in duckplyr 0.4.0 (it works better in duckplyr 0.3.0): tidyverse/dplyr#7049.
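A sketch of the reduce(union_all) variant under the same assumptions as above, keeping in mind the materialization caveat in duckplyr 0.4.0:

library(duckplyr)
library(purrr)
library(dplyr)

result <- file_paths %>%
  map(duckplyr_df_from_csv) %>%
  reduce(union_all)   # pairwise UNION ALL of all frames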

What should work is duckplyr_df_from_csv("file_*.csv"), but I hear glob syntax is not an option here, and I'm seeing mixed results with it too: duckdb/duckdb#12903.

Action item: reimplement bind_rows() to use reduce(union_all) under the hood.

The action items here are a subset of those in #181 (comment), let's move the discussion there.