duckdblabs/duckplyr

Read in multiple csvs when file paths aren't amenable to glob syntax

Closed this issue · 4 comments

I routinely work with multiple large CSVs with a mess of file paths that aren't amenable to glob syntax. When working with DuckDB directly I can supply these as, say, SELECT * FROM read_csv(['file_1.csv', 'file_2.csv']), and that works. I can't figure out how to do the equivalent in duckplyr.
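For reference, a minimal sketch of that DuckDB-side pattern via the duckdb R package, assuming two hypothetical files file_1.csv and file_2.csv in the working directory:

library(DBI)

con <- dbConnect(duckdb::duckdb())
# read_csv() accepts a list of paths, so no glob pattern is needed
res <- dbGetQuery(con, "SELECT * FROM read_csv(['file_1.csv', 'file_2.csv'])")
dbDisconnect(con, shutdown = TRUE)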

I've tried:

file_paths <- c("file_1.csv", "file_2.csv")
or
file_paths <- list("file_1.csv", "file_2.csv")

duckplyr_df_from_csv(file_paths) %>% do_something()

It doesn't error, but it only reads in the first file.

Is this possible? If so, how? If not, I think there should at least be a warning when a list or vector of multiple file paths is passed.

Thanks. Code like file_paths %>% map(duckplyr_df_from_csv) %>% bind_rows() has worked for me in practice, but I agree that this should be streamlined. Would you like to contribute a PR?
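As a sketch, the suggested workaround could look like this, assuming purrr and dplyr are attached and file_paths holds the hypothetical paths from above:

library(duckplyr)
library(purrr)
library(dplyr)

file_paths <- c("file_1.csv", "file_2.csv")  # hypothetical paths

combined <- file_paths %>%
  map(duckplyr_df_from_csv) %>%  # one duckplyr frame per file
  bind_rows()                    # combine them, materializing in memory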

I hadn't thought to use map, thanks for the tip.

I'm sorry, I don't have the experience or knowledge to do a PR :(

bind_rows() reads the result into memory. %>% reduce(union_all) is better, but it will also read into memory in duckplyr 0.4.0 (it works better in duckplyr 0.3.0): tidyverse/dplyr#7049.
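A sketch of the reduce(union_all) variant under the same assumptions as above, keeping in mind the materialization caveat in duckplyr 0.4.0:

library(duckplyr)
library(purrr)
library(dplyr)

result <- file_paths %>%
  map(duckplyr_df_from_csv) %>%
  reduce(union_all)   # pairwise UNION ALL of all frames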

What should work is duckplyr_df_from_csv("file_*.csv"), but I hear glob syntax is not an option here, and I'm seeing mixed results with it too: duckdb/duckdb#12903.

Action item: reimplement bind_rows() to use reduce(union_all) under the hood.

The action items here are a subset of those in #181 (comment), let's move the discussion there.