`df_from_parquet()` doesn't take a directory name?

Question

Closed this issue 2 months ago · 5 comments

> logs <- duckplyr::df_from_parquet("logs/")
Error:
! {"exception_type":"IO","exception_message":"No files found that match the pattern \"logs/\""}

But this is ok:

logs <- duckplyr::df_from_parquet(dir("logs/", full.names = TRUE))

Answer 1 · 2024-07-05T17:01:01.000Z

Oh, but that second one only reads a single data frame.

What actually works is duckplyr_df_from_parquet("logs/*.parquet")

Answer 2 · 2024-07-05T17:09:57.000Z

But even when I get it working, it's basically unusable. I'll try to find some time to show @krlmlr in person next week.

Answer 3 · 2024-07-06T10:18:01.000Z

Does bind_rows() give you a better experience? See #146 (comment):

file_paths %>% map(duckplyr_df_from_parquet) %>% bind_rows()

Answer 4 · 2024-07-08T20:54:18.000Z

bind_rows() reads into memory. %>% reduce(union_all) is better, but it will also read into memory in duckplyr 0.4.0 (it works better in duckplyr 0.3.0): tidyverse/dplyr#7049.
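For reference, the reduce(union_all) pattern can be wrapped in a small function. This is only a sketch: read_many() and its arguments are hypothetical names, not part of duckplyr, and whether the result avoids reading into memory still depends on the union_all() behavior discussed in tidyverse/dplyr#7049.

```r
# Hypothetical helper (not part of duckplyr): read every path with `reader`
# and fold the results pairwise with `combine`.
# The default `combine` is only evaluated when it is actually used, so dplyr
# is needed only if you do not supply your own function.
read_many <- function(paths, reader, combine = dplyr::union_all) {
  stopifnot(is.character(paths), length(paths) >= 1L)
  tables <- lapply(paths, reader)  # one table per file
  Reduce(combine, tables)          # same idea as purrr::reduce(union_all)
}

# Assumed usage, with the directory from the original report:
# logs <- read_many(
#   dir("logs/", full.names = TRUE),
#   duckplyr::duckplyr_df_from_parquet
# )
```

The fold is equivalent to file_paths %>% map(duckplyr_df_from_parquet) %>% reduce(union_all); base lapply() and Reduce() are used here only to keep the sketch free of extra dependencies.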

What should work is duckplyr_df_from_csv("file_*.csv"), but I'm seeing mixed results too: duckdb/duckdb#12903.

Action items:

Check that path is a scalar
Fix union_all() -- work around tidyverse/dplyr#7049
Implement bind_rows() to use reduce(union_all) under the hood
Is there a canonical helper for the map(paths, fun) %>% bind_rows() pattern that we can point to in the documentation?
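A rough sketch of what the scalar check in the first action item could look like. check_scalar_path() is a hypothetical name and the error wording is made up for illustration; the actual fix in duckplyr may look different.

```r
# Hypothetical validation helper (not the actual duckplyr implementation).
# Rejects anything that is not a single, non-missing string, so that a
# character vector such as dir("logs/", full.names = TRUE) fails loudly
# instead of being silently accepted.
check_scalar_path <- function(path) {
  if (!is.character(path) || length(path) != 1L || is.na(path)) {
    stop(
      "`path` must be a single string; use a glob such as ",
      "\"logs/*.parquet\" to read several files.",
      call. = FALSE
    )
  }
  invisible(path)
}
```

With a check like this, passing a character vector of several paths would raise an error up front, while a single glob string such as "logs/*.parquet" would pass through unchanged.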

Answer 5 · 2024-07-10T18:50:00.000Z

bind_rows() is not a generic, unfortunately. Will solve with documentation.