tidyverse/duckplyr

`df_from_parquet()` doesn't take a directory name?

```r
> logs <- duckplyr::df_from_parquet("logs/")
Error:
! {"exception_type":"IO","exception_message":"No files found that match the pattern \"logs/\""}
```

But this is ok:

```r
logs <- duckplyr::df_from_parquet(dir("logs/", full.names = TRUE))
```

Oh, but that second one only reads a single data frame.

What actually works is `duckplyr_df_from_parquet("logs/*.parquet")`.
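For anyone hitting the same error, here is a minimal sketch of turning a directory name into the glob pattern that worked above. `parquet_glob()` is a hypothetical helper written for illustration, not part of duckplyr, and it assumes the files in the directory end in `.parquet`. It also rejects non-scalar input, since passing a vector of paths was what went wrong in the second attempt.

```r
# Hypothetical helper (not a duckplyr function): build a glob pattern
# for all Parquet files directly inside a directory.
parquet_glob <- function(path) {
  # Reject vectors, NA, and non-character input up front.
  stopifnot(is.character(path), length(path) == 1L, !is.na(path))
  # Drop trailing slashes so "logs/" and "logs" give the same pattern.
  path <- sub("/+$", "", path)
  file.path(path, "*.parquet")
}

pattern <- parquet_glob("logs/")
pattern
# → "logs/*.parquet"
```

The resulting string can then be passed on, e.g. `duckplyr_df_from_parquet(pattern)`, as in the call that worked above.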

But even when I get it working, it's basically unusable. I'll try to find some time to show @krlmlr in person next week.

Does `bind_rows()` give you a better experience? See #146 (comment):

```r
file_paths %>% map(duckplyr_df_from_parquet) %>% bind_rows()
```

`bind_rows()` reads into memory. `%>% reduce(union_all)` is better, but it will also read into memory in duckplyr 0.4.0 (it works better in duckplyr 0.3.0): tidyverse/dplyr#7049.
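To make the `reduce(union_all)` shape concrete, here is a small sketch using plain in-memory data frames as stand-ins for the per-file results. It only illustrates how the pieces are combined with dplyr and purrr; it says nothing about whether duckplyr keeps the computation in DuckDB, which is the open question in tidyverse/dplyr#7049.

```r
library(dplyr)
library(purrr)

# Stand-ins for per-file data frames (assumption for illustration); in the
# thread these would come from map(file_paths, duckplyr_df_from_parquet).
parts <- list(
  data.frame(id = 1:2, msg = c("a", "b")),
  data.frame(id = 3:4, msg = c("c", "d"))
)

# Fold the list pairwise with union_all(), which keeps duplicate rows.
combined <- reduce(parts, union_all)
nrow(combined)
# → 4
```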

What should work is `duckplyr_df_from_csv("file_*.csv")`, but I'm seeing mixed results too: duckdb/duckdb#12903.

Action items:

  • Check that `path` is a scalar
  • Fix `union_all()` -- work around tidyverse/dplyr#7049
  • Implement `bind_rows()` to use `reduce(union_all)` under the hood
  • Is there a canonical helper for the `map(paths, fun) %>% bind_rows()` pattern that we can point to in the documentation?

`bind_rows()` is not a generic, unfortunately. Will solve with documentation.