duckdblabs/duckplyr

Need to be stricter about column compatibility in joins

Closed this issue · 5 comments

Also, use simple identity instead of r_base::== for joins.

dplyr::union_all(data.frame(a = 1), data.frame(a = "2"))
#> Error in `dplyr::union_all()`:
#> ! `x` and `y` are not compatible.
#> ✖ Incompatible types for column `a`: double vs character.
dplyr::setdiff(data.frame(a = 1), data.frame(a = "2"))
#> Error in `dplyr::setdiff()`:
#> ! `x` and `y` are not compatible.
#> ✖ Incompatible types for column `a`: double vs character.
dplyr::left_join(data.frame(a = 1), data.frame(a = "2"), by = "a")
#> Error in `dplyr::left_join()`:
#> ! Can't join `x$a` with `y$a` due to incompatible types.
#> ℹ `x$a` is a <double>.
#> ℹ `y$a` is a <character>.
dplyr::semi_join(data.frame(a = 1), data.frame(a = "2"), by = "a")
#> Error in `dplyr::semi_join()`:
#> ! Can't join `x$a` with `y$a` due to incompatible types.
#> ℹ `x$a` is a <double>.
#> ℹ `y$a` is a <character>.

duckplyr:::duckplyr_union_all(data.frame(a = 1), data.frame(a = "2"))
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Union
#>   r_dataframe_scan(0x11d52b258)  r_dataframe_scan(0x11d0c4e38)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - a (VARCHAR)
#> 
#>     a
#> 1 1.0
#> 2   2
duckplyr:::duckplyr_setdiff(data.frame(a = 1), data.frame(a = "2"))
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Except
#>   r_dataframe_scan(0x11d556c70)  r_dataframe_scan(0x11d55aea0)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - a (VARCHAR)
#> 
#>     a
#> 1 1.0
duckplyr:::duckplyr_left_join(data.frame(a = 1), data.frame(a = "2"), by = "a")
#> Error in `left_join()` at duckplyr/R/left_join.R:19:3:
#> ! Can't join `x$a` with `y$a` due to incompatible types.
#> ℹ `x$a` is a <double>.
#> ℹ `y$a` is a <character>.
duckplyr:::duckplyr_semi_join(data.frame(a = 1), data.frame(a = "2"), by = "a")
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Join REGULAR SEMI ___eq_na_matches_na(lhs.a, rhs.a)
#>   r_dataframe_scan(0x14aace6d0)
#>   r_dataframe_scan(0x13b892678)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - a (DOUBLE)
#> 
#> [1] a
#> <0 rows> (or 0-length row.names)

Created on 2024-05-16 with reprex v2.1.0

Hi, just to add, I've found an issue that I think is related:

I was finding the maximum value for several variables using the following code:

vector_of_file_paths |>
  map(\(path)
    duckplyr_df_from_csv(path) |>
    select(any_of(vector_of_numeric_col_names)) |>
    rename_with(tolower) |>
    head(5) |>
    summarise(across(everything(), ~ max(.x, na.rm = T))) |>
 bind_rows()

and it was failing at the last step when a column only happened to have NAs in the first 5 rows. If I didn't row bind, I could see the result column had a value of -Inf. I can see from the output during run that these result columns were interpreted by duckplyr as "varchar"

I am unable to share logs from the environment I work in, sorry.

Thanks, @nicki-dese, this is a different problem. Can you please open a new issue?

A reprex would be very useful. This would be a self-contained example with toy data, see, e.g., https://reprex.tidyverse.org/articles/reprex-dos-and-donts.html .

Set operations are good now, joins need tidyverse/dplyr#7029.

Done now.