Need to be stricter about column compatibility in joins

Question

Need to be stricter about column compatibility in joins

Closed this issue 12 days ago · 5 comments

Also, use simple identity instead of r_base::== for joins.

dplyr::union_all(data.frame(a = 1), data.frame(a = "2"))
#> Error in `dplyr::union_all()`:
#> ! `x` and `y` are not compatible.
#> ✖ Incompatible types for column `a`: double vs character.
dplyr::setdiff(data.frame(a = 1), data.frame(a = "2"))
#> Error in `dplyr::setdiff()`:
#> ! `x` and `y` are not compatible.
#> ✖ Incompatible types for column `a`: double vs character.
dplyr::left_join(data.frame(a = 1), data.frame(a = "2"), by = "a")
#> Error in `dplyr::left_join()`:
#> ! Can't join `x$a` with `y$a` due to incompatible types.
#> ℹ `x$a` is a <double>.
#> ℹ `y$a` is a <character>.
dplyr::semi_join(data.frame(a = 1), data.frame(a = "2"), by = "a")
#> Error in `dplyr::semi_join()`:
#> ! Can't join `x$a` with `y$a` due to incompatible types.
#> ℹ `x$a` is a <double>.
#> ℹ `y$a` is a <character>.

duckplyr:::duckplyr_union_all(data.frame(a = 1), data.frame(a = "2"))
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Union
#>   r_dataframe_scan(0x11d52b258)  r_dataframe_scan(0x11d0c4e38)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - a (VARCHAR)
#> 
#>     a
#> 1 1.0
#> 2   2
duckplyr:::duckplyr_setdiff(data.frame(a = 1), data.frame(a = "2"))
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Except
#>   r_dataframe_scan(0x11d556c70)  r_dataframe_scan(0x11d55aea0)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - a (VARCHAR)
#> 
#>     a
#> 1 1.0
duckplyr:::duckplyr_left_join(data.frame(a = 1), data.frame(a = "2"), by = "a")
#> Error in `left_join()` at duckplyr/R/left_join.R:19:3:
#> ! Can't join `x$a` with `y$a` due to incompatible types.
#> ℹ `x$a` is a <double>.
#> ℹ `y$a` is a <character>.
duckplyr:::duckplyr_semi_join(data.frame(a = 1), data.frame(a = "2"), by = "a")
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Join REGULAR SEMI ___eq_na_matches_na(lhs.a, rhs.a)
#>   r_dataframe_scan(0x14aace6d0)
#>   r_dataframe_scan(0x13b892678)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - a (DOUBLE)
#> 
#> [1] a
#> <0 rows> (or 0-length row.names)

^{Created on 2024-05-16 with reprex v2.1.0}

krlmlr commented 12 days ago

Done now.

Answer 1 · 2024-05-20T00:50:24.000Z

Hi, just to add, I've found an issue that I think is related:

I was finding the maximum value for several variables using the following code:

vector_of_file_paths |>
  map(\(path)
    duckplyr_df_from_csv(path) |>
    select(any_of(vector_of_numeric_col_names)) |>
    rename_with(tolower) |>
    head(5) |>
    summarise(across(everything(), ~ max(.x, na.rm = T))) |>
 bind_rows()

and it was failing at the last step when a column only happened to have NAs in the first 5 rows. If I didn't row bind, I could see the result column had a value of -Inf. I can see from the output during run that these result columns were interpreted by duckplyr as "varchar"

I am unable to share logs from the environment I work in, sorry.

Answer 2 · 2024-05-20T16:48:00.000Z

Thanks, @nicki-dese, this is a different problem. Can you please open a new issue?

A reprex would be very useful. This would be a self-contained example with toy data, see, e.g., https://reprex.tidyverse.org/articles/reprex-dos-and-donts.html .

Answer 3 · 2024-05-20T17:26:44.000Z

Set operations are good now, joins need tidyverse/dplyr#7029.

Answer 4 · 2024-06-13T23:32:06.000Z

Sorry for the delayed reply. I tried generating a reprex and it didn’t error in the same way. If I can manage to reproduce, I’ll post a new issue. Nicki Norris Assistant Director - Data Phone (02) 6240 8969 From: Kirill Müller ***@***.***> Sent: Tuesday, May 21, 2024 2:48 AM To: duckdblabs/duckplyr ***@***.***> Cc: NORRIS,Nicki ***@***.***>; Mention ***@***.***> Subject: Re: [duckdblabs/duckplyr] Need to be stricter about column compatibility (Issue #168) CAUTION: This email originated from outside of the organisation. Do not click links or open attachments unless you recognise the sender and know the content is safe. Thanks, @nicki-dese<https://github.com/nicki-dese>, this is a different problem. Can you please open a new issue? A reprex would be very useful. This would be a self-contained example with toy data, see, e.g., https://reprex.tidyverse.org/articles/reprex-dos-and-donts.html . — Reply to this email directly, view it on GitHub<#168 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/APJUET37UGV7JJ4GWTEVKZLZDISNNAVCNFSM6AAAAABHZJ4TVOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMRQHAZDENJQGQ>. You are receiving this because you were mentioned.Message ID: ***@***.******@***.***>> Notice: The information contained in this email message and any attached files may be confidential information, and may also be the subject of legal professional privilege. If you are not the intended recipient, any use, disclosure or copying of this email is unauthorised. If you received this email in error, please notify the sender by contacting the department's switchboard on 1300 566 046 during business hours (8:30am - 5pm Canberra time) and delete all copies of this transmission together with any attachments.