duckdblabs/duckplyr

Fix descending sort for character data

Closed this issue · 4 comments

options(conflicts.policy = list(warn = FALSE))
library(duckplyr)

Sys.setenv("DUCKPLYR_FORCE" = "TRUE")

data.frame(a = letters[1:3]) |>
  as_duckplyr_df() |>
  arrange(desc(a))
#> Error: Binder Error: No function matches the given name and argument types '-(VARCHAR)'. You might need to add explicit type casts.
#>  Candidate functions:
#>  -(TINYINT) -> TINYINT
#>  -(TINYINT, TINYINT) -> TINYINT
#>  -(SMALLINT) -> SMALLINT
#>  -(SMALLINT, SMALLINT) -> SMALLINT
#>  -(INTEGER) -> INTEGER
#>  -(INTEGER, INTEGER) -> INTEGER
#>  -(BIGINT) -> BIGINT
#>  -(BIGINT, BIGINT) -> BIGINT
#>  -(HUGEINT) -> HUGEINT
#>  -(HUGEINT, HUGEINT) -> HUGEINT
#>  -(FLOAT) -> FLOAT
#>  -(FLOAT, FLOAT) -> FLOAT
#>  -(DOUBLE) -> DOUBLE
#>  -(DOUBLE, DOUBLE) -> DOUBLE
#>  -(DECIMAL) -> DECIMAL
#>  -(DECIMAL, DECIMAL) -> DECIMAL
#>  -(UTINYINT) -> UTINYINT
#>  -(UTINYINT, UTINYINT) -> UTINYINT
#>  -(USMALLINT) -> USMALLINT
#>  -(USMALLINT, USMALLINT) -> USMALLINT
#>  -(UINTEGER) -> UINTEGER
#>  -(UINTEGER, UINTEGER) -> UINTEGER
#>  -(UBIGINT) -> UBIGINT
#>  -(UBIGINT, UBIGINT) -> UBIGINT
#>  -(DATE, DATE) -> BIGINT
#>  -(DATE, INTEGER) -> DATE
#>  -(TIMESTAMP, TIMESTAMP) -> INTERVAL
#>  -(INTERVAL, INTERVAL) -> INTERVAL
#>  -(DATE, INTERVAL) -> DATE
#>  -(TIME, INTERVAL) -> TIME
#>  -(TIMESTAMP, INTERVAL) -> TIMESTAMP
#>  -(INTERVAL) -> INTERVAL

Created on 2024-02-04 with reprex v2.1.0

I'm new to the project, but had a look at this and found the following:

  • Currently sorting in descending order relies on manipulating the values being sorted, not defining the sort order. Specifically desc is defined as a macro that negates the value of it's parameters before being sorted in an ascending order.
  • While the above is fine for numeric types, this issue arises because you cannot negate varchar-type
  • If I understood correctly, arrange ultimately calls rapi_rel_order in the duckdb-r -package, which creates OrderByNode:s for duckdb to process. Said nodes have an ascending sort order as a hardcoded value

Possible solutions:

  1. Use the current solution and define an macro that alters varchars so that they are sorted correctly. However, I cannot think of any way to do this?
  2. Extract and pass the sort order of each expression in sort to rapi_rel_order and use that information to create OrderByNodes with the correct sort order
    • Basically means that if the expression is not wrapped in a desc-function call, the sort order is ascending?
    • It is not clear how to pass the information? I don't know if we can pass it within the duckdb expressions or would this require changing the signature of rapi_rel_order.
  3. Something completely different?

Thanks. Your second solution sounds like the way to go. Would you like to take a stab?

Sure, I'd like to give it a go.

ATM I plan to:

  1. Create a PR for duckdb-r -package that changes the signature of rapi_rel_order so that it can accept a vector of boolean values describing the sort order (if true, sort ascending)
    • The vector defaults to true for all expressions
  2. After 1. is merged to duckdb-r -package, tackle this issue by altering rel_order so that it passes the sort order of each expression to rapi_rel_order

Thanks, I forgot how much work this was. Happy to support you here.