- 1 Apache Arrow capstone project
- 2 Core
  - 2.1 Allow bindings to use the `pkg::` prefixes
  - 2.2 Allow users to `arrow_eval` a function
  - 2.3 Document a binding (semi) automatically
  - 2.4 More specific handling for known errors in `arrow_eval`
  - 2.5 Allow users to inspect ExecPlans (`show_query()` for `arrow_dplyr_query`)
  - 2.6 Work around masking of data type functions
  - 2.7 Document all of the above in user / developer facing documentation
  - 2.8 Write a blog post
- 3 Stretch goals

Implementing the changes below would considerably improve both the user
and the developer experience. They will remove ambiguity when calling a
binding (by allowing users to use the `pkg::` prefix) and make it easier
to figure out what is going on when things go wrong (by adding a way to
inspect the Arrow query being generated in a dplyr pipeline, and by
allowing more granular condition handling and messaging). Moreover, users
will be able to include their own functions (written in regular R syntax
and operating on Arrow data) in a dplyr pipeline. Users will also have
access to minimal documentation for each binding, pointing to the
function the binding emulates and highlighting possible differences.

2 Core

2.1 Allow bindings to use the `pkg::` prefixes

- Jira: ARROW-14575
- users will be able to use either the function name (`fun`) or the namespace-qualified version (`pkg::fun`) when calling Arrow bindings
- when defining a binding we should be using the `pkg::fun` notation
- in connection with "Allow users to `arrow_eval` a function", this might mean that the non-namespace-qualified version of a binding could be overwritten by a user who decides to register another `fun` binding
- Linked to: Document a binding (semi) automatically
- Steps:
  - Register each binding twice in the `nse_funcs` function registry (once as `fun()` and once as `pkg::fun()`).
  - Sort out edge cases, e.g. for some of the unary functions (which are defined as a list that is used for several purposes)
  - document (where?) that going forward we should be using `pkg::fun` when defining a binding, which will register 2 copies of the same binding.
- Estimated time: 3-4 days
- Definition of done:
  - users being able to correctly use a namespaced version of all our bindings, making, for example, the following snippet of code work:

    ```r
    library(arrow)
    library(dplyr)

    nycflights13::flights %>%
      arrow_table() %>%
      mutate(year = lubridate::year(time_hour)) %>%
      collect()
    ```

  - and being able to use the {lubridate} version of `date()` in the snippet below:

    ```r
    library(arrow)
    library(dplyr)

    test_df <- tibble(
      posixct_date = as.POSIXct(c("2012-03-26 23:12:13", NA), tz = "America/New_York"),
      integer_var = c(32L, NA)
    )

    test_df %>%
      arrow_table() %>%
      mutate(a_date = lubridate::date(posixct_date)) %>%
      collect()
    ```
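
The "register each binding twice" step above could be sketched roughly as follows. This is a toy illustration in base R, not the {arrow} implementation: the `nse_funcs` environment here only stands in for the real registry, `register_binding_twice()` is a hypothetical helper, and the binding body is a placeholder string rather than an Arrow expression.

```r
# Toy registry standing in for arrow's internal `nse_funcs` (assumption:
# the real registry and the real `register_binding()` may look different).
nse_funcs <- new.env(parent = emptyenv())

# Hypothetical helper: takes a namespace-qualified name such as
# "lubridate::year" and stores the same binding under both names.
register_binding_twice <- function(qualified_name, fun) {
  parts <- strsplit(qualified_name, "::", fixed = TRUE)[[1]]
  stopifnot(length(parts) == 2L)
  unqualified_name <- parts[[2]]
  assign(qualified_name, fun, envir = nse_funcs)
  assign(unqualified_name, fun, envir = nse_funcs)
  invisible(fun)
}

register_binding_twice("lubridate::year", function(x) {
  # Placeholder body; a real binding would build an Arrow expression.
  paste0("year(", x, ")")
})

# Both `year` and `lubridate::year` are now present in the registry.
ls(nse_funcs)
```

With this shape, a later user-registered `year` would overwrite only the unqualified entry, which is the overwriting scenario raised in the bullets above.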

2.2 Allow users to `arrow_eval` a function

- Jira: ARROW-14071
- users would be able to define their functions using regular R syntax and, as long as a direct translation of the internals is possible with existing bindings, use them in an {arrow} dplyr pipeline.
- there are several possible directions:
  - will users need to register their own functions or would this be done automatically?
  - when does the registration take place? before or after loading `arrow`?
    - maybe use `setHook()` to register functions before attaching {arrow}
  - could we make use of {rlang}’s top and bottom of the data mask?
  - investigate if `rlang::inject()` could be used here
  - investigate if a simple injection with `!!` works
- Steps:
  - Translate the user-defined function with the help of bindings
  - A second step might be accessing / registering these functions
    - Where is the function? Global environment, another script that is being sourced, or a package
    - If registering is required, the order of namespace loading / attaching might be important
- Estimated duration: 2 weeks
- Definition of done: users will be able to define and use their own bindings in {dplyr}-like pipelines, as long as they only contain bindings that have already been implemented, or direct calls to libarrow kernels. Users would be able to write their own version of `nchar`, for example, and that would work:

```r
library(arrow)
library(dplyr)

nchar2 <- function(x) {
  nchar(x)
}

tibble::tibble(my_string = "1234") %>%
  arrow_table() %>%
  mutate(nchar(my_string), nchar2(my_string)) %>%
  collect()
```
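
One of the directions listed above (translating the user-defined function with the help of existing bindings) might look something like the sketch below. It is base R only and purely illustrative: `binding_mask` is a stand-in for whatever data mask {arrow} builds, `translate_user_function()` is a hypothetical helper, and the binding returns a placeholder string instead of an Arrow expression.

```r
# Toy data mask: an environment holding bindings (assumption: stands in
# for the mask arrow builds from its function registry).
binding_mask <- new.env(parent = baseenv())
assign(
  "nchar",
  function(x) paste0("utf8_length(", x, ")"),  # placeholder "translation"
  envir = binding_mask
)

# Hypothetical helper: copy the user's function and re-parent the copy so
# that calls inside its body resolve to bindings before base R functions.
translate_user_function <- function(fun, mask) {
  translated <- fun
  environment(translated) <- mask
  translated
}

nchar2 <- function(x) {
  nchar(x)
}

nchar2_arrow <- translate_user_function(nchar2, binding_mask)

nchar2("1234")            # base R behaviour is untouched: 4
nchar2_arrow("my_string") # "utf8_length(my_string)"
```

Re-parenting a copy leaves the user's original function unchanged, but it only works when everything the body calls is available in the mask, which mirrors the "as long as a direct translation is possible" caveat above.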

2.3 Document a binding (semi) automatically

- Jira: ARROW-15011
- Likely depends on "Allow bindings to use the `pkg::` prefixes" (removing any ambiguity with regard to the function we’re emulating)
- This will allow us to document differences in behaviour, unsupported arguments, etc.
- I think this could be really useful for users. How would they access said documentation?
- As a stretch, we might include the package version of the function we’re linking to. Functions / packages are not static, and the original function and the {arrow} binding might drift apart over time.
- Steps:
  - create a (manual) prototype of what a documented binding would look like
  - generate a skeleton / minimal documentation automatically
  - allow adjustment of the minimal documentation where needed
  - decide on how the user will access the documentation
  - might involve writing custom {roxygen2} tags and roclets
  - follow-up ticket / step: document all bindings
- Possible breakdown:
  - scaffolding for documenting a binding
  - extend to cover package version
  - document all bindings
- Estimated duration:
  - given the rough documentation on how to define custom {roxygen2} extensions, I anticipate this will take extra time due to the need to understand the {roxygen2} source code.
  - rough estimate: 3-4 weeks (without documenting all the bindings)
- Definition of done: bindings can be documented based on the function they emulate. At minimum, a reference to the main function: “This binding should work just like X from package Y”.
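
As a rough idea of what "generate a skeleton / minimal documentation automatically" could mean, the sketch below builds roxygen-style comment lines from a namespace-qualified function name. `binding_doc_skeleton()` is a hypothetical helper and the wording of the generated lines is an assumption; it does not use custom {roxygen2} tags or roclets, which the real solution might need.

```r
# Hypothetical helper: build minimal roxygen-style lines for a binding from
# the namespace-qualified name of the function it emulates.
binding_doc_skeleton <- function(qualified_name, include_version = FALSE) {
  parts <- strsplit(qualified_name, "::", fixed = TRUE)[[1]]
  stopifnot(length(parts) == 2L)
  pkg <- parts[[1]]
  fun <- parts[[2]]

  lines <- c(
    sprintf("#' Arrow binding for `%s()`", qualified_name),
    "#'",
    sprintf(
      "#' This binding should work just like `%s()` from package {%s}.",
      fun, pkg
    )
  )

  if (include_version) {
    # Stretch idea from the list above: record the version of the emulated
    # package. Requires that package to be installed.
    version <- as.character(utils::packageVersion(pkg))
    lines <- c(lines, sprintf("#' Written against {%s} version %s.", pkg, version))
  }

  lines
}

# "base" is used here only because it is always installed.
writeLines(binding_doc_skeleton("base::nchar", include_version = TRUE))
```

The skeleton would still need manual adjustment for unsupported arguments and behavioural differences, as noted in the steps.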

2.4 More specific handling for known errors in `arrow_eval`

- Jira: ARROW-13370
- This is related to, but not dependent on, "Document a binding (semi) automatically".
- At present we rely on the `"not supported.*Arrow"` incantation to identify an `arrow-try-error`, which implies that error messages that depart from this pattern are not surfaced and users never see them.
- Maybe expand the scope of this issue / ticket to include more precise messaging, so that when part of a more complex call fails we can pinpoint the exact location / reason for the failure.
- Components:
  - we could opt for custom error messages when an argument is not supported, for example
  - we could also warn when there is a difference between the binding and the original function (e.g. different default values, slight differences in implementation, etc.)
- Steps:
  - expand the classification of Arrow errors
  - update the behaviour: allow the printing of error messages when they do not contain `"not supported.*Arrow"`
- Estimated duration: 1-2 weeks
- Definition of done: allow more detailed error messages (i.e. messages not containing `"not supported.*Arrow"` can be classified as `arrow-try-error`, or a different, related class, and surfaced to inform the user).

```r
# this error message surfaces
register_binding("difftime", function(time1,
                                      time2,
                                      tz,
                                      units = "secs") {
  if (units != "secs") {
    abort("`difftime()` with units other than `secs` not supported in Arrow")
  }
  ...
})

# while this one would get lost in the condition handling
register_binding("difftime", function(time1,
                                      time2,
                                      tz,
                                      units = "secs") {
  if (units != "secs") {
    abort("This function errors for units other than `secs`")
  }
  ...
})
```
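
One way to "expand the classification of Arrow errors" so that messages no longer need the magic phrase is to signal classed conditions and handle them by class instead of by regular expression. The sketch below uses base R only; the class name `arrow_not_supported` and the helpers `arrow_not_supported()`, `difftime_binding()` and `try_binding()` are all hypothetical and not part of {arrow}.

```r
# Hypothetical constructor for a classed error condition.
arrow_not_supported <- function(message) {
  structure(
    list(message = message, call = NULL),
    class = c("arrow_not_supported", "error", "condition")
  )
}

# Toy binding: signals the classed error; the message does not contain
# "not supported ... Arrow".
difftime_binding <- function(time1, time2, tz, units = "secs") {
  if (units != "secs") {
    stop(arrow_not_supported("This function errors for units other than `secs`"))
  }
  "ok"
}

# The handler dispatches on the condition class, so the wording of the
# message is free to vary.
try_binding <- function(expr) {
  tryCatch(
    expr,
    arrow_not_supported = function(cnd) {
      paste("Surfaced to the user:", conditionMessage(cnd))
    }
  )
}

try_binding(difftime_binding(1, 2, units = "mins"))
try_binding(difftime_binding(1, 2))
```

Errors that do not carry the class still propagate as ordinary errors, so genuinely unexpected failures are not swallowed by the handler.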

2.5 Allow users to inspect ExecPlans (`show_query()` for `arrow_dplyr_query`)

- Jira: ARROW-15016
- this would be useful for debugging / allowing users to inspect a query plan
- Steps:
  - implement `$ToString` / `print()` methods for ExecPlan
  - wire them up to a new function, `show_arrow_query()`
- Estimated duration: 3-4 days
- Definition of done: being able to print / inspect a query plan. The following snippet of code should work:

```r
library(arrow)
library(dplyr)

nycflights13::flights %>%
  arrow_table() %>%
  filter(year == 2013, month == 5, day == 27) %>%
  show_arrow_query()
```

2.6 Work around masking of data type functions

- Jira: ARROW-12322
- several data type functions in the {arrow} package are named very generically, so they represent a large area for potential masking problems.
- should be tackled before "Allow bindings to use the `pkg::` prefixes"
- this would need to handle the following scenarios:
  - `type` is a variable in the calling environment whose value is a `DataType` object
  - `type` is a call to a function defined by the user that returns a `DataType` object
  - `type` is any arbitrary R expression that the user has wired up to return a `DataType` object
  - `type` is a call to a function in another package or in the user’s environment that masks the Arrow type function of the same name
- Steps:
- Definition of done:
  - the following code should run:

```r
library(arrow)
library(testthat)

# Works when type function is masked
string <- rlang::string
expect_equal(
  Array$create("abc", type = string()),
  arrow::string()
)

# Works with a non-Arrow function that returns an Arrow type
# when the non-Arrow function has the same name as a base R function...
str <- arrow::string
expect_equal(
  Array$create("abc", type = str()),
  arrow::string()
)

# ... and when it has the same name as an Arrow function
type <- arrow::string
expect_equal(
  Array$create("abc", type = type()),
  arrow::string()
)

# Works with local variable whose value is an Arrow type
type <- arrow::string()
expect_equal(
  Array$create("abc", type = type),
  arrow::string()
)
```
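
The scenarios listed above could, in principle, be handled by evaluating the `type` expression as the user wrote it and falling back to the Arrow type functions only when the result is not a `DataType`. The sketch below is one possible approach, not the one {arrow} uses: it runs in base R with a mock `DataType` class, and `new_data_type()`, `arrow_type_funcs` and `resolve_type()` are hypothetical names.

```r
# Mock of a DataType object and of arrow's type functions (assumptions).
new_data_type <- function(name) structure(list(name = name), class = "DataType")
arrow_type_funcs <- new.env(parent = baseenv())
assign("string", function() new_data_type("string"), envir = arrow_type_funcs)

# Hypothetical resolver: evaluate the user's expression as written; if that
# does not yield a DataType, retry with the Arrow type functions in front.
resolve_type <- function(type_expr, env = parent.frame()) {
  result <- tryCatch(eval(type_expr, env), error = function(e) NULL)
  if (inherits(result, "DataType")) {
    return(result)
  }
  mask <- list2env(as.list(arrow_type_funcs), parent = env)
  result <- eval(type_expr, mask)
  stopifnot(inherits(result, "DataType"))
  result
}

# A function in the user's environment masks `string()` and returns
# something that is not a DataType; the fallback still finds the Arrow one.
string <- function() "not a data type"
resolve_type(quote(string()))$name

# A local variable whose value is already a DataType is used as-is.
my_type <- new_data_type("int32")
resolve_type(quote(my_type))$name
```

A real implementation would also have to decide what to do when a masking function legitimately returns a different `DataType`, which this sketch resolves in favour of the user's function.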

3 Stretch goals

I haven’t detailed these, as that could be done depending on the progress of the work on the core tickets.

- and the differences between them and the functions they aim to replace
- no Jira ticket yet
- I’m thinking of an interface, maybe {shiny}, that allows a user to see, for example, what {lubridate} functionality is currently supported in Arrow and maybe explore the differences between the `fun` binding and `lubridate::fun`.

- Jira: ARROW-14855
- `build_expr()` should check that non-expression inputs have `vec_size() == 1L`
- This is a bit of an extra ticket, not strictly related to the ones above.

- Jira: ARROW-12093
- related to "More specific handling for known errors in `arrow_eval`"

- Jira: ARROW-16444
- related to "Allow users to `arrow_eval` a function"
- I think this issue might be on Dewey’s list