polars (both the Rust source and the R implementation) are amazing
packages. I won’t argue here for the interest of using polars; there
are already a lot of resources on its website.
One characteristic of polars is that its syntax is 1) extremely
verbose, and 2) very close to the pandas syntax in Python. While this
makes it quite easy to read, it is yet another syntax to learn for R
users who have so far been accustomed to base R, data.table, or the
tidyverse.
The objective of tidypolars is to provide functions that are very
close to the tidyverse ones but that call the polars functions
under the hood, so that we don’t lose any of its capabilities.
Moreover, the objective is to keep tidypolars dependency-free with
the exception of polars itself (which has no dependencies).
Overall, you only need to add pl_ as a prefix to the tidyverse
function you’re used to. For example, dplyr::mutate() modifies classic
R dataframes, and tidypolars::pl_mutate() modifies polars’ DataFrames
and LazyFrames.
library(polars)
library(tidypolars)
pl_test <- pl$DataFrame(iris)
pl_test
#> shape: (150, 5)
#> ┌──────────────┬─────────────┬──────────────┬─────────────┬───────────┐
#> │ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species   │
#> │ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---       │
#> │ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat       │
#> ╞══════════════╪═════════════╪══════════════╪═════════════╪═══════════╡
#> │ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ setosa    │
#> │ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ setosa    │
#> │ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ setosa    │
#> │ 4.6          ┆ 3.1         ┆ 1.5          ┆ 0.2         ┆ setosa    │
#> │ …            ┆ …           ┆ …            ┆ …           ┆ …         │
#> │ 6.3          ┆ 2.5         ┆ 5.0          ┆ 1.9         ┆ virginica │
#> │ 6.5          ┆ 3.0         ┆ 5.2          ┆ 2.0         ┆ virginica │
#> │ 6.2          ┆ 3.4         ┆ 5.4          ┆ 2.3         ┆ virginica │
#> │ 5.9          ┆ 3.0         ┆ 5.1          ┆ 1.8         ┆ virginica │
#> └──────────────┴─────────────┴──────────────┴─────────────┴───────────┘
pl_test |>
  pl_filter(Species == "setosa") |>
  pl_arrange(Sepal.Width, -Sepal.Length)
#> shape: (50, 5)
#> ┌──────────────┬─────────────┬──────────────┬─────────────┬─────────┐
#> │ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species │
#> │ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---     │
#> │ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat     │
#> ╞══════════════╪═════════════╪══════════════╪═════════════╪═════════╡
#> │ 4.5          ┆ 2.3         ┆ 1.3          ┆ 0.3         ┆ setosa  │
#> │ 4.4          ┆ 2.9         ┆ 1.4          ┆ 0.2         ┆ setosa  │
#> │ 5.0          ┆ 3.0         ┆ 1.6          ┆ 0.2         ┆ setosa  │
#> │ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ setosa  │
#> │ …            ┆ …           ┆ …            ┆ …           ┆ …       │
#> │ 5.8          ┆ 4.0         ┆ 1.2          ┆ 0.2         ┆ setosa  │
#> │ 5.2          ┆ 4.1         ┆ 1.5          ┆ 0.1         ┆ setosa  │
#> │ 5.5          ┆ 4.2         ┆ 1.4          ┆ 0.2         ┆ setosa  │
#> │ 5.7          ┆ 4.4         ┆ 1.5          ┆ 0.4         ┆ setosa  │
#> └──────────────┴─────────────┴──────────────┴─────────────┴─────────┘
pl_test |>
  pl_mutate(
    Sepal.Total = Sepal.Length + Sepal.Width,
    Petal.Total = Petal.Length + Petal.Width
  ) |>
  pl_select(ends_with("Total"))
#> shape: (150, 2)
#> ┌─────────────┬─────────────┐
#> │ Sepal.Total ┆ Petal.Total │
#> │ ---         ┆ ---         │
#> │ f64         ┆ f64         │
#> ╞═════════════╪═════════════╡
#> │ 8.6         ┆ 1.6         │
#> │ 7.9         ┆ 1.6         │
#> │ 7.9         ┆ 1.5         │
#> │ 7.7         ┆ 1.7         │
#> │ …           ┆ …           │
#> │ 8.8         ┆ 6.9         │
#> │ 9.5         ┆ 7.2         │
#> │ 9.6         ┆ 7.7         │
#> │ 8.9         ┆ 6.9         │
#> └─────────────┴─────────────┘
Is tidypolars slower than polars?
No, or only marginally. The objective of tidypolars is not to modify
the data, simply to translate the tidyverse syntax to polars syntax.
polars is still in charge of doing all the data manipulations under
the hood.
Therefore, there might be minor overhead because we still need to parse
the expressions and rewrite them in polars syntax, but this should be
extremely marginal.
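To make the translation idea concrete, here is a minimal sketch that writes the same filter twice: once with a tidypolars verb and once directly in polars syntax. It assumes the package versions used in this document (where tidypolars functions carry the pl_ prefix). The object names pl_iris, via_tidypolars and via_polars are illustrative only.

```r
library(polars)
library(tidypolars)

# Illustrative object name; any polars DataFrame would do.
pl_iris <- pl$DataFrame(iris)

# tidyverse-style call: tidypolars rewrites the expression for polars.
via_tidypolars <- pl_iris |>
  pl_filter(Sepal.Length > 5)

# The same filter written by hand in polars syntax.
via_polars <- pl_iris$filter(pl$col("Sepal.Length") > 5)

# Both objects are polars DataFrames; converting them to R data.frames
# makes it easy to check that they hold the same rows.
identical(as.data.frame(via_tidypolars), as.data.frame(via_polars))
```

Either way, polars runs the filter; the tidypolars call only adds the step of rewriting the expression.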
Am I stuck with tidypolars?
No. As said above, tidypolars just translates one syntax into another;
it doesn’t touch the data itself. So if for some reason you want to go
back to “raw” polars syntax later in your code, you’re free to do so
because tidypolars will always return DataFrames, LazyFrames or
Series.
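As a small sketch of that round trip, the result of a tidypolars verb can be passed straight to regular polars methods. This again assumes the pl_-prefixed version of tidypolars shown in this document. The names setosa, mean_length and mean_sepal_length are made up for the example.

```r
library(polars)
library(tidypolars)

# Start with a tidypolars verb...
setosa <- pl$DataFrame(iris) |>
  pl_filter(Species == "setosa")

# ...and continue in "raw" polars syntax on the object it returned.
mean_length <- setosa$select(
  pl$col("Sepal.Length")$mean()$alias("mean_sepal_length")
)

mean_length
```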
Do I still need to load polars?
Yes, because tidypolars doesn’t provide any functions to create
polars DataFrames or LazyFrames, or to read data. You’ll still need
to use polars for this.
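A minimal sketch of that division of labour, assuming the same package versions as the rest of this document: polars creates the LazyFrame and collects the result, and tidypolars is only used for the verb in between. The name lazy_iris is illustrative.

```r
library(polars)
library(tidypolars)

# Creating the data is polars' job: build a DataFrame and make it lazy.
lazy_iris <- pl$DataFrame(iris)$lazy()

# tidypolars verbs then operate on the LazyFrame and return a LazyFrame.
res <- lazy_iris |>
  pl_mutate(Sepal.Total = Sepal.Length + Sepal.Width)

# Collecting the lazy query is again done with polars.
collected <- res$collect()
collected
```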
Do you have benchmarks?
Sure, but take them with a grain of salt: these small benchmarks may not
be representative of real-life scenarios and don’t necessarily use the
full capabilities of other packages (e.g. keyed data.tables). You should
refer to the DuckDB benchmarks for more serious ones.
library(polars)
library(tidypolars)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(data.table)
#>
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:dplyr':
#>
#> between, first, last
test <- data.frame(
  grp = sample(letters, 1e7, TRUE),
  val1 = sample(1:1000, 1e7, TRUE),
  val2 = sample(1:1000, 1e7, TRUE)
)
pl_test <- pl$DataFrame(test)
dt_test <- as.data.table(test)
bench::mark(
  polars = pl_test$
    groupby("grp")$
    agg(
      pl$col('val1')$mean()$alias('x'),
      pl$col('val2')$sum()$alias('y')
    ),
  tidypolars = pl_test |>
    pl_group_by(grp) |>
    pl_summarize(
      x = mean(val1),
      y = sum(val2)
    ),
  dplyr = test |>
    group_by(grp) |>
    summarize(
      x = mean(val1),
      y = sum(val2)
    ),
  data.table = dt_test[, .(x = mean(val1), y = sum(val2)), by = grp],
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 polars       82.4ms   86.1ms     11.5      139KB     0
#> 2 tidypolars   81.6ms   83.8ms     11.7      226KB     0
#> 3 dplyr         239ms  240.9ms      4.10     242MB     5.47
#> 4 data.table  187.2ms  215.9ms      4.43     273MB     2.95
bench::mark(
  polars = pl_test$
    filter(pl$col("grp") == "a" | pl$col("grp") == "b"),
  tidypolars = pl_test |>
    pl_filter(grp == "a" | grp == "b"),
  dplyr = test |>
    filter(grp %in% c("a", "b")),
  data.table = dt_test[grp %chin% c("a", "b")],
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 polars       42.3ms   45.4ms     21.6     15.1KB     0
#> 2 tidypolars   41.7ms   45.6ms     21.9     11.6KB     0
#> 3 dplyr       241.7ms  284.1ms      3.52   281.9MB     7.04
#> 4 data.table   24.4ms   25.4ms     36.2    100.6MB     1.91