Possible error in setup
Closed this issue · 39 comments
Thanks for picking up the task and trying this. As the outcome 'does not pass the smell test', I had a really brief look in which I:
- Generated a smaller data set with your helper script: Rscript generate_data.R 1e6
- Created a script allin.R (below) to compare data.table (an outlier in your results), collapse, and dplyr.
My results are closer to what is expected:
edd@rob:~/git/1br(feature/simplication)$ Rscript allin.R
Unit: milliseconds
expr min lq mean median uq max neval cld
datatable 11.7441 14.2740 16.8943 15.4098 17.9822 47.4669 100 a
collapse 16.9737 18.9491 23.4794 20.6923 23.0194 84.8558 100 ab
dplyr 18.8856 20.9170 26.8166 22.3052 24.4867 369.0298 100 b
edd@rob:~/git/1br(feature/simplication)$
We can also plot this thanks to ggplot2::autoplot():
I am also not sure if going via tidypolars is fair to rpolars. It would be better to do this directly, and maybe add the tidypolars variant as an alternative.
The (very short) script follows.
D <- data.table::fread("measurements.csv")
res <- microbenchmark::microbenchmark(datatable = D[ ,.(mean=mean(measurement),min=min(measurement),max=max(measurement)),by=state],
collapse = D |> collapse::fgroup_by(state) |> collapse::fsummarise(min=min(measurement), max=max(measurement), mean=mean(measurement)) |> collapse::fungroup(),
dplyr = D |> dplyr::group_by(state) |> dplyr::summarise(min=min(measurement), max=max(measurement), mean=mean(measurement)) |> dplyr::ungroup())
print(res)
ggplot2::autoplot(res)
Hi! First, thanks for your comment!
I hear all the comments on the data.table result and I am surprised myself.
I agree that when I run a smaller dataset (e.g. 1e6) I get an analysis similar to yours, but for some reason when I run it with more than 500M rows the DT performance drops.
However, others who have tried the 1e9 dataset have not been able to reproduce my findings, specifically with regard to data.table.
I need to understand why I am getting my data.table results -- I tried to create a clean environment for each test with system2(), but it still isn't working well.
I'm going to update my post on twitter / readme to clarify that the data.table results are in dispute, as I don't want to spread misinformation.
I'll update the polars benchmark with the code you suggested and keep working the data.table angle of this.
I saw the tweet to you by @vincentarelbundock, and I also had a chance to hop onto a larger machine to validate my more traditional and standard use of benchmarking (i.e. directly write the expressions to measure in the call and let the benchmarking package and function do the housekeeping) -- which complements Vincent's re-run of your script.
In this large case of 1e9 rows, the difference between data.table and dplyr narrows, but both gain on collapse. So in short I am with Vincent here, and reiterate my concern from the top of this issue ticket. You likely want to build this up more carefully.
Code
res <- microbenchmark::microbenchmark(datatable = D[ ,.(mean=mean(measurement),min=min(measurement),max=max(measurement)),by=state],
collapse = D |> collapse::fgroup_by(state) |> collapse::fsummarise(min=min(measurement), max=max(measurement), mean=mean(measurement)) |> collapse::fungroup(),
dplyr = D |> dplyr::group_by(state) |> dplyr::summarise(min=min(measurement), max=max(measurement), mean=mean(measurement)) |> dplyr::ungroup(),
times = 10)
print(res)
ggplot2::autoplot(res)
Results Table
> res <- microbenchmark::microbenchmark(datatable = D[ ,.(mean=mean(measurement),min=min(measurement),max=max(measurement)),by=state],
+ collapse = D |> collapse::fgroup_by(state) |> collapse::fsummarise(min=min(measurement), max=max(measurement), mean=mean(measurement)) |> collapse::fungroup(),
+ dplyr = D |> dplyr::group_by(state) |> dplyr::summarise(min=min(measurement), max=max(measurement), mean=mean(measurement)) |> dplyr::ungroup(),
+ times = 10)
> res
Unit: seconds
expr min lq mean median uq max neval
datatable 26.47838 28.13759 30.19073 29.81849 31.90281 35.00091 10
collapse 50.19573 50.93365 53.09633 52.23915 56.23458 57.41016 10
dplyr 26.84754 27.93581 30.34774 30.00457 30.68157 35.96263 10
>
Results Plot
Hi! Thanks for taking the time to run the analysis; I really value your and @vincentarelbundock's feedback -- I am working on how to better run the analysis to avoid setup issues.
To clarify, when looking at your results, your DT and dplyr results are actually consistent with mine (e.g. both are around 30 seconds).
For your collapse result -- if you run the analysis with fmean, fmax, and fmin rather than the regular mean, min, and max, do you still see the same outcomes?
If you update with the collapse version of the aggregation metrics, I suspect you may get similar metrics to what I originally posted -- are you able to confirm?
In my latest sourcing, I converted each file to a bash file and ran it separately with system2(), because when running it directly as a call in microbenchmark I was getting feedback that each simulation may be altering the current session, which I wanted to avoid.
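For reference, this is roughly what that per-process setup looks like (a minimal sketch; the script name bench_datatable.R is hypothetical):
# Minimal sketch of the system2() approach described above: each solution lives in
# its own script (here the hypothetical 'bench_datatable.R'), runs in a fresh R
# process, and is timed from the outside.
elapsed <- system.time(
  system2("Rscript", "bench_datatable.R", stdout = FALSE, stderr = FALSE)
)[["elapsed"]]
print(elapsed)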
Good catch on me having dropped the f for the three collapse aggregators. But re-running yields essentially unchanged results:
> res <- microbenchmark::microbenchmark(datatable = D[ ,.(mean=mean(measurement),min=min(measurement),max=max(measurement)),by=state],
+ collapse = D |> collapse::fgroup_by(state) |> collapse::fsummarise(min=collapse::fmin(measurement), max=collapse::fmax(measurement), mean=collapse::fmean(measurement)) |> collapse::fungroup(),
+ dplyr = D |> dplyr::group_by(state) |> dplyr::summarise(min=min(measurement), max=max(measurement), mean=mean(measurement)) |> dplyr::ungroup(),
+ times = 10)
> res
Unit: seconds
expr min lq mean median uq max neval
datatable 29.06584 29.25456 30.59570 30.37740 31.55003 33.28734 10
collapse 51.01506 51.94234 53.87729 53.78601 54.91867 59.91626 10
dplyr 27.72498 29.45540 31.21465 30.65568 33.63867 35.84741 10
>
As for using system() or system2(): there are now decades worth of careful use of benchmark, microbenchmark, bench, ... all measuring from a single expression. It is common, and it is the correct thing to do. If you have side effects, rather remove the side effects? You get into new cans of worms by measuring new processes via system() or system2(). At least consider the alternative on test runs of 1e5 or 1e6 rows to validate your setup.
I also used duckdb with a direct SQL expression. It is way faster than the others by an order of magnitude on this (very large) machine, but it may also use more cores. One can then do the same easily with RSQLite, and it is way slower (!!). I am not convinced that preloading the data into a SQL database in memory is a fair comparison, so I dropped it. And I was unable to write a polars expression using column measurement more than once, but that may be on me. You should, I think, measure each at their strength and core use, and not via a 'tidy' redirect (which may be a second, more clearly labelled, benchmark set).
I am not convinced that preloading the data into a SQL database in memory is a fair comparison, so I dropped it.
DuckDB registers an R data.frame as a virtual table in the database and should not actually load the data, so there should be little cost to register the data in the table.
(See line 20 in 83615d3.)
Of course, SQLite does not have such a feature, so copying data will always take time in SQLite.
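For illustration, a minimal sketch of that difference (assuming the data.frame D used elsewhere in this thread; duckdb_register() only creates a view, while dbWriteTable() copies the data):
library(duckdb)
con <- DBI::dbConnect(duckdb())
system.time(duckdb_register(con, "dtable_view", D))                 # near-instant: a view, no copy
system.time(DBI::dbWriteTable(con, "dtable", D, overwrite = TRUE))  # copies all rows into duckdb
DBI::dbGetQuery(con, "SELECT state, min(measurement), max(measurement), avg(measurement)
                      FROM dtable_view GROUP BY state")
duckdb_unregister(con, "dtable_view")
DBI::dbDisconnect(con, shutdown = TRUE)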
You may have misread / misinterpreted what I wrote and meant, and I did not share my code. In my exploratory tests I had an explicit DBI::dbWriteTable(dcon, "dtable", D, overwrite=TRUE) for both (here 'd' for duckdb, same with 's' for sqlite). Given that representation, OLAP (i.e. duckdb) is seen as blazingly fast (and credit to duckdb for being very good at what it does), whereas OLTP (i.e. sqlite) is not. But I don't think it is an entirely fair example (though someone posted a complete DuckDB and SQL example at the 1brc site, including the data read).
My objection, really, is about measuring the comparables, i.e. direct by-group summaries in data.table, dplyr, collapse and (where I failed) (r)polars. I am much less interested in tidy translators, though that could be a second investigation.
I see.
By the way, as a complete aside, Polars has SQL support, so such a query can be handled completely in SQL.
https://rpolars.github.io/man/SQLContext_class/
If you use the polarssql package (which I wrote) you should be able to operate via the DBI package.
https://github.com/rpolars/r-polarssql
Interesting. We may be starting to misuse this thread, but when I try that (with the Jammy build of polars via r-universe) I get the same issue I had before when I tried to craft this by hand: the column in question only wants to be used once. Which is ... a little odd.
> lf <- polars::pl$LazyFrame(D)
> res <- polars::pl$SQLContext(frame = lf)$execute("select min(measurement), max(measurement), avg(measurement) from frame group by state")
Error: Execution halted with the following contexts
0: In R: in $schema():
0: During function call [.rs.describeObject(<environment>, "res", TRUE)]
1: Encountered the following error in Rust-Polars:
duplicate: column with name 'measurement' has more than one occurrences
Error originated just after this operation:
DF ["measurement", "state"]; PROJECT */2 COLUMNS; SELECTION: "None"
Error originated just after this operation:
ErrorStateSync(AlreadyEncountered(duplicate: column with name 'measurement' has more than one occurrences
Error originated just after this operation:
DF ["measurement", "state"]; PROJECT */2 COLUMNS; SELECTION: "None"))
DF ["measurement", "state"]; PROJECT */2 COLUMNS; SELECTION: "None"
>
My guess is that unlike most RDBMSs, Polars uses col for the column name of the result of something like func(col), so the column name of the result is duplicated.
Using AS and naming it differently would solve the error, wouldn't it? Like: select min(measurement) as min_m, max(measurement) as max_m, avg(measurement) as avg_m from frame group by state
Oh, for completeness, I am running the 1e9 examples on a large server on campus having 512GB RAM (!!) and 48 (slow-ish) Xeon cores at 2.1GHz. If memory serves, data.table will default to 50% of the cores, but I do not know right now what dplyr and collapse do with all this.
Most excellent:
> res <- polars::pl$SQLContext(frame = lf)$execute("select min(measurement) as min_m, max(measurement) as max_m, avg(measurement) as mean_m from frame group by state")
> res
[1] "polars LazyFrame naive plan: (run ldf$describe_optimized_plan() to see the optimized plan)"
SELECT [col("min_m"), col("max_m"), col("mean_m")] FROM
AGGREGATE
[col("measurement").min().alias("min_m"), col("measurement").max().alias("max_m"), col("measurement").mean().alias("mean_m")] BY [col("state")] FROM
DF ["measurement", "state"]; PROJECT */2 COLUMNS; SELECTION: "None"
>
and wicked fast (same as duckdb) when starting from the already-prepared data structure:
> system.time(res$collect())
user system elapsed
70.505 19.169 2.140
>
And still very respectable and even best in class when we do it all at once:
> system.time({lf <- polars::pl$LazyFrame(D); res <- polars::pl$SQLContext(frame = lf)$execute("select min(measurement) as min_m, max(measurement) as max_m, avg(measurement) as mean_m from frame group by state")$collect()})
user system elapsed
82.638 29.134 22.918
>
I will stick this into the benchmark proper. Big thank you, as always, @eitsupi !
We have a new champion:
> res <- microbenchmark::microbenchmark(datatable = D[ ,.(mean=mean(measurement),min=min(measurement),max=max(measurement)),by=state],
+ collapse = D |> collapse::fgroup_by(state) |> collapse::fsummarise(min=collapse::fmin(measurement), max=collapse::fmax(measurement), mean=collapse::fmean(measurement)) |> collapse::fungroup(),
+ dplyr = D |> dplyr::group_by(state) |> dplyr::summarise(min=min(measurement), max=max(measurement), mean=mean(measurement)) |> dplyr::ungroup(),
+ polars = {{lf <- polars::pl$LazyFrame(D); polars::pl$SQLContext(frame = lf)$execute("select min(measurement) as min_m, max(measurement) as max_m, avg(measurement) as mean_m from frame group by state")$collect()}},
+ times = 10)
> res
Unit: seconds
expr min lq mean median uq max neval
datatable 27.04999 27.57103 28.90611 29.30950 29.95487 30.76303 10
collapse 51.02113 51.96048 53.08097 52.93228 53.61110 56.55661 10
dplyr 28.85432 29.22231 31.12528 30.57937 33.38361 34.94457 10
polars 23.64507 23.92012 25.25210 24.77925 26.39392 28.33606 10
> ggplot2::autoplot(res)
>
Very impressive polars showing, with and without SQL. Even if I force the data.frame-to-polars conversion on each run, it wins:
> res <- microbenchmark::microbenchmark(datatable = D[ ,.(mean=mean(measurement),min=min(measurement),max=max(measurement)),by=state],
+ collapse = D |> collapse::fgroup_by(state) |> collapse::fsummarise(min=collapse::fmin(measurement), max=collapse::fmax(measurement), mean=collapse::fmean(measurement)) |> collapse::fungroup(),
+ dplyr = D |> dplyr::group_by(state) |> dplyr::summarise(min=min(measurement), max=max(measurement), mean=mean(measurement)) |> dplyr::ungroup(),
+ polars_sql = {{lf <- polars::pl$LazyFrame(D); polars::pl$SQLContext(frame = lf)$execute("select min(measurement) as min_m, max(measurement) as max_m, avg(measurement) as mean_m from frame group by state")$collect()}},
+ polars = polars::pl$DataFrame(D)$group_by("state")$agg(polars::pl$col("measurement")$min()$alias("min_m"),polars::pl$col("measurement")$max()$alias("max_m"),polars::pl$col("measurement")$mean()$alias("mean_m")),
+ times = 10)
> res
Unit: seconds
expr min lq mean median uq max neval
datatable 26.40483 27.36088 28.79502 28.61824 30.08184 32.24413 10
collapse 47.62153 49.16795 50.97285 50.07759 51.46359 60.20597 10
dplyr 28.40911 29.34691 30.85193 30.55275 31.97368 34.53324 10
polars_sql 22.37270 22.51855 24.28000 23.36114 26.76574 27.04358 10
polars 22.41094 22.66564 23.44019 22.94000 23.88544 25.63082 10
> ggplot2::autoplot(res)
>
And as the old 'it all depends' still holds, here is what I get back on my (slightly aged) computer with 1e6 and 1e7:
1e6
> D <- data.table::fread("measurements.1e6.csv")
+ sqltxt <- "select min(measurement) as min_m, max(measurement) as max_m, avg(measurement) as mean_m from frame group by state"
+ res <- microbenchmark::microbenchmark(datatable = D[ ,.(mean=mean(measurement),
+ min=min(measurement),
+ max=max(measurement)),by=state],
+ collapse = D |> collapse::fgroup_by(state) |> collapse::fsummarise(min=collapse::fmin(measurement),
+ max=collapse::fmax(measurement),
+ mean=collapse::fmean(measurement)) |> collapse::fungroup(),
+ dplyr = D |> dplyr::group_by(state) |> dplyr::summarise(min=min(measurement),
+ max=max(measurement),
+ mean=mean(measurement)) |> dplyr::ungroup(),
+ polars_sql = {lf <- polars::pl$LazyFrame(D); polars::pl$SQLContext(frame = lf)$execute(sqltxt)$collect()},
+ polars = polars::pl$DataFrame(D)$group_by("state")$agg(polars::pl$col("measurement")$min()$alias("min_m"),
+ polars::pl$col("measurement")$max()$alias("max_m"),
+ polars::pl$col("measurement")$mean()$alias("mean_m")),
+ times = 10)
+
+ print(res)
Unit: milliseconds
expr min lq mean median uq max neval cld
datatable 13.3638 13.7731 15.0158 14.9566 15.8077 17.4406 10 a
collapse 17.3992 18.0630 25.3196 23.7119 30.3520 42.5892 10 b
dplyr 20.3400 22.9261 25.4593 26.2902 26.8203 31.4192 10 b
polars_sql 32.3368 33.0572 34.8059 34.9782 35.3903 39.2898 10 c
polars 33.0885 33.2309 35.2959 35.1821 36.4092 38.9956 10 c
>
1e7
> D <- data.table::fread("measurements.1e7.csv")
+ sqltxt <- "select min(measurement) as min_m, max(measurement) as max_m, avg(measurement) as mean_m from frame group by state"
+ res <- microbenchmark::microbenchmark(datatable = D[ ,.(mean=mean(measurement),
+ min=min(measurement),
+ max=max(measurement)),by=state],
+ collapse = D |> collapse::fgroup_by(state) |> collapse::fsummarise(min=collapse::fmin(measurement),
+ max=collapse::fmax(measurement),
+ mean=collapse::fmean(measurement)) |> collapse::fungroup(),
+ dplyr = D |> dplyr::group_by(state) |> dplyr::summarise(min=min(measurement),
+ max=max(measurement),
+ mean=mean(measurement)) |> dplyr::ungroup(),
+ polars_sql = {lf <- polars::pl$LazyFrame(D); polars::pl$SQLContext(frame = lf)$execute(sqltxt)$collect()},
+ polars = polars::pl$DataFrame(D)$group_by("state")$agg(polars::pl$col("measurement")$min()$alias("min_m"),
+ polars::pl$col("measurement")$max()$alias("max_m"),
+ polars::pl$col("measurement")$mean()$alias("mean_m")),
+ times = 10)
+
+ print(res)
+ p1e7 <- ggplot2::autoplot(res) + ggplot2::ggtitle("N = 1e7") + tinythemes::theme_ipsum_rc()
Unit: milliseconds
expr min lq mean median uq max neval cld
datatable 138.280 153.186 179.441 163.553 172.607 325.932 10 a
collapse 387.709 401.260 437.516 426.304 475.471 526.429 10 b
dplyr 226.344 239.860 268.107 268.591 281.579 326.171 10 c
polars_sql 287.727 291.987 301.697 297.225 299.797 346.353 10 c
polars 291.714 292.332 298.029 297.448 301.144 307.805 10 c
>
With patchwork side by side:
We can be a little cute and use range rather than min and max directly. On 1e9 this gives me quite a reasonable saving with data.table (and I would presume dplyr).
library(data.table)
fun <- function(x) {
range <- range(x)
list(mean = mean(x), min = range[1], max = range[2])
}
DT <- fread("measurements.csv")
(res <- microbenchmark::microbenchmark(
dt = DT[ ,.(mean=mean(measurement), min=min(measurement), max=max(measurement)), by=state],
dt_range = DT[, fun(measurement), by = state],
times = 1,
check = "equal"
))
#> Unit: seconds
#> expr min lq mean median uq max neval
#> dt 17.52600 17.52600 17.52600 17.52600 17.52600 17.52600 1
#> dt_range 10.23352 10.23352 10.23352 10.23352 10.23352 10.23352 1
That's good. Likely nixes out one pass over the data.
For completeness, when I posted this on social media, @schochastics made two good points:
- We could and should include aggregate, tapply, by and even lapply(split(...), ...): apart from aggregate, which seems to stink in both formula mode and in standard two-argument mode, these still look fairly good at 1e7.
- We should convert state to a factor. This is, after all, an R test, and if we let (in some setups) others play to their strengths (i.e. copy into a DB table in memory), then we may as well let R have its preferred 'group by this var' setup.
I am also starting to think that these should really be presented / aggregated over 1e6, 1e7, 1e8, and 1e9 rows. The latter is interesting (if only for the comparison with 1brc), but that size seems to flatten some differences, as the core time seems to be spent running over the data vector to compute the statistics (three times, as @TimTaylor observed -- which one can improve upon too).
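A minimal sketch of how that could be collected (assuming per-size csv files named like the measurements.1e6.csv / measurements.1e7.csv used above; the 1e8 and 1e9 file names are assumptions):
sizes <- c("1e6", "1e7", "1e8", "1e9")
all_res <- do.call(rbind, lapply(sizes, function(n) {
  D <- data.table::fread(sprintf("measurements.%s.csv", n))
  res <- microbenchmark::microbenchmark(
    datatable = D[, .(mean(measurement), min(measurement), max(measurement)), by = state],
    dplyr = D |> dplyr::group_by(state) |>
      dplyr::summarise(mean = mean(measurement), min = min(measurement), max = max(measurement)),
    times = 5)
  cbind(as.data.frame(res), rows = n)   # keep the size as a column for later aggregation
}))
One can then summarise or facet all_res by rows and expr.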
Here is my base R benchmark excluding polars (and two versions of tapply, because I was wondering if do.call is faster):
D <- data.table::fread("measurements1e6.csv", stringsAsFactors = TRUE)
sum_stats_dplyr <- function(x) {
rg <- range(x)
tibble::tibble(mean = mean(x), min = rg[1], max = rg[2])
}
sum_stats_list <- function(x) {
rg <- range(x)
list(mean = mean(x), min = rg[1], max = rg[2])
}
sum_stats_vec <- function(x) {
rg <- range(x)
c(mean = mean(x), min = rg[1], max = rg[2])
}
microbenchmark::microbenchmark(
dplyr = D |> dplyr::group_by(state) |> dplyr::summarise(sum_stats_dplyr(measurement)) |> dplyr::ungroup(),
datatable = D[, .(sum_stats_list(measurement)), by = state],
collapse = D |> collapse::fgroup_by(state) |> collapse::fsummarise(sum_stats_dplyr(measurement)) |> collapse::fungroup(),
aggregate = aggregate(measurement ~ state, data = D, FUN = sum_stats_vec),
tapply1 = do.call("rbind", tapply(D$measurement, D$state, sum_stats_vec)),
tapply2 = sapply(tapply(D$measurement, D$state, sum_stats_vec), rbind),
lapply = {
results <- lapply(split(D, D$state), function(x) {
rg <- range(x$measurement)
data.frame(
state = unique(x$state),
min = rg[1],
max = rg[2],
mean = mean(x$measurement)
)
})
do.call(rbind, results)
},
by = by(D$measurement, D$state, sum_stats_vec), times = 10
)
Detailed results for N=1e8:
Unit: seconds
expr min lq mean median uq max neval
dplyr 3.66663 4.04299 4.41022 4.34344 4.48238 5.57671 10
datatable 3.50274 3.51561 3.90109 3.71339 4.03352 5.27914 10
collapse 5.15419 5.27137 5.70125 5.39071 5.91769 7.32761 10
tapply1 4.39228 4.51080 5.16972 5.07856 5.53077 6.79544 10
tapply2 3.60726 4.35811 4.91496 4.88236 5.18694 6.52182 10
lapply 9.28849 9.45204 10.11139 9.84460 10.78338 11.72989 10
by 7.08288 7.38480 8.12217 7.97933 8.42832 10.36398 10
OK, there is an interesting contender:
reduce = {
state_list <- split(D$measurement, D$state)
Reduce(function(x, y) {
res <- sum_stats_vec(state_list[[y]])
rbind(x, c(state = y, mean = res[1], min = res[2], max = res[3]))
}, names(state_list), init = NULL)
}
for N=1e8
Unit: seconds
expr min lq mean median uq max neval
dplyr 2.85810 3.10782 3.75953 3.30874 4.31709 6.26115 10
datatable 2.50896 2.61430 2.91622 2.76897 3.22380 3.63510 10
collapse 4.04716 4.13675 4.94366 4.54078 5.32954 6.82324 10
tapply1 3.17849 3.50470 4.00139 3.77609 4.01608 5.82055 10
tapply2 3.19185 3.41511 3.53030 3.46444 3.76154 3.89066 10
lapply 6.36581 7.12786 7.62443 7.37824 7.96529 9.47024 10
by 5.25023 5.42275 6.49624 6.27840 7.54884 8.45265 10
reduce 2.86560 3.23987 4.18529 3.60106 4.89074 6.89234 10
BTW, I found that sorting by, say, median time helps. I found an open issue at microbenchmark (for the plot function), but it really is just the older one-liner to recreate levels in order. I quickly wrapped that into a simpler helper in a 'misc goodies' package of mine: https://github.com/eddelbuettel/dang/blob/master/R/reorderMicrobenchmarkResults.R
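The gist of that helper is just relevelling the expr factor by a summary statistic; a minimal sketch (not the packaged function itself):
# Relevel the 'expr' factor of a microbenchmark result by median time so that
# print() and ggplot2::autoplot() list the fastest expression first.
reorder_by_median <- function(res) {
  med <- tapply(res$time, res$expr, median)
  res$expr <- factor(res$expr, levels = names(sort(med)))
  res
}
# ggplot2::autoplot(reorder_by_median(res))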
Sorry for spamming this issue so much. But I think this is the last one for me: I remembered that Rfast exists.
rfast = {
lev_int <- as.numeric(D$state)
minmax <- Rfast::group(D$measurement, lev_int, method = "min.max")
data.frame(
state = levels(D$state),
mean = Rfast::group(D$measurement, lev_int, method = "mean"),
min = minmax[1, ],
max = minmax[2, ]
)
}
again, N=1e8 (ordered by mean, thanks @eddelbuettel)
Unit: seconds
expr min lq mean median uq max neval
rfast 1.25014 1.41937 1.62158 1.67204 1.81019 1.97656 10
datatable 2.24556 2.47969 2.68018 2.68014 2.91236 2.99353 10
tapply2 2.56247 2.66205 3.07919 2.88239 3.71769 3.82766 10
reduce 2.54882 2.70055 2.96301 2.91298 3.02862 3.86465 10
tapply1 2.56365 2.65276 2.99696 2.97633 3.06698 4.03936 10
dplyr 2.98505 3.22951 3.40789 3.39959 3.57909 3.83317 10
collapse 4.02262 4.16275 4.55015 4.29150 4.66520 6.63107 10
by 4.81229 4.98914 5.18943 5.01704 5.48522 5.98817 10
lapply 5.85601 6.23911 6.57519 6.51415 6.61873 8.13780 10
Unfortunately I am failing to install polars on my old machine to check how it compares.
You can actually gain a few seconds on 1e9 if you apply an additional split-apply-combine (SAC) step to DT:
library(data.table)
library(datasets)
n <- 1e9
set.seed(2024)
y <- data.table(
m = stats::rnorm(n),
s = base::sample(datasets::state.abb, size = n, replace = TRUE)
)
yr <- microbenchmark::microbenchmark(
'dt ' = y[ , .(mean(m), min(m), max(m)), s],
'dt-sac' = rbindlist(lapply( unique(y$s), \(x) data.table( s = x, y[s == x , .(mean(m), min(m), max(m))] ) )),
times = 10
)
yr
ggplot2::autoplot(yr)
Unit: seconds
expr min lq mean median uq max neval cld
dt 25.08501 25.61196 25.82276 25.92505 26.02473 26.43565 10 a
dt-sac 22.35620 22.87455 23.41769 22.96819 23.25624 26.92710 10 b
For completeness, and as @TimTaylor had pointed out to me in a DM and at the data.table issue linked above, the development version of data.table is quite a bit faster:
Results
> res
Unit: seconds
expr min lq mean median uq max neval
datatable 11.15210 12.11368 13.50438 13.47416 14.62594 16.71623 10
polars 22.73818 22.84256 24.19483 23.97759 24.61460 26.98668 10
polars_sql 22.65298 22.89407 24.85488 24.19280 26.52703 28.26696 10
dplyr 27.50917 30.36789 30.73162 31.01001 31.30612 33.00118 10
collapse 48.79312 50.55301 53.45232 52.58460 54.01983 61.88502 10
tapply 55.46752 56.28768 57.49995 57.24879 58.14358 61.21602 10
lapply 71.10101 73.53757 75.22146 74.20503 77.62213 80.32287 10
by 71.32586 72.49985 74.28661 74.38054 75.79082 78.18177 10
>
Plot
This uses the current development branch of data.table; the respective changes have been in that branch for a while, as far as I know.
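For anyone wanting to reproduce this, a sketch of installing the development build (the repository URL follows the data.table installation notes; double-check there for the canonical instructions):
# Install the data.table development build from the project's package repository,
# or update an existing installation in place.
install.packages("data.table", repos = "https://rdatatable.gitlab.io/data.table")
# data.table::update_dev_pkg()   # alternative once a CRAN version is installed
packageVersion("data.table")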
Awesome - I hadn't compared with polars so that's an impressive surprise.
@eddelbuettel I hear you on your reasons for not including DuckDB. But since Polars is using the same OLAP-type engine, albeit here embedded directly in R "dialect", I feel it's only fair that it gets a showing too. (Also to show R users, specifically, just how freaking fast it is.)
As @eitsupi says, registering a DuckDB table in memory from R is pretty quick and painless. No real difference IMO to setting/scanning a lazy Polars frame.
Otherwise, very impressive speed improvements in the latest version of data.table. I believe an official 1.15.0 release is just around the corner...
duckdb memory vs duckdb disk vs data.table (dev)
@eddelbuettel I really don't know how they do it, but using the disk is faster than memory...
library(data.table)
library(datasets)
library(duckdb)
n <- 1e9
set.seed(2024)
y <- data.table(
m = stats::rnorm(n),
s = base::sample(datasets::state.abb, size = n, replace = TRUE)
)
cnm <- dbConnect(duckdb())
dbWriteTable(cnm, 'obr', y, row.names = FALSE)
cnf <- dbConnect(duckdb('~/temp/obr.dkdb')) # 9GB file
dbWriteTable(cnf, 'obr', y, row.names = FALSE)
yr <- microbenchmark::microbenchmark(
'dt' = y[ , .(mean(m), min(m), max(m)), s],
'dkm' = dbGetQuery(cnm, 'SELECT s, min(m), max(m), avg(m) from obr group by s'),
'dkf' = dbGetQuery(cnf, 'SELECT s, min(m), max(m), avg(m) from obr group by s'),
times = 10
)
yr
# Unit: seconds
# expr min lq mean median uq max neval cld
# dt 11.139752 11.691545 12.244513 12.529754 12.578781 12.665636 10 a
# dkm 1.634721 1.655189 1.664793 1.667811 1.676581 1.683979 10 b
# dkf 1.151143 1.172245 1.177155 1.177535 1.185760 1.188577 10 c
ggplot2::autoplot(yr)
dbDisconnect(cnm, shutdown = TRUE)
dbDisconnect(cnf, shutdown = TRUE)
I had these too (using a memory db), but hesitated showing them. Of course duckdb is blazingly fast here, and that is why we all love it. But the original one billion row benchmark started from a csv file. So starting from a data.frame (or, rather, a data.table inheriting from data.frame) seems like a fair comparison.
Not accounting for creating the duckdb internal data structure is, to me, not fair. So I didn't include it. Of course, once you have such a table, grouped summaries are fast. That's what the engine is there for. duckdb has other strengths too -- I just have not found a good apples-to-apples comparison. We're doing apples-to-peaches. Makes for a nice cobbler but ain't fair ...
On another note, having a big machine does not seem to be such a big deal with data.table (unless I got something wrong):
setDTT <- \(x) { setDTthreads(x); y[ , .(mean(m), min(m), max(m)), s] }
yr <- microbenchmark::microbenchmark(
'dt06' = setDTT( 6),
'dt12' = setDTT(12),
'dt18' = setDTT(18),
'dt24' = setDTT(24),
'dt30' = setDTT(30),
times = 10
)
yr
# Unit: seconds
# expr min lq mean median uq max neval cld
# dt06 11.74759 12.38181 12.69686 12.72297 13.02676 13.68092 10 a
# dt12 11.03751 11.44846 12.24313 12.24350 12.94132 13.41429 10 ab
# dt18 10.17322 10.31603 10.86685 10.86211 11.32207 11.55271 10 c
# dt24 10.62439 10.77381 11.47472 11.37302 11.96992 13.27930 10 bc
# dt30 10.12770 10.32557 11.25893 11.30042 11.73645 12.94521 10 c
ggplot2::autoplot(yr)
@eddelbuettel - I think the closest comparison (if you wanted to go that route) would be using the duckdb_register() function and including that as part of the timing (I'm not a ducks user but @eitsupi highlighted it earlier).
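A minimal sketch of that suggestion (reusing the data.table y with columns m and s from the examples above), with the registration inside the timed expression:
library(duckdb)
con <- DBI::dbConnect(duckdb())
microbenchmark::microbenchmark(
  duckdb_reg = {
    duckdb_register(con, "obr_view", y)   # registration is part of the timing
    DBI::dbGetQuery(con, "SELECT s, min(m), max(m), avg(m) FROM obr_view GROUP BY s")
    duckdb_unregister(con, "obr_view")
  },
  times = 10
)
DBI::dbDisconnect(con, shutdown = TRUE)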
That said - constraints are good... For my own timings I've made use of {inline} and a simple compiled loop ... 'tis all still R code but doesn't fit with the spirit, so 🤷‍♂️.
Not accounting for creating the duckdb internal data structure is, to me, not fair.
Setting up an index beforehand on the state column gets data.table nearer the duckdb result, with the differences between core counts now also appreciable to some extent:
setkey(y, 's')
yr <- microbenchmark::microbenchmark(
'dt06' = setDTT( 6),
'dt12' = setDTT(12),
'dt18' = setDTT(18),
'dt24' = setDTT(24),
'dt30' = setDTT(30),
times = 10
)
yr
# Unit: seconds
# expr min lq mean median uq max neval cld
# dt06 8.124833 8.681474 8.782995 8.877845 8.980673 9.372818 10 a
# dt12 7.016027 7.191650 7.522454 7.543799 7.796921 8.077547 10 b
# dt18 5.549116 5.945838 6.280376 6.316387 6.623486 6.734678 10 c
# dt24 4.905373 4.976062 5.426632 5.450692 5.854341 5.948980 10 d
# dt30 4.421263 4.525206 4.723312 4.593708 4.632048 5.478117 10 e
ggplot2::autoplot(yr)
using the disk is faster than memory...
I think what you see here is the Linux kernel helping you by turning disk into memory via caching ...
The threading example is good. As is using an index -- we'd get that via keyby=state instead of by=state too, methinks.
We can be a little cute and use range rather than min and max directly. On 1e9 this gives me quite a reasonable saving with data.table.
It generally should not, because range is not GForce-optimized. Are you by any chance on an older version of DT? Using verbose=TRUE is quite useful in such cases.
It generally should not, because range is not GForce-optimized. Are you by any chance on an older version of DT? Using verbose=TRUE is quite useful in such cases.
Yes, initially it was 1.14.10, but I have now switched to master and the direct (min, max) approach is much faster.
Maybe a bit late to the party, but here we go:
Setup from lvalnegri; the sample here has fifty different groups.
library(datasets)
library(data.table)
n <- 1e8
set.seed(2024)
y <- data.table(
m = stats::rnorm(n),
s = base::sample(datasets::state.abb, size = n, replace = TRUE)
)
setkey(y, 's')
setDTT <- \(x) { setDTthreads(x); y[ , .(mean(m), min(m), max(m)), s] }
yr <- microbenchmark::microbenchmark(
'dt02' = setDTT(2),
'dt04' = setDTT(4),
'dt06' = setDTT(6),
'dt08' = setDTT(8),
'dt10' = setDTT(10),
times = 10
)
yr
# 1.14.10
# Unit: seconds
# expr min lq mean median uq max neval cld
# dt02 3.003172 3.074757 3.148601 3.138230 3.240320 3.311348 10 a
# dt04 2.751539 2.840537 2.906491 2.877485 2.963776 3.119828 10 b
# dt06 2.699096 2.802055 2.850967 2.848263 2.914460 3.045719 10 b
# dt08 2.663448 2.682338 2.805176 2.787870 2.860383 3.063182 10 bc
# dt10 2.621527 2.682646 2.703065 2.705433 2.739840 2.802580 10 c
# 1.14.99 aka 1.15.0
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# dt02 1401.196 1515.518 1546.238 1542.442 1600.611 1670.684 10 a
# dt04 1106.861 1290.204 1318.007 1303.570 1365.536 1450.535 10 b
# dt06 1074.459 1087.036 1143.726 1128.395 1223.271 1235.099 10 c
# dt08 1036.661 1042.637 1116.963 1127.266 1176.382 1195.730 10 cd
# dt10 967.258 1021.693 1041.920 1026.675 1043.143 1143.767 10 d
# gmin parallelized per group
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# dt02 1221.1177 1264.3856 1316.9177 1311.3068 1392.7166 1416.7932 10 a
# dt04 844.3419 910.7867 1144.4065 941.2959 1035.7934 2937.8885 10 ab
# dt06 767.6377 787.4808 833.9695 812.1181 842.2414 1011.9838 10 b
# dt08 785.2025 802.7681 847.5161 814.6925 848.3238 1024.4026 10 b
# dt10 697.5232 753.0673 809.0413 799.7735 883.6270 907.5998 10 b
With more groups or more cores, the speedup of the per-group parallelized gmin relative to 1.15.0 should be even higher.
Hey everyone -- thanks especially to @eddelbuettel, @eitsupi, @TimTaylor and @schochastics for all the discussion!
I had business travel to Thailand and couldn't work on this, and quite frankly the conversation is way above my skill level.
What I have basically concluded is that it is a bit meaningless to call something a "benchmark" unless there are consistent and clear standards for said benchmark.
Seeing the various results, I noticed there can be considerable variability in running the platforms, which suggests that in addition to R environment / package issues, the underlying machine and configuration can add significant variability to results.
The lesson for me is that different packages can perform differently depending on how you set them up, the dataset's structure and your machine's configuration -- however, trying to declare a "winner" is futile unless the benchmark standards are clear.
Thank you everyone for spending time and effort in helping to understand the issue. I do eventually want to do a write-up on the learnings here, but I won't have time any time soon.
BTW, this is a good read: https://rdatatable.gitlab.io/data.table/articles/datatable-benchmarking.html -- basics mostly, but still important.