SebKrantz/collapse

[Just an observation] Relative performance degradation at largest scale?

eddelbuettel opened this issue · 7 comments

Not sure if you saw, but a few days ago everybody got all abuzz about the 'billion row challenge' (initially for Java). An R repo was set up (at https://github.com/alejandrohagan/1br), but a few of us had questions about an initial tweet (since removed). I ended up piling up a bunch of benchmark results in issue 5 there. With access to a machine with (many slow cores and) lots of RAM I was able to run the 1e9 data set (which is limited: just two columns, one grouping variable). Some results are there for 1e6, 1e7, 1e8 and then 1e9. I was wondering if you wanted to take a look -- it appears as if, 'relatively speaking', collapse does worse at the largest sizes.

Ok, so for full equivalence to data.table, call set_collapse(sort = FALSE, na.rm = FALSE) before running the query. That being said, it is very strange to see collapse slower than standard dplyr, as group-level vectorization at the C level should always be faster than executing an R function many times. Perhaps collapse needs to be attached for this to vectorize, but that should not be the case; try whether attaching it gives a performance increase. Also, in general, yes, data.table is faster for aggregating very long data because collapse's parallelism is not at the sub-column level.
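For concreteness, a minimal sketch of that setting (the exact benchmark query is not reproduced in this thread, so the data frame d and its columns grp/x are placeholders):

library(collapse)

# Match data.table's defaults: no sorting of groups, no skipping of NAs
set_collapse(sort = FALSE, na.rm = FALSE)

# Placeholder data: one grouping column and one value column, as in the benchmark
d <- data.frame(grp = rep(datasets::state.abb, length.out = 1e6),
                x   = rnorm(1e6))

d |>
  fgroup_by(grp) |>
  fsummarise(mean_x = fmean(x))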

Also, looking at this, what is the point of calling fungroup() after the aggregation? The data is not grouped anymore.
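A small illustration of why the call is redundant: fsummarise() already returns a plain, ungrouped data frame (the toy data here is made up):

library(collapse)

d <- data.frame(grp = rep(letters[1:3], 10), x = rnorm(30))

agg <- d |> fgroup_by(grp) |> fsummarise(mean_x = fmean(x))
class(agg)  # "data.frame" -- no grouping left, so a trailing fungroup() does nothing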

Final comment here: I see the grouping variable is datasets::state.abb, 50 character strings that are simply recycled. First of all, passing sort = FALSE to fgroup_by() (or setting it globally) should give a substantial improvement, as it hashes those strings rather than radix-ordering them (which happens under the sort = TRUE default). Secondly, of course 50 groups with big data does not yield a huge performance gain from C-level vectorization. Using 1-10 million groups should make the value of collapse much more apparent. It's also clear that frameworks that use SIMD instructions at the sub-column level (such as polars, arrow, duckdb) are much faster than either collapse or data.table on these operations. With high cardinality (many groups), both collapse and data.table fare substantially better compared to the state of the art.
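A rough sketch of both points, scaled well down from 1e9 rows and with a made-up high-cardinality grouping vector (this is not the benchmark code, just an illustration of the grouping cost in isolation):

library(collapse)
library(microbenchmark)

n <- 1e7
low_card  <- rep(datasets::state.abb, length.out = n)  # 50 recycled strings, as in the benchmark
high_card <- sample.int(1e6, n, replace = TRUE)        # roughly 1 million groups
x <- rnorm(n)

microbenchmark(
  radix_50_groups = fmean(x, g = GRP(low_card,  sort = TRUE)),   # default: radix-orders the strings
  hash_50_groups  = fmean(x, g = GRP(low_card,  sort = FALSE)),  # hashes them instead
  hash_1m_groups  = fmean(x, g = GRP(high_card, sort = FALSE)),  # many groups: the case collapse is built for
  times = 5
)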

what is the point of calling fungroup()

It's my reuse of the original code. I did not write it, and as I am personally a little unfamiliar with collapse I did not catch this. Concur regarding the '50 groups over 1 billion rows' aspect: not ideal, but that's how the Java one started. It is what it is.

Fairly big difference from set_collapse() as you suggested:

> res <- reorderMicrobenchmarkResults(res)
> res
Unit: seconds
      expr      min       lq     mean   median       uq      max neval
 datatable 13.29352 13.99912 14.79667 14.41736 15.31928 17.46298    10
     dplyr 20.02083 21.05116 22.19490 21.44268 23.63253 26.15127    10
  collapse 21.43290 21.62876 22.95050 22.78372 24.07729 24.84746    10
    tapply 26.85309 27.37872 27.79212 27.40178 27.97962 29.96800    10
    lapply 32.41954 33.09230 33.71422 33.35813 34.37957 35.32526    10
        by 40.94393 42.19884 43.29755 42.72820 45.01552 45.41018    10
    polars 46.25011 47.32710 48.63922 48.81734 49.80694 51.65158    10
> 

I now also load all packages (but still call with fully qualified package::function() syntax). Will remove fungroup().

The results clearly show that the benchmark is nonsensical. If base R is faster than polars, why are we all developing these libraries after all? Anyway, I'm personally also a bit tired of these benchmarks doing basic aggregations on big data. collapse is a package for Advanced and Fast Data Transformation, and it is listed in the CRAN task views on econometrics, time series, and weighted statistics...

Completely agree. And the OP over at that repo also seems to have given up on it.