r-lib/vctrs

Remapping proxies and example of rle class

Opened this issue · 13 comments

I.e. compress a numeric vector using rle(). vec_proxy() and vec_restore() are trivial to implement, but you can get better performance for some operations (e.g. vec_match()) by implementing a specific method.
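To illustrate why a specific method can beat going through an expanded proxy, here is a minimal base-R sketch. It does not use the vctrs API; `rle_match()` is a hypothetical helper, not an existing vctrs or base function. It matches needles against the run values only, then converts the run position to a position in the uncompressed vector.

```r
# Hypothetical helper: match() against an RLE-compressed haystack
# without expanding it. `haystack` is a base-R "rle" object.
rle_match <- function(needles, haystack) {
  stopifnot(inherits(haystack, "rle"))
  # Start position of each run in the uncompressed vector.
  starts <- cumsum(c(1L, head(haystack$lengths, -1L)))
  # First run holding each needle; the start of that run is the first match.
  run <- match(needles, haystack$values)
  starts[run]
}

x <- c(5, 5, 5, 2, 2, 9, 5, 5)
compressed <- rle(x)

# Same answer as matching against the full-size vector.
identical(rle_match(c(9, 5, 7), compressed), match(c(9, 5, 7), x))
```

The work done by `match()` here scales with the number of runs rather than the size of the represented vector, which is the kind of gain a dedicated method could offer.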

I may be missing something, but it looks like we need to add native RLE support to vctrs. Otherwise we need to expand the proxy to full size.

Or maybe we just need to add a customisation point for mapping slice indices from their value-space to the proxy-space? This would also be useful for implementing sparse types like Matrix classes, e.g. tidyverse/tibble#196

This is just an example for the vignette. You’d show how to provide a method for something other than the proxy in order to improve performance for a specific case.

How can RLE classes be implemented with the current vctrs API?

Oh because vec_match() is no longer generic?

vec_match() is not generic, but I was wondering about slicing indices. It seems RLE and sparse classes need a way of mapping the indices from value-space to proxy-space, since the proxy is compressed. For instance the proxy might be size 3 even though the actual vector is size 10, and slicing index 5 is valid.
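A rough sketch of what such a mapping could look like, in base R only. `value_to_proxy()` and `rle_slice()` are hypothetical names for illustration, not part of vctrs. The idea is to locate the run that contains each requested position by searching the cumulative run lengths.

```r
# Hypothetical mapping from value-space positions to proxy-space (run) positions.
value_to_proxy <- function(run_lengths, i) {
  ends <- cumsum(run_lengths)
  stopifnot(all(i >= 1), all(i <= sum(run_lengths)))
  # Position i belongs to run j when ends[j - 1] < i <= ends[j].
  findInterval(i, ends, left.open = TRUE) + 1L
}

# Hypothetical slicing of a base-R "rle" object that never expands the input.
rle_slice <- function(x, i) {
  stopifnot(inherits(x, "rle"))
  run <- value_to_proxy(x$lengths, i)
  # Recompress the selected values.
  rle(x$values[run])
}

# The proxy has 3 runs even though the represented vector has size 10.
x <- rle(c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3))
value_to_proxy(x$lengths, 5)
inverse.rle(rle_slice(x, c(5, 2, 2)))
```

Index 5 is out of bounds for the size-3 proxy but valid for the vector, so the bounds check has to happen in value-space before the remapping.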

Regarding slicing of RLE proxies, we also need a way to slice non-contiguous groups, such as:

vec_group_rle(mtcars[c("cyl", "am")])
#> <vctrs_group_rle[17][n = 6]>
#>  [1] 1x2 2x1 3x1 4x1 3x1 4x1 5x2 3x2 4x6 2x3 5x1 4x4 2x3 6x1 1x1 6x1 2x1

This could also be handled by a function that maps indices from value-space to proxy-space.
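As a sketch of that idea, the runs printed above can be turned into row locations for one group without expanding the whole vector. This is base R only; `group_locations()` is a hypothetical helper, and the two input vectors are copied by hand from the output shown above.

```r
# Hypothetical helper: rows covered by group `g`, given one group id and
# one length per run.
group_locations <- function(group, run_lengths, g) {
  ends <- cumsum(run_lengths)
  starts <- ends - run_lengths + 1
  hit <- which(group == g)
  # One contiguous range per matching run, concatenated in order.
  as.integer(unlist(Map(seq.int, starts[hit], ends[hit])))
}

# Runs transcribed from the vec_group_rle() output: "1x2 2x1 3x1 ...".
group       <- c(1, 2, 3, 4, 3, 4, 5, 3, 4, 2, 5, 4, 2, 6, 1, 6, 2)
run_lengths <- c(2, 1, 1, 1, 1, 1, 2, 2, 6, 3, 1, 4, 3, 1, 1, 1, 1)

# Group 3 is spread over three separate runs.
group_locations(group, run_lengths, 3)
```

The result is a set of non-contiguous ranges, which is why a single start/length pair per group is not enough here.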

There is a conceptual issue with RLE vectors: the underlying storage has a smaller size than the vector it represents. Then how can data frames and their proxies contain RLE vectors?

Maybe the actual RLE value should be a vec_na() vector, which can be implemented as an ALTREP class in R >= 3.5. Then foreign code like data.frame() can check the size constraint.

Regarding the proxy, we still have the problem that data frames can't contain smaller vectors, so vec_proxy(data.frame(x = rle(...))) is still problematic. Maybe the proxy also needs to be an ALTREP vector, which somewhat defeats the purpose of the class.

Our position could be that if you want to have a vector where vec_size(vec_data(x)) != vec_size(x), then you need to do it in ALTREP.

Do we also need to add a hook for vec_size() or look for a length() method? Tried to wrap sparse vectors from {Matrix} and got stuck there. (Internal {vctrs4} package.)
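For reference, this is roughly what a length() method on a compressed class looks like in plain S3. The class name `sketch_rle` and its constructor are made up for this example, and nothing here implies that vctrs consults such a method; that is exactly the open question.

```r
# Hypothetical S3 class storing run lengths and values in a list.
new_sketch_rle <- function(x) {
  runs <- rle(x)
  structure(
    list(lengths = runs$lengths, values = runs$values),
    class = "sketch_rle"
  )
}

# Report the size of the represented vector, not of the storage.
length.sketch_rle <- function(x) {
  sum(unclass(x)$lengths)
}

x <- new_sketch_rle(c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3))
length(x)           # size in value-space
length(unclass(x))  # size of the underlying list
```

The two sizes disagree, so any code that sizes the object from its underlying data rather than through the method would see the storage size instead.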

See https://github.com/lionel-/RER/pull/1

This is probably too much work for the next version of vctrs. It will have to wait for the one after.

Nice writeup!

boom! been looking for a non-dplyr group_indices, thanks! I love the logical separation of the group_by machinery into those simple functions, brilliant. Nearly 2X speedup right there.

(btw @lionel- you can go quite a long way with R's numeric comparison for geospatial equality - way further than I thought possible - so I'm interested by that note in your write up)

Note that some functions of this API will be renamed in vctrs 0.4.0.