tidyverse/tibble

Allow sparse vectors in tibbles?

Closed this issue · 19 comments

How difficult would it be to allow for sparse vectors as part of tibbles?

For instance

> tibble(a = 1:10, b = Matrix::sparseVector(c(T, T, T), c(2, 5, 9), 10))
Error: Each variable must be a 1d atomic vector or list.
Problem variables: 'b'

My primary interest is in space efficient logical vectors.

Oh. I now see that tibble is a very lightweight wrapper around data.frame. So I guess the above is doomed.

We could do that, but probably not until we've switched to vctrs, and support for sparse columns is added there. What happens if you call as_tibble() on such a data frame?

Actually it works. If I remove the 1d check in dataframe.R I get:

> tt <- tibble(a = 1:10, b = sparseVector(c(T, T, T), c(2, 5, 9), 10))
> dd <- cbind(tt, tt)
> str(dd)
'data.frame':	10 obs. of  4 variables:
 $ a: int  1 2 3 4 5 6 7 8 9 10
 $ b:Formal class 'lsparseVector' [package "Matrix"] with 3 slots
  .. ..@ x     : logi  TRUE TRUE TRUE
  .. ..@ length: num 10
  .. ..@ i     : num  2 5 9
 $ a: int  1 2 3 4 5 6 7 8 9 10
 $ b:Formal class 'lsparseVector' [package "Matrix"] with 3 slots
  .. ..@ x     : logi  TRUE TRUE TRUE
  .. ..@ length: num 10
  .. ..@ i     : num  2 5 9
> dd[1, "b"]
sparse vector (nnz/length = 0/1) of class "lsparseVector"
[1] .

The problem is with the print method:

> tt
Error in prettyNum(.Internal(format(x, trim, digits, nsmall, width, 3L,  (from utils-format.r#89) : 
  first argument must be atomic

But that's easy to fix I would guess.

I can look deeper into it if there is interest.

Adjusting is_vector to

is_vector <- function(x) {
  is_atomic(x) || is.list(x) || inherits(x, "sparseVector")
}

is enough to fix the constructor.

I'd prefer if this was tackled with r-lib/vctrs#27, then tibbles get to support this data type once we switch to vctrs. For now you can probably work around at your own risk with as_tibble(validate = FALSE) .

I don't think you are right about passing the ball to that project. These two projects have orthogonal concerns.

The number of data structures supported by vctrs is limited by the design, because the whole set of vctrs operations must be re-implemented for each vector type. Then there will be an exponentially growing set of cross-type operations. If the aim is also to be high performance, then there is no way you can go beyond 2-3 core supported types. Look how bloated Matrix package is.

The core task of tibble is to store column data. Potentially arbitrary types of data could be stored as columns as long as those types support basic protocol (length, [, [[, what else?). Why would one want to artificially restrict this abstraction to vectrs?

Finaly the scope of that project is pretty undefined at the moment and it might take years till it finally takes shape.

I'd like to refrain from supporting in tibble anything that might bring dplyr in trouble. The support for nonstandard data types in dplyr is rather weak, to improve it we need a solid foundation like vctrs. So, currently you'll have to use validate = FALSE to override tibble's protection mechanisms, and be aware of potential consequences.

I don't see the issue of "exponential growth of cross-type operations" because vctrs will be very strict about what can be combined.

@hadley: Could you please comment?

Vctrs is where we will be thinking about "vector type" objects including thinking about what the generic API should be. A data frame/tibble is just a little of vectors of equal lengths. So this discussion definitely belongs in vctrs. I expect the bulk of vctrs to be complete by end of next year, and the a lot of the API design will be complete in the next 3 months.

@mmaechler: Now that vctrs is gaining traction, what's the best way to support sparse vectors and matrices in data frames or tibbles? Currently I see:

tibble::tibble(a = 1:10, b = Matrix::sparseVector(c(T, T, T), c(2, 5, 9), 10))
#> All columns in a tibble must be 1d or 2d objects:
#> * Column `b` is lsparseVector
data.frame(a = 1:10, b = Matrix::sparseVector(c(T, T, T), c(2, 5, 9), 10))
#> Error in as.data.frame.default(x[[i]], optional = TRUE): cannot coerce class 'structure("lsparseVector", package = "Matrix")' to a data.frame

Created on 2019-08-07 by the reprex package (v0.3.0)

Interesting: As Matrix::sparseVector() and the corresponding small class hierarchy have been very stable for many years, this may be an interesting test case. I would not go for "sparseMatrix" yet, as the class hierarchy is much much more sophisticated there.

As vectrs is not even on CRAN, it's too early for me to look further. For me, such sparse objects have always belonged to matrices rather than data frames, as matrices are much more efficient for these for what typical usage is, but then you seem to have reasonable use cases? To have as.data.frame.default() work for "sparseVector" objects (there are several: it's a hierarchy) seems to be asked much, but maybe there are other lower-level auxiliary methods that would be sufficient to you ?

Thanks, Martin.

{vctrs} 0.2.0 is on CRAN. The relevant text is https://vctrs.r-lib.org/articles/s3-vector.html, it's talking about S3 vectors but perhaps we can use S3 dispatch for the S4 classes?

Outside of vctrs, we could see if we can implement an as.data.frame.lsparseVector() S3 method for the sparse vector class; perhaps we don't need to change base for this?

I suspect it's sufficient to implement methods for the base class(es).

We could start the work in a package that imports {Matrix} and then incorporate if desired.

@lionel-: What would be a good vec_proxy() implementation for the "lsparseVector" class?

I'm not familiar with sparse implementations. Is it similar to run-length encoding? Maybe a record vector would be appropriate?

It's an S4 class with two vector and one scalar components.

x <- Matrix::sparseVector(1:3, 3:5, 7)
x
#> sparse vector (nnz/length = 3/7) of class "isparseVector"
#> [1] . . 1 2 3 . .
as.numeric(x)
#> [1] 0 0 1 2 3 0 0
x@x
#> [1] 1 2 3
x@i
#> [1] 3 4 5
x@length
#> [1] 7

Created on 2019-11-01 by the reprex package (v0.3.0)

Thanks, I'll look into it once I'm done with tidyselect. I'm going to implement methods for third party packages like sf. By the way, I think the tibble methods should also stay in vctrs for a while, because there will probably be API changes in the coming year.

Does this also mean that 98cac0d (and other packages we discover) belongs in vctrs? Happy to submit PRs.

Are we still recommending the "list subclass" approach, where we treat each S3 class that explicitly inherits from "list" as a vector? I forgot where I read about it.

If it's an important class feel free to submit a PR implementing methods. Perhaps after I have implemented sf, so that we have figured out the layout, conventions, and testing setup.

We're definitely going to do the explicit list inheritance approach, that's the first thing I'll tackle when I get back to work on vctrs, which I plan to this week.

Requires r-lib/vctrs#332, r-lib/vctrs#550 and probably r-lib/vctrs#551, before we can start adding vctrs support to {Matrix} classes. Once the following returns TRUE we're good to go:

vctrs::vec_is(Matrix::sparseVector(integer(), integer(), 0))
#> [1] FALSE

Created on 2020-03-11 by the reprex package (v0.3.0)

Even then this would have to happen in a different place, I'm thinking about a {mtrx} package which could eventually be incorporated into {Matrix}.

This old thread has been automatically locked. If you think you have found something related to this, please open a new issue and link to this old issue if necessary.