DataSlingers/clustRviz

Formula interface

Opened this issue · 8 comments

Please enable my laziness.

When a dataset has mixed data types, it can be a pain to pass an explicit matrix and then rejoin on that matrix down the line.

Proposals:

# assuming there's some nice column called labels

CARP(labels ~ Sepal.Width + Sepal.Length, data = iris)  

# or
CARP(formula = ~ Sepal.Length + Sepal.Width, data = iris, labels = ~ labels)

Generally this will be useful to anybody following the current tidyverse recommendations to keep observation names in a column rather than in rownames().

I was thinking that the formula method would really be more of a convenient way to subset than anything, and you'd still use model.matrix() to turn the data frame into a matrix, erroring out if the selections were bad (i.e. included non-numerics, or something not coercible to a transposable matrix).
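A minimal sketch of that idea in base R, for illustration only. The helper name formula_to_matrix is hypothetical and not part of clustRviz; it assumes a one-sided formula, uses model.frame() to check the selected columns, and lets model.matrix() build the matrix with the intercept dropped.

```r
# Hypothetical helper (not clustRviz API): turn a one-sided formula into a
# numeric matrix, erroring out if any selected column is non-numeric.
formula_to_matrix <- function(formula, data) {
  mf <- model.frame(formula, data = data)
  non_numeric <- names(mf)[!vapply(mf, is.numeric, logical(1))]
  if (length(non_numeric) > 0) {
    stop("Non-numeric columns selected: ", paste(non_numeric, collapse = ", "))
  }
  # "- 1" drops the intercept column that model.matrix() adds by default
  X <- model.matrix(update(formula, ~ . - 1), data = data)
  attr(X, "assign") <- NULL
  X
}

X <- formula_to_matrix(~ Sepal.Length + Sepal.Width, data = iris)
dim(X)

# A factor column such as Species triggers the error path
bad <- try(formula_to_matrix(~ Species, data = iris), silent = TRUE)
inherits(bad, "try-error")
```

The numeric check happens on the model frame rather than on the raw data, so a transformed term like log(Sepal.Length) would still pass.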

Not as familiar with CBASS though so I could be missing something obvious here.

The main use of CBASS is in situations where the rows and columns are interchangeable -- CBASS(X) is conceptually equal to CBASS(t(X)) -- so it's not quite a "tidy" input (in the sense of the tidyverse -- matrices are very tidy mathematically). You can think of an example where you have a gene-by-patient matrix - it's perfectly reasonable to cluster "both ways."

I feel like I read something in the tidyverse docs along the lines of "if it makes sense to transpose it, use a matrix and not a data frame" but I can't find it now...

I'm not sure what the formula interface would look like here since there's no sense of an "observation label" or a "response." Of the formula interfaces you give for CARP I prefer the first one, but I'm not sure how to extend it to biclustering - as is, it makes the row labels (species) look privileged over the column labels (Sepal.Length and Sepal.Width).

Hmmm. Following now. Yeah, not sure how to deal with this.

Looking at this again: I wonder if your first suggestion CARP(label ~ data1 + data2 + data3) would work for bi-clustering as well (CBASS(label ~ data1 + data2 + data3)).

Internally it would create a matrix X such that

X[,1] = data$data1
X[,2] = data$data2
X[,3] = data$data3
rownames(X) = data$label
colnames(X) = c("data1", "data2", "data3")

which is all we need for biclustering.

It's still a bit asymmetric, but I think it makes as much sense as anything else.
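The construction described above can be sketched in a few lines of base R. This is only an illustration: labelled_matrix and the toy data frame are made up here, and the helper assumes the formula uses plain column names with no transformations.

```r
# Hypothetical helper (not clustRviz API): the left-hand side of the formula
# names the label column, the right-hand side names the data columns.
labelled_matrix <- function(formula, data) {
  label_col <- all.vars(formula[[2]])  # variable on the left-hand side
  data_cols <- all.vars(formula[[3]])  # variables on the right-hand side
  stopifnot(length(label_col) == 1)
  X <- as.matrix(data[, data_cols, drop = FALSE])
  rownames(X) <- as.character(data[[label_col]])
  X
}

# Toy data, invented for this example
df <- data.frame(label = c("a", "b", "c"),
                 data1 = c(1, 2, 3),
                 data2 = c(4, 5, 6),
                 data3 = c(7, 8, 9))

X <- labelled_matrix(label ~ data1 + data2 + data3, data = df)
X
t(X)  # row and column names both survive transposition
```

Since both sets of names end up in dimnames(X), the result can be transposed without losing either, which is the property biclustering needs.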

Thoughts @alexpghayes?

I'm still trying to figure out what the tidyverse way of transposable data is. (Or - even bigger: tidy tensors!) Would folks actually store data like this or would it be in a "long" form and an interface like CBASS(response ~ row_label + column_label) make more sense?

So my thoughts on this are basically that you have a data frame in which the rows represent observations and the columns represent covariates. Almost always, more data is better. That is, the vast majority of the time, you want to use all of your rows, even if you want to cluster the columns in addition to the rows. If you want to fit to subsets of your rows, dplyr::filter() has your back. Or you can do a nest-purrr::map-unnest() thing if you want to fit many models on different subsets all at once. Anyway, I don't think that people really want to manipulate the rows.

However, people very often want to manipulate columns. The simplest example is when you collect lots of covariates but only think a couple of them will be important for clustering. In general, people want to do model.matrix()-type things to the feature space quite often. So this type of transformation (and also column subsetting) should be privileged in the interface. The formula should manipulate the columns, even if the columns are also being clustered.

Perhaps there are some workflows where some sort of expansions, aggregations or interactions amongst the rows are useful. In that case you'd probably need a whole new syntax anyway. Maybe I'm missing some basic intuition about a kind of data where the rows and the columns are genuinely interchangeable, but I'm having a hard time imagining one. The closest I can think of is the adjacency matrix of a directed graph, or raw image data, but you wouldn't want to bi-cluster this in raw form anyway.

Would folks actually store data like this or would it be in a "long" form and an interface like CBASS(response ~ row_label + column_label) make more sense?

Highly doubt it. Especially since the people who like to do bi-clustering are probably not the tidiest demographic. I would stick with a standard wide data format.

Hmmmmmm.... If the only thing people want to do with the formula interface is select columns, I'm not really sure that a formula interface is much more useful than data_frame %>% select(....) %>% mutate(...) %>% dplyring(...) %>% as.matrix %>% CARP. In general, the tidyselect language is a lot more flexible than the formula language.

There's one small hiccup I see: It's not clear how one gets row labels from a character column to the rownames of the output of as.matrix. There could be some value in a small helper here:

as_matrix.data.frame <- function(x, nm){
   nm <- rlang::enquo(nm) # capture the unquoted name of the label column
   x_mat <- as.matrix(dplyr::select(x, -!!nm))
   rownames(x_mat) <- dplyr::pull(x, !!nm) # one possible tidy eval approach; needs dplyr and rlang
   x_mat
}

but that's not exactly within the scope of this package.

Re: long format. Yeah - the tidyverse isn't fully set up for that kind of analysis. Think of, e.g., a time series where you've done a short-time Fourier transform to get a spectrogram (sliding windows of Fourier transforms). The long form of tidy data would be something like

data.frame(time = numeric(), frequency = numeric(), power = numeric(), phase = numeric())

but I don't think anyone actually does that. (Think of time and frequency as (i, j) coordinates of a data matrix). This generalizes nicely to tensors, but I don't think the tidyverse crowd works with this. (I could maybe see the tidyverts folks getting to something like this eventually.)
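For what it's worth, going from that long form back to an (i, j) data matrix is short in base R. The sketch below uses invented spectrogram values and assumes each (time, frequency) pair appears exactly once, so the sum in tapply() just returns the single value in each cell.

```r
# Made-up long-form data: one row per (time, frequency) cell
long <- expand.grid(time = c(0, 1, 2), frequency = c(10, 20))
long$power <- c(0.5, 0.7, 0.9, 1.5, 1.7, 1.9)

# Pivot to a time-by-frequency matrix; cells with no data would be NA
wide <- tapply(long$power,
               list(time = long$time, frequency = long$frequency),
               sum)
wide
dim(wide)
```

The resulting matrix carries time as row names and frequency as column names, so it is in the wide shape that a biclustering routine would expect.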

All in all, it sounds better to let this stew for a while and see what, if any, use cases there would be for a formula / data frame interface.