ropensci-archive/umapr

Consider *not* returning the input object attached to the embeddings

seaaan opened this issue · 5 comments

On large datasets with many variables, attaching the input means returning a huge object, which slows things down considerably and uses more memory.

Could make this optional behavior.

Maybe add a test on the size of the data.frame: it should be less than half of available memory?

Here's my proposal:

  • Accept numeric matrix or data frame.
  • Drop all non-numeric columns from data frame.
  • By default, cbind the UMAP dimensions with the original data frame, including the non-numeric columns (so, e.g., the column in iris identifying the species will be in the output, which is convenient).
    • Have the second argument to umap() be include_input, or something like that: a boolean that lets the user control whether to return the UMAP dimensions combined with the original data or just the UMAP dimensions. Default to returning the UMAP dimensions combined with the original data.
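To make the proposal concrete, here is a minimal sketch of the intended behavior. It is not the package's implementation: the function name `embed_sketch` is hypothetical, and `prcomp()` stands in for the real UMAP call so the example runs without Python. It assumes the input has at least two numeric columns.

```r
# Hypothetical sketch of the proposed interface; not the umapr implementation.
# prcomp() is only a placeholder for the actual UMAP embedding step.
embed_sketch <- function(data, include_input = TRUE) {
  # Accept a numeric matrix or a data frame.
  if (is.matrix(data)) {
    stopifnot(is.numeric(data))
    data <- as.data.frame(data)
  }
  stopifnot(is.data.frame(data))

  # Drop non-numeric columns before computing the embedding.
  is_num <- vapply(data, is.numeric, logical(1))
  numeric_part <- as.matrix(data[, is_num, drop = FALSE])

  # Placeholder embedding: the first two principal components.
  embedding <- as.data.frame(stats::prcomp(numeric_part)$x[, 1:2, drop = FALSE])
  names(embedding) <- c("UMAP1", "UMAP2")

  # Either attach the embedding to the full input (non-numeric columns
  # included) or return the embedding alone.
  if (include_input) cbind(data, embedding) else embedding
}

with_input <- embed_sketch(iris)                         # iris columns plus UMAP1, UMAP2
only_embedding <- embed_sketch(iris, include_input = FALSE)  # UMAP1, UMAP2 only
```

With `include_input = FALSE` the return value stays small no matter how wide the input is, which addresses the memory concern raised at the top of the issue.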

If we're returning the UMAP dimensions combined with the original data, we could test how much memory is available. I haven't done anything with testing for available memory. Is that difficult?
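One possible shape for such a check, as a rough sketch: the size of the input is easy to get with `utils::object.size()`, but finding the free memory is the platform-dependent part, so this hypothetical helper (`fits_in_half` is a made-up name) simply takes the available bytes as an argument rather than trying to detect them.

```r
# Hypothetical helper; available_bytes must be supplied by the caller,
# since this sketch makes no attempt to detect free memory itself.
fits_in_half <- function(data, available_bytes) {
  # object.size() gives an estimate of the memory used by the object, in bytes.
  as.numeric(utils::object.size(data)) < available_bytes / 2
}

fits_in_half(iris, available_bytes = 1e9)  # small data, generous budget
fits_in_half(iris, available_bytes = 100)  # same data, tiny budget
```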

What do you all think about that?

I think the first three points make sense. As for the last point (checking available memory), what would happen if there isn't enough memory? Error? Return only the UMAP dimensions?

I don't have a huge preference, though I think it is good to at least have the option to return the original data.

I took a look at what Rtsne::Rtsne() returns, and it is a list of ~13 items, one of which is the matrix of new dimensions. The rest of the items are information about the t-SNE run. It doesn't return the original data.

I am always annoyed by what I get back from Rtsne() or prcomp() because they require me to do some work to extract the results :) That is part of my motivation for wanting to do it differently here.
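For illustration, this is the kind of extraction step meant here, using prcomp() from base R on iris. The variable names are just for the example.

```r
# prcomp() returns a list; the new coordinates live in the $x element.
pca <- stats::prcomp(iris[, 1:4])
names(pca)  # "sdev" "rotation" "center" "scale" "x"

# To plot the components against species, the user has to pull out $x
# and bind it back onto the original data by hand.
plot_ready <- cbind(iris, pca$x[, 1:2])
head(plot_ready)
```

Returning the combined data frame by default would skip that last manual step.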