DataSlingers/clustRviz

CARP crash


Running into the following:

> carp_fit <- CARP(tf, labels = track_features$track_uri)
X should be a matrix, not a data.frame. Converting with as.matrix(). (Called from CARP)
Pre-computing weights and edge sets
Error in if (num_connected_old == num_connected) { : 
  missing value where TRUE/FALSE needed

Also, it's taking ~10 minutes just to get to this error on a 2600 by 11 matrix -- is that expected?

Can send data for a reprex if necessary.

Data at: https://drive.google.com/file/d/1bBV2ECfmXhoaS6kcav1giSCG7QnjGZh8/view?usp=drivesdk

# this link is slightly different from the one above so that it works for direct downloads
track_features <- readr::read_csv("https://drive.google.com/uc?export=download&id=1OfewOmkgVpzgkr1_xvW2wepnD3wYdeDr")

tf <- track_features %>% 
  select_if(is.numeric)

carp_fit <- CARP(tf, labels = track_features$track_uri)

Thanks for sending -- I'm playing with it now and finding a few things on the slowness.

I still haven't found the source of the error (because of the slowness), but a few observations so far:

i) The internal is_connected_adjacency_mat function uses an algorithm which scales badly. (Essentially, it checks all paths of length 1, then all paths of length 2, etc., to make sure everything is connected. There's some linear algebra to make this faster than it might seem, but it's still not great.)

I can speed this up by re-using a more efficient algorithm already in the C++ code. This takes the check time from around 30s to 0.2s at each iteration.
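
For illustration, here's a minimal sketch of the two approaches (hypothetical helper names, not the package internals). One way to organize the matrix-power idea is to double the verified path length on each pass, at the cost of a dense matrix multiply per pass; the breadth-first search instead touches each edge only once:

is_connected_power <- function(A) {
  # A: symmetric 0/1 adjacency matrix
  n <- nrow(A)
  reach <- (A + diag(n)) > 0            # paths of length <= 1 (self included)
  repeat {
    reach_new <- (reach %*% reach) > 0  # doubles the verified path length
    if (all(reach_new == reach)) break
    reach <- reach_new
  }
  all(reach)                            # connected iff everything reaches everything
}

is_connected_bfs <- function(A) {
  # Breadth-first search from vertex 1; connected iff every vertex is reached
  n <- nrow(A)
  seen <- logical(n)
  seen[1] <- TRUE
  queue <- 1L
  while (length(queue) > 0) {
    v <- queue[1]
    queue <- queue[-1]
    nbrs <- which(A[v, ] != 0 & !seen)
    seen[nbrs] <- TRUE
    queue <- c(queue, nbrs)
  }
  all(seen)
}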

ii) Our default weighting scheme works by picking the smallest k-nearest-neighbor graph that is still fully connected. (See https://dataslingers.github.io/clustRviz/articles/ClustRVizWeights.html)

For this data, it looks like we need a very high value of k (still running, but at least > 250), which means we have lots of weights / edges to handle, which makes everything else slow.

This seems to be caused by a few outliers which are not among the nearest neighbors of any other point. (Take a look at pairs((tf %*% prcomp(tf)$rotation)[,1:3]) to see what I mean.)
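
As a rough sketch of what that scheme looks like (hypothetical helper, not the package code; phi is an assumed kernel parameter, not necessarily our default):

knn_gaussian_weights <- function(X, k, phi = 1) {
  D <- as.matrix(dist(X))
  W <- exp(-phi * D^2)     # Gaussian kernel: w_ij = exp(-phi * d_ij^2)
  diag(W) <- 0             # weights to self are meaningless
  # keep[i, j] is TRUE when j is among the k nearest neighbors of i
  # (rank k + 1 because the self-distance of 0 always ranks first)
  keep <- t(apply(D, 1, function(d) rank(d, ties.method = "first") <= k + 1))
  W * (keep | t(keep))     # symmetrize the kNN mask; everything else gets weight 0
}

Increasing k until the graph with adjacency (W != 0) is connected is exactly where the cost blows up when a few outliers force k past 250.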

iii) We find the smallest value of k by starting at k = 1, then trying k = 2, etc until we find something that works. This might not be optimally efficient for cases when the desired k is large. I can think about a bisection heuristic which might speed things up.
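
A sketch of that heuristic (is_connected_at_k is a hypothetical predicate, e.g. built from the helpers above): grow k geometrically until the graph connects, then bisect between the last failure and the first success, turning O(k) connectivity checks into O(log k):

smallest_connected_k <- function(is_connected_at_k, k_max) {
  lo <- 1L
  hi <- 1L
  while (hi < k_max && !is_connected_at_k(hi)) {  # geometric growth phase
    lo <- hi + 1L                 # everything up to hi is known to fail
    hi <- min(2L * hi, k_max)
  }
  while (lo < hi) {               # bisection: first k in [lo, hi] that connects
    mid <- (lo + hi) %/% 2L
    if (is_connected_at_k(mid)) hi <- mid else lo <- mid + 1L
  }
  lo                              # returns k_max if nothing smaller connects
}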

I might have spoken a bit too soon re: ii above.

I think a small k would give a connected graph, but the weight (which decays rapidly with the distance from the outlier to the bulk) is getting rounded to zero somewhere. Still diving into it...

I think there are some duplicated data points, in case that's a concern.

I did try running things with the duplicates removed, but it still crashed.

Ok -- I've found the problem that leads to the immediate error: it's essentially a numeric overflow getting turned into an NA, which breaks a conditional. That's an easy fix, since my new check for connectedness doesn't use multiplications.
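
For the record, here's a plausible minimal reproduction of that failure mode (assuming integer-valued path counts; this isn't the package's exact code path), which yields the exact error message reported above:

n <- .Machine$integer.max
num_connected <- n * 2L      # integer overflow in R yields NA (with a warning)
if (num_connected == n) { }  # Error: missing value where TRUE/FALSE needed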

Still working on the "apparent need for large k" problem.

Ok -- I've found the problem:

library(dplyr)
library(readr)
library(matrixStats)
track_features <- readr::read_csv("https://drive.google.com/uc?export=download&id=1OfewOmkgVpzgkr1_xvW2wepnD3wYdeDr")
tf <- track_features %>% select_if(is.numeric) %>% as.matrix

dtf <- as.matrix(dist(tf))
diag(dtf) <- Inf # Essentially ignore the diagonal, b/c weights to self are meaningless
big_distance <- max(colMins(dtf))

stopifnot(exp(-big_distance) == 0)

Essentially, (at least) one observation is so far away from every other point that, under our standard weighting scheme of w_ij = exp(-d_ij^2), all of its weights underflow to zero and it cannot be clustered.
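
To put a number on it: in double precision, exp() underflows to exactly zero once its argument drops below about -745, so any pairwise distance past sqrt(745) ≈ 27.3 already yields w_ij == 0:

exp(-745)  # ~4.9e-324, about the smallest representable positive double
exp(-746)  # exactly 0 -- underflow
sqrt(746)  # ~27.3: any d beyond this gives exp(-d^2) == 0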

I can definitely catch this earlier and give a clearer error message, but I'm not sure about the best way to handle it if we do want to cluster this data. Saying "your data is too well separated to cluster" seems like a pretty pathetic cop-out...

In this case, standardizing the columns of X seems to fix it, but I'm not sure if that's correct here (given the domain) or a generally robust answer.
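
If standardizing does turn out to be acceptable for this domain, it's a one-liner before the fit:

tf_std <- scale(tf)  # center each column and rescale to unit variance
carp_fit <- CARP(tf_std, labels = track_features$track_uri)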

Partial fix (for the speed stuff, not the zero weight issue) in #55.

I've added a check for totally disconnected vertices as part of PR #55 -- it's not a full solution, since it only moves the error earlier (and avoids checking all possible graph sparsification levels), but it should be a bit more user-friendly.
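
The spirit of the check, as a sketch (not the PR's exact code): flag any observation whose weight even to its nearest neighbor is already zero, before trying any sparsification level at all:

D <- as.matrix(dist(tf))
diag(D) <- Inf  # ignore self-distances, as in the reprex above
isolated <- which(exp(-matrixStats::colMins(D)^2) == 0)
if (length(isolated) > 0) {
  stop("Observation(s) ", toString(isolated),
       " have zero weight to every other point; consider rescaling X.")
}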

Partial fix merged in #55. Opening a new issue for a more permanent solution.