DataSlingers/clustRviz

CARP crash


Running into the following:

> carp_fit <- CARP(tf, labels = track_features$track_uri)
X should be a matrix, not a data.frame. Converting with as.matrix(). (Called from CARP)
Pre-computing weights and edge sets
Error in if (num_connected_old == num_connected) { : 
  missing value where TRUE/FALSE needed

Also, it's taking ~10 minutes just to get to this error on a 2600 by 11 matrix -- is that expected?

Can send data for a reprex if necessary.

Data at: https://drive.google.com/file/d/1bBV2ECfmXhoaS6kcav1giSCG7QnjGZh8/view?usp=drivesdk

# this link is slightly different from the one above so that it works for direct downloads
track_features <- readr::read_csv("https://drive.google.com/uc?export=download&id=1OfewOmkgVpzgkr1_xvW2wepnD3wYdeDr")

tf <- track_features %>% 
  select_if(is.numeric)

carp_fit <- CARP(tf, labels = track_features$track_uri)

Thanks for sending -- I'm playing with it now and finding a few things on the slowness.

I still haven't found the source of the error (because of the slowness), but a few observations so far:

i) The internal is_connected_adjacency_mat function uses an algorithm which scales badly. (Essentially, it checks all paths of length 1, then all paths of length 2, etc., to make sure everything is connected. There's some linear algebra to make this faster than it might seem, but it's still not great.)

I can speed this up by re-using a more efficient algorithm already in the C++ code. This takes the check time from around 30s to 0.2s at each iteration.
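
For illustration, here's a minimal sketch of the two approaches (hypothetical helper names, not the package internals). One way to organize the matrix-power idea is to double the verified path length on each pass, at the cost of a dense matrix multiply per pass; the breadth-first search instead touches each edge only once:

is_connected_power <- function(A) {
  # A: symmetric 0/1 adjacency matrix
  n <- nrow(A)
  reach <- (A + diag(n)) > 0            # paths of length <= 1 (self included)
  repeat {
    reach_new <- (reach %*% reach) > 0  # doubles the verified path length
    if (all(reach_new == reach)) break
    reach <- reach_new
  }
  all(reach)                            # connected iff everything reaches everything
}

is_connected_bfs <- function(A) {
  # Breadth-first search from vertex 1; connected iff every vertex is reached
  n <- nrow(A)
  seen <- logical(n)
  seen[1] <- TRUE
  queue <- 1L
  while (length(queue) > 0) {
    v <- queue[1]
    queue <- queue[-1]
    nbrs <- which(A[v, ] != 0 & !seen)
    seen[nbrs] <- TRUE
    queue <- c(queue, nbrs)
  }
  all(seen)
}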

ii) Our default weighting scheme works by picking the smallest k-nearest-neighbor graph that is still fully connected. (See https://dataslingers.github.io/clustRviz/articles/ClustRVizWeights.html)

For this data, it looks like we need a very high value of k (still running, but at least > 250), which means we have lots of weights / edges to handle, which makes everything else slow.

This seems to be caused by a few outliers which are not among the nearest neighbors of any other point. (Take a look at pairs((tf %*% prcomp(tf)$rotation)[,1:3]) to see what I mean.)
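
As a rough sketch of what that scheme looks like (hypothetical helper, not the package code; phi is an assumed kernel parameter, not necessarily our default):

knn_gaussian_weights <- function(X, k, phi = 1) {
  D <- as.matrix(dist(X))
  W <- exp(-phi * D^2)     # Gaussian kernel: w_ij = exp(-phi * d_ij^2)
  diag(W) <- 0             # weights to self are meaningless
  # keep[i, j] is TRUE when j is among the k nearest neighbors of i
  # (rank k + 1 because the self-distance of 0 always ranks first)
  keep <- t(apply(D, 1, function(d) rank(d, ties.method = "first") <= k + 1))
  W * (keep | t(keep))     # symmetrize the kNN mask; everything else gets weight 0
}

Increasing k until the graph with adjacency (W != 0) is connected is exactly where the cost blows up when a few outliers force k past 250.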

iii) We find the smallest value of k by starting at k = 1, then trying k = 2, etc until we find something that works. This might not be optimally efficient for cases when the desired k is large. I can think about a bisection heuristic which might speed things up.
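
A sketch of that heuristic (is_connected_at_k is a hypothetical predicate, e.g. built from the helpers above): grow k geometrically until the graph connects, then bisect between the last failure and the first success, turning O(k) connectivity checks into O(log k):

smallest_connected_k <- function(is_connected_at_k, k_max) {
  lo <- 1L
  hi <- 1L
  while (hi < k_max && !is_connected_at_k(hi)) {  # geometric growth phase
    lo <- hi + 1L                 # everything up to hi is known to fail
    hi <- min(2L * hi, k_max)
  }
  while (lo < hi) {               # bisection: first k in [lo, hi] that connects
    mid <- (lo + hi) %/% 2L
    if (is_connected_at_k(mid)) hi <- mid else lo <- mid + 1L
  }
  lo                              # returns k_max if nothing smaller connects
}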

I might have spoken a bit too soon re: ii above.

I think a small k would give a connected graph, but the weight (which decays rapidly with the distance from the outlier to the bulk) is getting rounded to zero somewhere. Still diving into it...

I think there are some duplicated data points, in case that's a concern.

I did try running things with the duplicates removed, but it still crashed.

Ok -- I've found the problem that leads to the immediate error: it's essentially a numeric overflow getting turned into an NA, which breaks a conditional. That's an easy fix, since my new check for connectedness doesn't use multiplications.
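
For the record, here's a plausible minimal reproduction of that failure mode (assuming integer-valued path counts; this isn't the package's exact code path), which yields the exact error message reported above:

n <- .Machine$integer.max
num_connected <- n * 2L      # integer overflow in R yields NA (with a warning)
if (num_connected == n) { }  # Error: missing value where TRUE/FALSE needed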

Still working on the "apparent need for large k" problem.

Ok -- I've found the problem:

library(dplyr)
library(readr)
library(matrixStats)
track_features <- readr::read_csv("https://drive.google.com/uc?export=download&id=1OfewOmkgVpzgkr1_xvW2wepnD3wYdeDr")
tf <- track_features %>% select_if(is.numeric) %>% as.matrix

dtf <- as.matrix(dist(tf))
diag(dtf) <- Inf # Essentially ignore the diagonal, b/c weights to self are meaningless
big_distance <- max(colMins(dtf))

stopifnot(exp(-big_distance) == 0)

Essentially, (at least) one observation is so far away from every other point that, under our standard weighting scheme of w_ij = exp(-d_ij^2), all of its weights underflow to zero and it cannot be clustered.
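
To put a number on it: in double precision, exp() underflows to exactly zero once its argument drops below about -745, so any pairwise distance past sqrt(745) ≈ 27.3 already yields w_ij == 0:

exp(-745)  # ~4.9e-324, about the smallest representable positive double
exp(-746)  # exactly 0 -- underflow
sqrt(746)  # ~27.3: any d beyond this gives exp(-d^2) == 0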

I can definitely catch this earlier and give a clearer error message, but I'm not sure about the best way to handle it if we do want to cluster this data. Saying "your data is too well separated to cluster" seems like a pretty pathetic cop-out...

In this case, standardizing the columns of X seems to fix it, but I'm not sure if that's correct here (given the domain) or a generally robust answer.
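
If standardizing does turn out to be acceptable for this domain, it's a one-liner before the fit:

tf_std <- scale(tf)  # center each column and rescale to unit variance
carp_fit <- CARP(tf_std, labels = track_features$track_uri)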

Partial fix (for the speed stuff, not the zero weight issue) in #55.

I've added a check for totally disconnected vertices as part of PR #55 -- it's not a full solution, since it only moves the error earlier (and avoids checking all possible graph sparsification levels), but it should be a bit more user-friendly.
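
The spirit of the check, as a sketch (not the PR's exact code): flag any observation whose weight even to its nearest neighbor is already zero, before trying any sparsification level at all:

D <- as.matrix(dist(tf))
diag(D) <- Inf  # ignore self-distances, as in the reprex above
isolated <- which(exp(-matrixStats::colMins(D)^2) == 0)
if (length(isolated) > 0) {
  stop("Observation(s) ", toString(isolated),
       " have zero weight to every other point; consider rescaling X.")
}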

Partial fix merged in #55. Opening a new issue for a more permanent solution.