lmcinnes/pynndescent

BUG: redundant edges in output when `low_memory=False`

Closed this issue · 10 comments

I was running some test sets as I try to start working on #63 and found this, which seems rather alarming as it will lower the effective k of the kNN by up to half. It appears to be data-dependent, as when I generated random data to make a reproducible example, it didn't always show up.

rng = np.random.RandomState(0)
some_data = rng.uniform(0, 1, size=(100, 100))

# just look at neighbors of first element
NNDescent(some_data, low_memory=False, random_state=rng).neighbor_graph[0][0]

outputs:

array([ 0, 31, 69, 94, 63, 59, 47, 64, 64, 51, 34, 95, 93, 38, 45, 45, 46,
       90, 17, 19, 53, 60, 80, 30, 30, 26, 26, 50, 50, 73])

Where 64, 45, 30, 26, and 50 are all repeated.

edit: Updated output to reflect current master. I had changes in my local repo but those were not responsible.

I decided to look into this...when I changed the use of flagged_heap_push to checked_flagged_heap_push in the function utils.apply_graph_updates_high_memory the redundant edges go away.

Also, the high memory version doesn't seem to be any faster on test data. What's it good for?

Thanks for the fix.

The low_memory=False is a holdover from earlier versions. It made a significant different on certain datasets (and proved extremely memory expensive on others). I haven't bothered to remove it because it can be useful at times.

I'm having the same issue with but with low_memory=True

Screenshot 2022-12-27 at 1 38 02 PM

That is a weird test 😂 for one thing, it doesn't tell us the background rate. For another, if there are multiple edges with 0 length I don't think it's guaranteed that k will the first edge for node k (i.e. it could be in i[1:], but i[0] != k so it's still a unique edge).

Do you see redundant edges in the actual output?

Yes there were redundant edges, sorry for the weird test.

Yes there were redundant edges, sorry for the weird test.

How was X generated? Can you share it or code to re-create it?

Now I can't remember, but it's either PCA or scVI embeddings of a single-cell dataset, sorry.

It's hard to debug without an example...I don't know how to reproduce this locally.

If it crops up again (e.g. in the issue linked above) and there is example data to try I can look into it.

I'll try to reproduce it, but as you can see in the test above, and based on my memory, the self edge was repeated, which seems like a very specific kind of bug as it's not a random edge.

I'll try to reproduce it, but as you can see in the test above, and based on my memory, the self edge was repeated, which seems like a very specific kind of bug as it's not a random edge.

The first index in each row is not guaranteed to be the self-edge, so the test above does not show this for sure. That's why it would be good to have a reproducible example to see what happened.