seung-lab/connected-components-3d

List of label indices?

atpolonsky opened this issue · 7 comments

Is there a way to return a list of tuples/arrays containing the indices of each label within the 3D array? Other approaches I've implemented using numpy or pandas get quite memory intensive and are slow even in vectorized form. I was hoping such a list might already be accessible through this module, created while the labelling algorithm executes. Your cc3d labelling has no issue handling a volume on the order of 3.5 billion voxels with ~40 million unique labels, but getting the indices of each label to perform subsequent operations has proved challenging.

If I understand correctly, you'd like to convert the array into a point cloud? If you want to return every voxel as a 3-tuple, then while there might be some leeway by playing with the data type, you should expect to use at least 3x the size of the array, because you'll need 3 integers to represent the position of each voxel, whereas before the position was stored implicitly using a single integer index.
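To make that lower bound concrete, here's a back-of-the-envelope sketch (my own illustration, assuming uint32 labels and 32-bit coordinates for the ~3.5 billion voxel volume mentioned above):

```python
import numpy as np

voxels = 3_500_000_000  # ~3.5 billion voxels, as in the volume above

# implicit positions: one uint32 label per voxel
label_bytes = voxels * np.dtype(np.uint32).itemsize

# explicit positions: three uint32 coordinates (z, y, x) per voxel
cloud_bytes = voxels * 3 * np.dtype(np.uint32).itemsize

print(f"labels array: {label_bytes / 1e9:.0f} GB")  # ~14 GB
print(f"point cloud:  {cloud_bytes / 1e9:.0f} GB")  # ~42 GB
```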

Yes, you could think of it as a point cloud, where each label location is represented by a 3-tuple. I'm looking to replace this kind of operation, shown below in its simplest, least efficient implementation:

```python
import numpy as np
import cc3d

connectivity = 6
labels_out = cc3d.connected_components(data, connectivity=connectivity)
labels_idx = []
for i in range(1, np.max(labels_out) + 1):
    # each comparison rescans the whole volume: O(voxels * labels)
    idx = np.argwhere(labels_out == i)
    labels_idx.append(idx)
```

I ran your example code against 128 MVx of data and it looks like it will take about three hours to run.


This sounds like a good candidate for a new fastremap function. Point clouds are used a lot in my field; it's a little crazy that they're so expensive to generate.

This is a rough implementation, but maybe you'll be able to use it?

https://github.com/seung-lab/fastremap/tree/wms_point_cloud

Your function does work, though it appears to leave a thread hanging, so I needed to kill the Python script after using it. I tested it against a vectorized approach I found here: https://stackoverflow.com/a/54736464.

I subbed in your version of fastremap.unique in place of numpy's, and with a test volume of 500 million voxels and 3000 labels, the linked function above uses ~25 GB of RAM and runs in 55 seconds. The fastremap point_cloud function uses ~60 GB of RAM (borrowing about half of this from virtual memory on my hardware) and runs in 100 seconds.
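For context, here is a rough sketch of the kind of sort-based grouping the linked answer uses (the function name `point_clouds` and the details are my own, not the exact code from the answer; fastremap.unique can be subbed in for np.unique as described):

```python
import numpy as np
# import fastremap  # optionally swap fastremap.unique in for np.unique below

def point_clouds(labels):
    """Group voxel coordinates by label with a single argsort,
    in the spirit of the linked StackOverflow answer."""
    flat = labels.ravel()
    order = np.argsort(flat, kind="stable")  # one O(N log N) sort
    uniq, counts = np.unique(flat, return_counts=True)
    # split the sorted flat indices into one contiguous run per label
    runs = np.split(order, np.cumsum(counts)[:-1])
    # convert flat indices back into (z, y, x) coordinate rows
    return {
        label: np.column_stack(np.unravel_index(run, labels.shape))
        for label, run in zip(uniq, runs)
    }
```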

I agree that this kind of function is definitely useful in my field as well, and from looking into it I think others would benefit from a faster implementation too. Either way, it's a dramatic improvement over the simple for loop above.

It's been a while, but I am releasing fastremap 1.12.0, which will include a point_cloud function that should be much faster and lower memory than what I previously showed you, maybe even better than the numpy vectorized approach. The measuring tools on my new arm64 laptop are not reliable at giving memory estimates, so it's difficult for me to make more quantitative statements.
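For anyone landing here later, a minimal usage sketch (the exact return type of point_cloud is my assumption; check the fastremap docs):

```python
import numpy as np
import cc3d
import fastremap  # >= 1.12.0

data = np.random.randint(0, 2, size=(64, 64, 64), dtype=np.uint8)
labels_out = cc3d.connected_components(data, connectivity=6)

# expected: a mapping from each label to an (N, 3) array of voxel
# coordinates (exact return type may differ)
ptc = fastremap.point_cloud(labels_out)
```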

Closing issue due to inactivity. Please reopen if you want to discuss further!