slowkow/harmonypy

ValueError: operands could not be broadcast together with shapes

DennisPost10 opened this issue · 9 comments

Hi there,
I ran harmonypy via the scanpy external API on 12 different data sets, and it worked for 11 of them.
Unfortunately, for one of them (containing 1974 cells) I get the following error.

  File "/nfs/home/students/postd/.local/lib/python3.6/site-packages/harmonypy/harmony.py", line 126, in run_harmony
    epsilon_cluster, epsilon_harmony, nclust, block_size, lamb_mat, verbose
  File "/nfs/home/students/postd/.local/lib/python3.6/site-packages/harmonypy/harmony.py", line 172, in __init__
    self.init_cluster()
  File "/nfs/home/students/postd/.local/lib/python3.6/site-packages/harmonypy/harmony.py", line 195, in init_cluster
    self.R = self.R / self.sigma[:,None]
ValueError: operands could not be broadcast together with shapes (64,1974) (66,1) 

I tried again and got the same error with different shapes:
ValueError: operands could not be broadcast together with shapes (65,1974) (66,1)

I extracted the minimal code needed to reproduce the error and find its cause.

km = kmeans(self.Z_cos.T, self.K, iter=10)
self.Y = km[0].T

I found that kmeans does not always return the expected number (self.K) of clusters (probably due to centroids with no assigned observations?), which leaves self.Y smaller than expected, so self.R.shape[0] does not match the length of self.sigma.
This ultimately leads to the errors above.
I guess I can just set 'nclust' to a smaller number as a param, but maybe I am not the only one with this problem.
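
Here is a minimal standalone sketch (with random made-up data, not my actual dataset) that shows the same behaviour:

import numpy as np
from scipy.cluster.vq import kmeans

# Made-up data with many duplicate rows, so some centroids are likely
# to end up with no assigned observations and get dropped by kmeans().
rng = np.random.default_rng(0)
data = np.repeat(rng.normal(size=(5, 2)), 50, axis=0)

codebook, distortion = kmeans(data, 20, iter=10)
print(codebook.shape[0])  # usually < 20, i.e. fewer than requested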

Best,
Dennis

Hi Dennis, thanks for opening this issue! And thanks for investigating which lines might be the root cause of the error.

Did you find that setting nclust to a smaller number resolves your issue?

Have you considered whether the 12th dataset is representative of any of the other 11 datasets? For example, have you looked at a clustering result for each dataset independently?

Could I please ask if you have any proposal for what you think might be the best modification to the code? For example, would it be helpful to add a check for how many clusters we got from kmeans(), and then report an error to the user along with a suggestion that they try a smaller value for nclust?
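
Something like this rough (untested) sketch of a guard around the existing call in init_cluster():

km = kmeans(self.Z_cos.T, self.K, iter=10)
# Hypothetical check: kmeans() may drop centroids with no assigned
# observations, so verify that we actually got self.K of them.
if km[0].shape[0] < self.K:
    raise ValueError(
        f"kmeans returned {km[0].shape[0]} centroids instead of the "
        f"requested {self.K}; try a smaller value for nclust"
    )
self.Y = km[0].T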

Hey there,
I think I have the same problem here, even after tuning nclust down: the returned shape is always smaller than the defined nclust.
I think the problem is in the centroid initialization (line 88); it always returns fewer cluster centers than requested.
For example,

km = kmeans(self.Z_cos.T, 100, iter=10)
km[0].shape[0]  # 87, i.e. fewer than the requested 100

Any idea how to fix it?
Thank you so much!

Thank you @onionpork for bringing this up again.

The documentation for scipy.cluster.vq.kmeans describes the objects returned by kmeans():

Returns:
    codebook : ndarray
        A k by N array of k centroids. The ith centroid codebook[i] is represented with the code i. The centroids and codes generated represent the lowest distortion seen, not necessarily the globally minimal distortion. Note that the number of centroids is not necessarily the same as the k_or_guess parameter, because centroids assigned to no observations are removed during iterations.
    distortion : float
        The mean (non-squared) Euclidean distance between the observations passed and the centroids generated. Note the difference to the standard definition of distortion in the context of the k-means algorithm, which is the sum of the squared distances.

So, the documentation is consistent with your experience: the kmeans function does not always return k centroids.

I think the best solution would be to choose a different implementation of kmeans that returns a consistent output. I might consider scipy.cluster.vq.kmeans2 or sklearn.cluster.KMeans.
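
For example, kmeans2 should always return exactly K centroids — as far as I can tell, it warns about empty clusters instead of dropping them. A quick sketch with random data for illustration:

import numpy as np
from scipy.cluster.vq import kmeans2

# Random data just for illustration.
rng = np.random.default_rng(0)
data = rng.normal(size=(1974, 20))

# Unlike kmeans(), kmeans2() keeps empty clusters, so the codebook
# always has exactly K rows.
centroid, label = kmeans2(data, 100, minit='++', seed=0)
print(centroid.shape[0])  # 100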

@onionpork @DennisPost10 I think the issue should be fixed by commit 2fd234e

I would be grateful if you could confirm that the new code works for your application.

@onionpork and @DennisPost10

Please feel free to comment on this issue if you would like to share your experience with the new code. Thanks.

bahnk commented

Hey @slowkow,

Just had the same issue with the pip version that doesn't use kmeans2. But it works well with the git version that uses kmeans2.

Cheers,

Thank you @bahnk for confirming that we should switch from kmeans() to kmeans2().

I'll make a new release soon.

Hi! I'm running harmonypy as part of panpipes and I am still facing this issue with harmonypy version 0.0.9:

2024-03-20 15:35:23,333 - harmonypy - INFO - Computing initial centroids with sklearn.KMeans...
2024-03-20 15:35:36,417 - harmonypy - INFO - sklearn.KMeans initialization complete.
Traceback (most recent call last):
  File "[...]/panpipes/panpipes/python_scripts/batch_correct_harmony.py", line 110, in <module>
    ho = hm.run_harmony(adata.obsm[dimred][:,0:int(args.harmony_npcs)],
  File "[...]/conda/skylake/envs/panpipes_gpu/lib/python3.10/site-packages/harmonypy/harmony.py", line 124, in run_harmony
    ho = Harmony(
  File "[...]/conda/skylake/envs/panpipes_gpu/lib/python3.10/site-packages/harmonypy/harmony.py", line 172, in __init__
    self.init_cluster()
  File "[...]/conda/skylake/envs/panpipes_gpu/lib/python3.10/site-packages/harmonypy/harmony.py", line 207, in init_cluster
    self.compute_objective()
  File "[...]/conda/skylake/envs/panpipes_gpu/lib/python3.10/site-packages/harmonypy/harmony.py", line 219, in compute_objective
    w = np.dot(y * z, self.Phi)
ValueError: operands could not be broadcast together with shapes (100,33) (100,34)

Might it have to do with the switch away from kmeans2()? The code now uses:

from sklearn.cluster import KMeans

@wlason

Feel free to switch to kmeans2 if you want to check if that might solve your issue. The test demonstrates how you can do that:

from scipy.cluster.vq import kmeans2
import harmonypy as hm

def cluster_fn(data, K):
    centroid, label = kmeans2(data, K, minit='++', seed=0)
    return centroid

def run(cluster_fn):
    # data_mat and meta_data are the fixtures used in the test suite.
    ho = hm.run_harmony(data_mat,
                        meta_data, ['donor'],
                        max_iter_harmony=2,
                        max_iter_kmeans=2,
                        cluster_fn=cluster_fn)
    return ho.Z_corr

If you want more help, please consider sharing a minimal reproducible example, so others might have a chance to reproduce the error that you are facing. We don't need your data, any random data would be fine. But if this issue depends on your data, then you might consider sharing it privately with me so I can use it to debug.

I do not know if sklearn.cluster.KMeans is guaranteed to return the number of clusters that we asked for. I hope so, but I am not certain.

We switched to KMeans for faster execution time. But it seems that all of the implementations we've tried so far have some limitations. If you can suggest a specific implementation of a k-means algorithm that you think might be suitable, that would also be helpful.
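
For reference, here is a hypothetical cluster_fn built on sklearn (the n_init and random_state values are illustrative choices, not harmonypy defaults). As far as I know, sklearn re-seeds empty clusters rather than dropping them, so it should always return exactly K centers:

from sklearn.cluster import KMeans

# Hypothetical wrapper: returns the (K, N) array of cluster centers
# that harmonypy expects from a cluster_fn.
def sklearn_cluster_fn(data, K):
    model = KMeans(n_clusters=K, init='k-means++', n_init=10,
                   random_state=0).fit(data)
    return model.cluster_centers_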

Good luck!