YosefLab/scib-metrics

ValueError: Each cell must have the same number of neighbors.

ernohanninen opened this issue · 18 comments

Hi,
I've implemented scib-metrics in my benchmarking pipeline. I tested the pipeline by subsetting three batches from my dataset and everything worked fine. However, when running the pipeline with exactly the same settings but on the entire dataset, I encountered an error that I don't know how to solve.

Error message:

ValueError                                Traceback (most recent call last)
Cell In [3], line 31
     24 bm = Benchmarker(
     25     adata,
     26     batch_key=batch,
   (...)
     29     n_jobs=6,
     30 )
---> 31 bm.benchmark()

File ~/.conda/envs/PY_env/lib/python3.10/site-packages/scib_metrics/benchmark/_core.py:227, in Benchmarker.benchmark(self)
    224 if isinstance(use_metric_or_kwargs, dict):
    225     # Kwargs in this case
    226     metric_fn = partial(metric_fn, **use_metric_or_kwargs)
--> 227 metric_value = getattr(MetricAnnDataAPI, metric_name)(ad, metric_fn)
    228 # nmi/ari metrics return a dict
    229 if isinstance(metric_value, dict):

File ~/.conda/envs/PY_env/lib/python3.10/site-packages/scib_metrics/benchmark/_core.py:88, in MetricAnnDataAPI.<lambda>(ad, fn)
     86 nmi_ari_cluster_labels_kmeans = lambda ad, fn: fn(ad.X, ad.obs[_LABELS])
     87 silhouette_label = lambda ad, fn: fn(ad.X, ad.obs[_LABELS])
---> 88 clisi_knn = lambda ad, fn: fn(ad.obsp["90_distances"], ad.obs[_LABELS])
     89 graph_connectivity = lambda ad, fn: fn(ad.obsp["15_distances"], ad.obs[_LABELS])
     90 silhouette_batch = lambda ad, fn: fn(ad.X, ad.obs[_LABELS], ad.obs[_BATCH])

File ~/.conda/envs/PY_env/lib/python3.10/site-packages/scib_metrics/_lisi.py:98, in clisi_knn(X, labels, perplexity, scale)
     74 """Compute the cell-type local inverse simpson index (cLISI) for each cell :cite:p:`korsunsky2019harmony`.
     75 
     76 Returns a scaled version of the cLISI score for each cell, by default :cite:p:`luecken2022benchmarking`.
   (...)
     95     Array of shape (n_cells,) with the cLISI score for each cell.
     96 """
     97 labels = np.asarray(pd.Categorical(labels).codes)
---> 98 lisi = lisi_knn(X, labels, perplexity=perplexity)
     99 clisi = np.nanmedian(lisi)
    100 if scale:

File ~/.conda/envs/PY_env/lib/python3.10/site-packages/scib_metrics/_lisi.py:29, in lisi_knn(X, labels, perplexity)
      9 """Compute the local inverse simpson index (LISI) for each cell :cite:p:`korsunsky2019harmony`.
     10 
     11 Parameters
   (...)
     26     Array of shape (n_cells,) with the LISI score for each cell.
     27 """
     28 labels = np.asarray(pd.Categorical(labels).codes)
---> 29 knn_dists, knn_idx = convert_knn_graph_to_idx(X)
     31 if perplexity is None:
     32     perplexity = np.floor(knn_idx.shape[1] / 3)

File ~/.conda/envs/PY_env/lib/python3.10/site-packages/scib_metrics/utils/_utils.py:58, in convert_knn_graph_to_idx(X)
     56 n_neighbors = np.unique(X.nonzero()[0], return_counts=True)[1]
     57 if len(np.unique(n_neighbors)) > 1:
---> 58     raise ValueError("Each cell must have the same number of neighbors.")
     60 n_neighbors = int(np.unique(n_neighbors)[0])
     61 with warnings.catch_warnings():

ValueError: Each cell must have the same number of neighbors.

Here is my session info:

-----
h5py                3.7.0
pandas              1.4.4
plottable           0.1.5
pymde               0.1.18
scanpy              1.9.1
scib                1.0.4
scib_metrics        0.1.1
scipy               1.9.3
scvi                0.19.0
session_info        1.0.0
-----
Modules imported as dependencies
-----
IPython             8.6.0
jupyter_client      7.4.7
jupyter_core        4.11.2
jupyterlab          3.5.0
notebook            6.5.2
-----
Python 3.10.8 (main, Nov 24 2022, 14:13:03) [GCC 11.2.0]
Linux-3.10.0-1160.76.1.el7.x86_64-x86_64-with-glibc2.17
-----

There's an issue with the nearest-neighbor library that can result in redundant neighbors. As a temporary fix, you might try randomly subsampling the data to, e.g., 95% of the cells until it works. Sorry I don't have a more satisfying fix.
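
A minimal sketch of that subsampling workaround with scanpy (assuming scanpy is already in the environment; sc.pp.subsample keeps a random fraction of cells in place):

import scanpy as sc

# Keep a random 95% of cells; try another random_state if the error persists.
sc.pp.subsample(adata, fraction=0.95, random_state=0)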

lmcinnes/pynndescent#135

I might also consider switching to faiss or annoy over pynndescent.

Please see the new tutorial here:

https://scib-metrics.readthedocs.io/en/stable/notebooks/large_scale.html

Now you can plug in any neighbor method you would like. faiss-gpu is nice because it's GPU-accelerated.
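
For example, here is a rough sketch of a faiss brute-force neighbor function in the spirit of that tutorial (the NeighborsOutput return type and the prepare(neighbor_computer=...) hook follow the tutorial as of this version and may differ in later releases):

import faiss
import numpy as np
from scib_metrics.nearest_neighbors import NeighborsOutput

def faiss_brute_force_nn(X: np.ndarray, k: int):
    # Exact (brute-force) k-NN on the GPU; faiss.IndexFlatL2 returns squared distances.
    X = np.ascontiguousarray(X, dtype=np.float32)
    res = faiss.StandardGpuResources()
    index = faiss.index_cpu_to_gpu(res, 0, faiss.IndexFlatL2(X.shape[1]))
    index.add(X)
    distances, indices = index.search(X, k)
    return NeighborsOutput(indices=indices, distances=np.sqrt(distances))

bm.prepare(neighbor_computer=faiss_brute_force_nn)
bm.benchmark()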

Thank you! I'll give it a shot.

Hi,
is there an update on this issue? I ran the code from the large-scale tutorial with my own adata and received the same error, even after randomly subsampling the data multiple times. My environment is:

anndata             0.8.0
faiss               1.7.2
numpy               1.23.5
scanpy              1.9.1
scib_metrics        0.2.0
torch               1.13.1+cu117

Interestingly, the lung example works but is not scalable. I tested that with a small subsample of my adata.
Thanks!

@richtertill are you using faiss brute force neighbors on GPU?

Interestingly, the lung example works but is not scalable. I tested that with a small subsample of my adata.

Can you define "not scalable"? Are you using a gpu? What's your compute environment?

@adamgayoso

are you using faiss brute force neighbors on GPU?

Yes, I'm using the faiss brute-force neighbors as described in the large-scale tutorial, on a GPU: a Tesla V100 (32 GB) with 75 GB of CPU memory.

Can you define "not scalable"?

I haven't tested it extensively, but it didn't work with many cells. Though I'd have to check the limit if that's of interest.

Do you have JAX installed for your GPU?

I haven't tested it extensively, but it didn't work with many cells. Though I'd have to check the limit if that's of interest.

Didn't work is different from not being scalable, no? What's not working? Detailing this would be helpful.

For reference, you can see in the tutorial that I ran about 900k cells in 48 minutes (on an RTX 3090).

Also this error can occur if you have two cells with the same exact embedding

Furthermore,

bm = Benchmarker(
    adata,
    batch_key="batch",
    label_key="cell_type",
    embedding_obsm_keys=["Unintegrated", "Scanorama", "Harmony"],
    n_jobs=6,
)
bm.benchmark()

works for me on the lung example tutorial on my MacBook Pro in 2 minutes.

As this package is in early development, I would encourage always checking for updates; we are at v0.3.1 now.

Updating the package to 0.3.1 solved the issue for me. Thanks!

Also this error can occur if you have two cells with the same exact embedding

Hi,

I'm running into the same error with my data, and I do find that some cells have the same embeddings. How can I solve this problem? Can I just delete these cells?

for embed in ["unintegrated", "scANVI", "scVI", "CCA", "BBKNN", "Harmony", "LIGER", "Scanorama"]:
  print(embed)
  print(adata.obsm[embed].shape)
  print(np.unique(adata.obsm[embed], axis=0).shape)

And I got

unintegrated
(29935, 2)
(29935, 2)
scANVI
(29935, 30)
(29931, 30)
scVI
(29935, 30)
(29931, 30)
CCA
(29935, 40)
(29922, 40)
BBKNN
(29935, 2)
(29935, 2)
Harmony
(29935, 2)
(29935, 2)
LIGER
(29935, 30)
(29923, 30)
Scanorama
(29935, 100)
(29922, 100)

Thank you!

Also this error can occur if you have two cells with the same exact embedding

Hi,

I'm running into the same error with my data, and I do find that some cells have the same embeddings. How can I solve this problem? Can I just delete these cells?

I guess dropping the duplicate cells would be acceptable, as you only have at most 13/29935 ≈ 0.04% duplicates.
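
If you do drop them, a minimal sketch of keeping only the first occurrence of each duplicated row for one embedding (scANVI here; the other obsm keys would be handled the same way):

import numpy as np

# Keep the first occurrence of each unique embedding row.
_, unique_idx = np.unique(adata.obsm["scANVI"], axis=0, return_index=True)
adata = adata[np.sort(unique_idx)].copy()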

I'm running into this error and have already removed duplicate embeddings, subsampled, and am using faiss instead of pynndescent. Any more pointers would be appreciated!

I'm running into this error and have already removed duplicate embeddings, subsampled, and am using faiss instead of pynndescent. Any more pointers would be appreciated!

I used the CPU to calculate the neighbors instead of the faiss-based functions, and it works in my case.
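
For reference, a rough sketch of what that might look like, assuming the prepare() hook from the large-scale tutorial (when no neighbor_computer is passed, the built-in pynndescent-based CPU search should be used):

# Sketch: omit neighbor_computer so the default pynndescent-based (CPU)
# neighbor search is used instead of the faiss functions.
bm.prepare()
bm.benchmark()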

Hm, switching to pynndescent worked!

Hm, switching to pynndescent worked!

I would be grateful if you could share the code you used to switch to pynndescent.