YosefLab/scib-metrics

Benchmarking large-scale integration can't be accelerated

Closed this issue · 18 comments

I have run the tutorial notebook https://scib-metrics.readthedocs.io/en/stable/notebooks/large_scale.html, but after three hours it had computed 0% of the KNN neighbors in the function faiss_brute_force_nn. I installed faiss-gpu without any errors.

How many cells do you have? Can you provide all the code you've run and your environment/compute details?

> How many cells do you have? Can you provide all the code you've run and your environment/compute details?

I ran the lung atlas dataset, which contains an AnnData object with n_obs × n_vars = 892296 × 17811

-----
anndata     0.8.0
scanpy      1.9.1
-----
PIL                 9.4.0
asttokens           NA
backcall            0.2.0
colorama            0.4.6
comm                0.1.2
cycler              0.10.0
cython_runtime      NA
dateutil            2.8.2
debugpy             1.6.6
decorator           5.1.1
executing           1.2.0
google              NA
h5py                3.8.0
igraph              0.10.4
ipykernel           6.21.1
jedi                0.18.2
joblib              1.2.0
kiwisolver          1.4.4
leidenalg           0.9.1
llvmlite            0.39.1
louvain             0.8.0
matplotlib          3.6.3
mpl_toolkits        NA
natsort             8.2.0
numba               0.56.4
numexpr             2.8.4
numpy               1.22.3
packaging           23.0
pandas              1.5.3
parso               0.8.3
pexpect             4.8.0
pickleshare         0.7.5
pkg_resources       NA
platformdirs        3.0.0
prompt_toolkit      3.0.36
psutil              5.9.4
ptyprocess          0.7.0
pure_eval           0.2.2
pydev_ipython       NA
pydevconsole        NA
pydevd              2.9.5
pydevd_file_utils   NA
pydevd_plugins      NA
pydevd_tracing      NA
pygments            2.14.0
pyparsing           3.0.9
pytz                2022.7.1
scipy               1.10.0
session_info        1.0.0
setuptools          65.6.3
six                 1.16.0
sklearn             1.2.1
stack_data          0.6.2
texttable           1.6.7
threadpoolctl       3.1.0
tornado             6.2
traitlets           5.9.0
typing_extensions   NA
wcwidth             0.2.6
yaml                6.0
zipp                NA
zmq                 25.0.0
zoneinfo            NA
-----
IPython             8.9.0
jupyter_client      8.0.2
jupyter_core        5.2.0
-----
Python 3.9.16 (main, Jan 11 2023, 16:05:54) [GCC 11.2.0]
Linux-4.4.0-210-generic-x86_64-with-glibc2.23
-----
Session information updated at 2023-02-28 09:04


Can you provide your GPU details?

> Can you provide your GPU details?

[screenshot of GPU details]

Your CUDA version looks quite old. Are you sure faiss-gpu and/or JAX can see the GPU?

I am not sure. How can I test whether faiss-gpu and JAX are actually being used when executing the function faiss_brute_force_nn?
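One quick way to check is to ask both libraries what devices they can see. The helper below is a sketch (the function name is mine): `jax.devices()` lists the accelerators JAX will use, and faiss-gpu builds expose `get_num_gpus()`.

```python
def gpu_visibility():
    """Report which GPUs JAX and faiss can see (sketch; helper name is mine)."""
    info = {}
    try:
        import jax
        # On a working CUDA install these should be GPU devices, not "cpu".
        info["jax_devices"] = [d.platform for d in jax.devices()]
    except Exception as err:
        info["jax_devices"] = f"unavailable: {err}"
    try:
        import faiss
        # faiss-gpu builds expose get_num_gpus(); CPU-only builds may not.
        info["faiss_gpus"] = faiss.get_num_gpus() if hasattr(faiss, "get_num_gpus") else 0
    except Exception as err:
        info["faiss_gpus"] = f"unavailable: {err}"
    return info

print(gpu_visibility())
```

If `jax_devices` only shows `cpu`, or `faiss_gpus` is 0, the GPU build is not being picked up.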

wehos commented

Hello, I have encountered a similar issue. My CUDA version is 11.8. When running faiss_brute_force_nn, I can see GPU memory being allocated but zero utilization. It seems faiss is not running correctly on my side, and I have no clue how to address it.

The overall pipeline is also much slower than I expected. After switching to pynndescent, I managed to finish the prepare step (kNN building), but the benchmark step takes 5 hours to evaluate one model, which is much slower than reported in the tutorial. I suspect running speed correlates with the latent dimension. Could you please share the dimension of your adata.obsm["X_scVI"]?

These are about 20 dim, though I wouldn't expect something like 100dim to be much slower.

Can you print the output of

import jax
jax.devices()

?

> Hello, I have encountered a similar issue. My CUDA version is 11.8. When running faiss_brute_force_nn, I can see GPU memory being allocated but zero utilization. It seems faiss is not running correctly on my side, and I have no clue how to address it.
>
> The overall pipeline is also much slower than I expected. After switching to pynndescent, I managed to finish the prepare step (kNN building), but the benchmark step takes 5 hours to evaluate one model, which is much slower than reported in the tutorial. I suspect running speed correlates with the latent dimension. Could you please share the dimension of your adata.obsm["X_scVI"]?

I have the same issue as you. When I ran faiss_hnsw_nn instead of faiss_brute_force_nn, the KNN calculation was very fast, but scib_metrics throws an error when calculating the metrics.

wehos commented

> These are about 20 dimensions, though I wouldn't expect something like 100 dimensions to be much slower.
>
> Can you print the output of
>
> import jax
> jax.devices()
>
> ?

I reinstalled jax, and now it properly displays my GPU information:
>>> jax.devices()
[StreamExecutorGpuDevice(id=0, process_index=0, slice_index=0), StreamExecutorGpuDevice(id=1, process_index=0, slice_index=0), StreamExecutorGpuDevice(id=2, process_index=0, slice_index=0), StreamExecutorGpuDevice(id=3, process_index=0, slice_index=0), StreamExecutorGpuDevice(id=4, process_index=0, slice_index=0), StreamExecutorGpuDevice(id=5, process_index=0, slice_index=0), StreamExecutorGpuDevice(id=6, process_index=0, slice_index=0), StreamExecutorGpuDevice(id=7, process_index=0, slice_index=0)]

However, the GPU usage rate for the nearest neighbor algorithm is still zero.

wehos commented

> Hello, I have encountered a similar issue. My CUDA version is 11.8. When running faiss_brute_force_nn, I can see GPU memory being allocated but zero utilization. It seems faiss is not running correctly on my side, and I have no clue how to address it.
> The overall pipeline is also much slower than I expected. After switching to pynndescent, I managed to finish the prepare step (kNN building), but the benchmark step takes 5 hours to evaluate one model, which is much slower than reported in the tutorial. I suspect running speed correlates with the latent dimension. Could you please share the dimension of your adata.obsm["X_scVI"]?

> I have the same issue as you. When I ran faiss_hnsw_nn instead of faiss_brute_force_nn, the KNN calculation was very fast, but scib_metrics throws an error when calculating the metrics.

Indeed. I just tried faiss_hnsw_nn as well. It's fast since it's an approximate nearest neighbor (ANN) method. However, when calculating metrics, it throws errors including:

(1) Loaded runtime CuDNN library: 8.3.3 but source was compiled with: 8.6.0. CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.

(2) File /egr/research-dselab/wenhongz/miniconda3/envs/scib/lib/python3.9/site-packages/jax/_src/dispatch.py:1030, in backend_compile(backend, built_c, options, host_callbacks)
1025 return backend.compile(built_c, compile_options=options,
1026 host_callbacks=host_callbacks)
1027 # Some backends don't have host_callbacks option yet
1028 # TODO(sharadmv): remove this fallback when all backends allow compile
1029 # to take in host_callbacks
-> 1030 return backend.compile(built_c, compile_options=options)

XlaRuntimeError: INTERNAL: RET_CHECK failure (external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_compiler.cc:627) dnn != nullptr

I am trying to resolve them.

==================== Update ====================

After resolving the JAX error, it now throws:
ValueError: Each cell must have the same number of neighbors.

I suppose the metric does not support ANNs like HNSW.
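That error message suggests the returned kNN graph is ragged: HNSW-style searches can return fewer than k hits per query, and faiss pads the missing entries with -1. A quick sanity check on the raw result (a hedged sketch; the helper name is mine):

```python
import numpy as np

def has_uniform_neighbors(indices, k):
    """True if every row has exactly k valid neighbor ids (faiss pads misses with -1)."""
    indices = np.asarray(indices)
    return indices.ndim == 2 and indices.shape[1] == k and not (indices < 0).any()

# Example: a padded row signals a ragged graph that the metrics would reject.
good = np.array([[0, 1], [1, 0]])
bad = np.array([[0, 1], [1, -1]])
```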

@wehos I really can't help with debugging the faiss installation. I can say installing with conda worked for me and it's using the GPU.

I would like to focus this issue on the metrics potentially being slow.

@wehos can you share the dimension of the latent arrays? The time you see in the tutorial is from a single RTX 3090 GPU, and the latents were about 20-dimensional.

I opened up allowing custom nearest neighbors methods so that one could use any method. RAPIDS also has GPU-accelerated nearest neighbors if you can manage to install it.

wehos commented

> @wehos I really can't help with debugging the faiss installation. I can say installing with conda worked for me and it's using the GPU.
>
> I would like to focus this issue on the metrics potentially being slow.
>
> @wehos can you share the dimension of the latent arrays? The time you see in the tutorial is from a single RTX 3090 GPU, and the latents were about 20-dimensional.

Hi @adamgayoso. I just reproduced the original tutorial. Though I was still not able to accelerate the kNN step, pynndescent takes 43 minutes and the whole run takes 122 minutes. Although that is still far slower than the original result, I tend to believe the gap is due to CPU performance differences.

Regarding my previous report (5 hours to evaluate one model), it was probably due to temporary congestion on the server CPUs.
My apologies for the rushed report.

I think it's great to receive the reports, I just want to understand fully what's happening :)

For the kNN step, you can write a method that uses RAPIDS, as I linked above, if you are having trouble with faiss.

Came across this due to the RAPIDS mention. RAPIDS cuML now provides experimental support for CPU execution for an initial set of estimators (including NearestNeighbors). You can install and prototype on a laptop or other machine without an NVIDIA GPU by installing the cuml-cpu package and then use the same code when you have access to a GPU by installing the cuml package. The cuML documentation now includes an example notebook and the v23.02 release blog has more information.
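Since cuML mirrors scikit-learn's estimator API, a custom neighbors function can be prototyped on CPU with sklearn and later switched to the GPU with a one-line import change. A sketch (the function name and wiring are my assumptions, not a documented scib-metrics API):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
# With RAPIDS installed, swap to the GPU version with the same interface:
# from cuml.neighbors import NearestNeighbors

def sklearn_style_nn(X, k):
    """Exact kNN via the sklearn/cuML estimator interface (sketch)."""
    nn = NearestNeighbors(n_neighbors=k, algorithm="brute").fit(X)
    # Querying the training data itself: each row's first hit is the point itself.
    distances, indices = nn.kneighbors(X)
    return distances, indices

X = np.random.RandomState(0).rand(100, 20)
distances, indices = sklearn_style_nn(X, 15)
```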

@adamgayoso Hey, we just ran into this problem as well using this fine package.
I noticed a potential issue in the faiss code provided in the tutorial notebook:
both faiss kNN functions contain these three lines of code

  gpu_index = faiss.index_cpu_to_gpu(res, 0, index)
  index.add(X)
  distances, indices = index.search(X, k)

To properly run this with GPU acceleration, shouldn't the second and third lines also reference gpu_index rather than index?
At least for us, this fixes the issue of no GPU utilization when using faiss_brute_force_nn() and dramatically speeds up the computation.

Thanks @le-ander this should be fixed in #92