Benchmarking large-scale integration can't be accelerated
Closed this issue · 18 comments
I have run the tutorial notebook https://scib-metrics.readthedocs.io/en/stable/notebooks/large_scale.html, but after three hours it had computed 0% of the KNN neighbors in the function faiss_brute_force_nn. I have installed faiss-gpu without errors.
How many cells do you have? Can you provide all the code you've run and your environment/compute details?
> How many cells do you have? Can you provide all the code you've run and your environment/compute details?
I ran the lung atlas dataset, which contains an AnnData object with n_obs × n_vars = 892296 × 17811
-----
anndata 0.8.0
scanpy 1.9.1
-----
PIL 9.4.0
asttokens NA
backcall 0.2.0
colorama 0.4.6
comm 0.1.2
cycler 0.10.0
cython_runtime NA
dateutil 2.8.2
debugpy 1.6.6
decorator 5.1.1
executing 1.2.0
google NA
h5py 3.8.0
igraph 0.10.4
ipykernel 6.21.1
jedi 0.18.2
joblib 1.2.0
kiwisolver 1.4.4
leidenalg 0.9.1
llvmlite 0.39.1
louvain 0.8.0
matplotlib 3.6.3
mpl_toolkits NA
natsort 8.2.0
numba 0.56.4
numexpr 2.8.4
numpy 1.22.3
packaging 23.0
pandas 1.5.3
parso 0.8.3
pexpect 4.8.0
pickleshare 0.7.5
pkg_resources NA
platformdirs 3.0.0
prompt_toolkit 3.0.36
psutil 5.9.4
ptyprocess 0.7.0
pure_eval 0.2.2
pydev_ipython NA
pydevconsole NA
pydevd 2.9.5
pydevd_file_utils NA
pydevd_plugins NA
pydevd_tracing NA
pygments 2.14.0
pyparsing 3.0.9
pytz 2022.7.1
scipy 1.10.0
session_info 1.0.0
setuptools 65.6.3
six 1.16.0
sklearn 1.2.1
stack_data 0.6.2
texttable 1.6.7
threadpoolctl 3.1.0
tornado 6.2
traitlets 5.9.0
typing_extensions NA
wcwidth 0.2.6
yaml 6.0
zipp NA
zmq 25.0.0
zoneinfo NA
-----
IPython 8.9.0
jupyter_client 8.0.2
jupyter_core 5.2.0
-----
Python 3.9.16 (main, Jan 11 2023, 16:05:54) [GCC 11.2.0]
Linux-4.4.0-210-generic-x86_64-with-glibc2.23
-----
Session information updated at 2023-02-28 09:04
Can you provide your GPU details?
Your CUDA version looks quite old. Are you sure faiss-gpu and/or JAX can see the GPU?
I am not sure. How can I test whether faiss-gpu and JAX are actually being used when executing the function faiss_brute_force_nn?
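One low-tech way to check is to ask each library directly what it can see before running anything heavy. This is a hedged sketch; `gpu_visibility_report` is a hypothetical helper, not part of scib-metrics:

```python
# Hypothetical helper (not part of scib-metrics): report whether faiss and
# JAX can see a GPU before running faiss_brute_force_nn.
def gpu_visibility_report():
    report = {}
    try:
        import faiss
        # faiss-gpu builds expose get_num_gpus(); CPU-only builds report 0
        report["faiss_gpus"] = faiss.get_num_gpus() if hasattr(faiss, "get_num_gpus") else 0
    except ImportError:
        report["faiss_gpus"] = None  # faiss not installed
    try:
        import jax
        # platform is "gpu" for GPU devices and "cpu" for CPU fallbacks
        report["jax_platforms"] = sorted({d.platform for d in jax.devices()})
    except ImportError:
        report["jax_platforms"] = None  # jax not installed
    return report

print(gpu_visibility_report())
```

If `faiss_gpus` comes back as 0 or `jax_platforms` is `['cpu']`, the corresponding library is silently falling back to the CPU even though the GPU package is installed.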
Hello, I have encountered a similar issue. My CUDA version is 11.8. When running faiss_brute_force_nn, I can see GPU memory being occupied but with a zero utilization rate. It seems that faiss cannot run correctly on my side, and I have no clue how to address this.
The overall pipeline is also much slower than I expected. After switching to pynndescent, I managed to finish the prepare step (KNN building), but the benchmark step takes 5 hours to evaluate one model, which is much slower than reported in the tutorial. I suspect the running speed is correlated with the latent dimension. Could you please specify the dimension of your adata.obsm["X_scVI"]?
These are about 20 dimensions, though I wouldn't expect something like 100 dimensions to be much slower.
Can you print the output of `import jax; jax.devices()`?
> Hello, I have encountered a similar issue. My CUDA version is 11.8. When running faiss_brute_force_nn, I can see GPU memory occupation with zero usage rate. It seems like faiss cannot correctly run on my side, while I have no clues to address this issue.
> The overall pipeline is also much slower than I expected. After switching to pynndescent, I managed to finish the prepare step (knn building). While in benchmark step, it takes 5 hours to evaluate one model, which is much slower than reported in the tutorial. I suspect the running speed is correlated with the latent dimension. Could you please kindly specify the dimension of your adata.obsm["X_scVI"]?
I have the same issue as you. When I ran faiss_hnsw_nn instead of faiss_brute_force_nn, computing the KNN was very fast, but scib_metrics throws an error when calculating the metrics.
> These are about 20 dim, though I wouldn't expect something like 100dim to be much slower.
> Can you print the output of `import jax; jax.devices()`?
I reinstalled jax, and now it properly displays my GPU information:
>>> jax.devices()
[StreamExecutorGpuDevice(id=0, process_index=0, slice_index=0), StreamExecutorGpuDevice(id=1, process_index=0, slice_index=0), StreamExecutorGpuDevice(id=2, process_index=0, slice_index=0), StreamExecutorGpuDevice(id=3, process_index=0, slice_index=0), StreamExecutorGpuDevice(id=4, process_index=0, slice_index=0), StreamExecutorGpuDevice(id=5, process_index=0, slice_index=0), StreamExecutorGpuDevice(id=6, process_index=0, slice_index=0), StreamExecutorGpuDevice(id=7, process_index=0, slice_index=0)]
However, the GPU usage rate for the nearest neighbor algorithm is still zero.
> Hello, I have encountered a similar issue. My CUDA version is 11.8. When running faiss_brute_force_nn, I can see GPU memory occupation with zero usage rate. It seems like faiss cannot correctly run on my side, while I have no clues to address this issue.
> The overall pipeline is also much slower than I expected. After switching to pynndescent, I managed to finish the prepare step (knn building). While in benchmark step, it takes 5 hours to evaluate one model, which is much slower than reported in the tutorial. I suspect the running speed is correlated with the latent dimension. Could you please kindly specify the dimension of your adata.obsm["X_scVI"]?

> I have the same issue as you. When I ran faiss_hnsw_nn instead of faiss_brute_force_nn, the speed of calculating KNN is very fast, but scib_metrics throws an error when calculating metrics.
Indeed, I just tried faiss_hnsw_nn as well. It's fast, since it's an approximate nearest neighbor (ANN) method. However, when calculating the metrics it throws errors, including:
(1) Loaded runtime CuDNN library: 8.3.3 but source was compiled with: 8.6.0. CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
(2)
```
File /egr/research-dselab/wenhongz/miniconda3/envs/scib/lib/python3.9/site-packages/jax/_src/dispatch.py:1030, in backend_compile(backend, built_c, options, host_callbacks)
   1025 return backend.compile(built_c, compile_options=options,
   1026                        host_callbacks=host_callbacks)
   1027 # Some backends don't have host_callbacks option yet
   1028 # TODO(sharadmv): remove this fallback when all backends allow compile
   1029 # to take in host_callbacks
-> 1030 return backend.compile(built_c, compile_options=options)

XlaRuntimeError: INTERNAL: RET_CHECK failure (external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_compiler.cc:627) dnn != nullptr
```
I am trying to resolve them.
====================Update==============
After resolving the JAX error, it now throws:
ValueError: Each cell must have the same number of neighbors.
I suppose the metric does not support ANN methods like HNSW.
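For context on where that error can come from: faiss fills missing entries with -1 when an (approximate) index finds fewer than k neighbors for a point, which is exactly the kind of ragged output an exact-k consumer would reject. A minimal sketch of such a check (`validate_knn_indices` is hypothetical, not the actual scib-metrics code):

```python
import numpy as np

# Hypothetical check (not the actual scib-metrics code): faiss can return -1
# as a placeholder when an index finds fewer than k neighbors for a point,
# and a consumer that requires exactly k neighbors per cell rejects that.
def validate_knn_indices(indices):
    indices = np.asarray(indices)
    if (indices < 0).any():
        raise ValueError("Each cell must have the same number of neighbors.")
    return indices

good = np.array([[1, 2], [0, 2], [0, 1]])
validate_knn_indices(good)  # all rows have exactly k=2 valid neighbors

bad = np.array([[1, 2], [0, -1], [0, 1]])  # -1 marks a missing neighbor
try:
    validate_knn_indices(bad)
except ValueError as err:
    print(err)  # Each cell must have the same number of neighbors.
```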
@wehos I really can't help with debugging faiss installation. I can say installing with conda worked for me and it's using the GPU.
I would like to focus this issue on the metrics potentially being slow.
@wehos can you share the dimension of the latent arrays? The time you see in the tutorial is from one RTX 3090 GPU, and the latents were about 20 dim.
I opened up support for custom nearest neighbors methods so that one can use any method. RAPIDS also has GPU-accelerated nearest neighbors if you can manage to install it.
> @wehos I really can't help with debugging faiss installation. I can say installing with conda worked for me and it's using the GPU.
> I would like to focus this issue on the metrics potentially being slow.
> @wehos can you share the dimension of the latent arrays? The time you see on the tutorial is from one RTX3090 gpu and latents were about 20dim
Hi @adamgayoso. I just reproduced the original tutorial. Though I am still not able to accelerate the KNN step, pynndescent takes 43 minutes and overall it runs in 122 minutes. Although this is still far slower than the original result, I tend to believe the gap is due to CPU performance differences.
Regarding my previous report (5 hours to evaluate one model), it was probably due to temporary congestion on the server's CPUs.
My apologies for the rushed report.
I think it's great to receive the reports, I just want to understand fully what's happening :)
For the KNN, you can write a method that uses RAPIDS, as I linked above, if you are having trouble with faiss.
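For anyone following along, here is a rough sketch of what such a pluggable nearest-neighbors backend could look like. With RAPIDS installed you would use cuml.neighbors.NearestNeighbors (which mirrors the sklearn API); the NumPy brute-force body below just keeps the sketch dependency-free, and the `(distances, indices)` return shape is an assumption to check against the scib-metrics docs for your version:

```python
import numpy as np

# Sketch of a custom nearest-neighbors backend. With RAPIDS you would replace
# the body with cuml.neighbors.NearestNeighbors (sklearn-compatible API); a
# plain NumPy brute-force search keeps this sketch dependency-free. The
# (distances, indices) return signature is an assumption, not the library's
# documented interface.
def custom_brute_force_nn(X, k):
    X = np.asarray(X, dtype=np.float64)
    # squared Euclidean distances between all pairs of rows
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    indices = np.argsort(sq, axis=1)[:, :k]
    distances = np.sqrt(np.take_along_axis(sq, indices, axis=1))
    return distances, indices

X = np.random.RandomState(0).normal(size=(200, 20))
dist, ind = custom_brute_force_nn(X, 15)
print(dist.shape, ind.shape)  # (200, 15) (200, 15)
```

Each point's nearest neighbor is itself (distance 0), matching what faiss returns when querying an index with its own training points.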
Came across this due to the RAPIDS mention. RAPIDS cuML now provides experimental support for CPU execution for an initial set of estimators (including NearestNeighbors). You can install and prototype on a laptop or other machine without an NVIDIA GPU by installing the cuml-cpu package, and then use the same code when you have access to a GPU by installing the cuml package. The cuML documentation now includes an example notebook, and the v23.02 release blog has more information.
@adamgayoso Hey, we just ran into this problem as well using this fine package.
I noticed a potential issue in the faiss code provided in the tutorial notebook. Both faiss kNN functions contain these three lines of code:

```python
gpu_index = faiss.index_cpu_to_gpu(res, 0, index)
index.add(X)
distances, indices = index.search(X, k)
```

To properly run this with GPU acceleration, shouldn't the second and third lines also reference gpu_index rather than index?
At least for us, this seems to fix the issue of no GPU utilization when using faiss_brute_force_nn() and dramatically speeds up the computation.
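For later readers, a hedged sketch of what the corrected helper could look like. The GPU path applies the fix described above (routing add/search through the handle returned by index_cpu_to_gpu); the NumPy fallback is not part of the tutorial code and exists only so the sketch runs without faiss:

```python
import numpy as np

def brute_force_nn(X, k):
    """Exact kNN; uses faiss (on GPU when visible), else a NumPy fallback."""
    X = np.ascontiguousarray(X, dtype=np.float32)
    try:
        import faiss
        index = faiss.IndexFlatL2(X.shape[1])
        if hasattr(faiss, "get_num_gpus") and faiss.get_num_gpus() > 0:
            res = faiss.StandardGpuResources()
            # the fix: keep using the GPU handle, not the original CPU index
            index = faiss.index_cpu_to_gpu(res, 0, index)
        index.add(X)
        return index.search(X, k)  # IndexFlatL2 returns squared L2 distances
    except ImportError:
        # dependency-free fallback: exact squared-L2 search in NumPy,
        # matching IndexFlatL2's squared-distance convention
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
        indices = np.argsort(sq, axis=1)[:, :k]
        return np.take_along_axis(sq, indices, axis=1), indices

X = np.random.RandomState(0).normal(size=(64, 8)).astype(np.float32)
distances, indices = brute_force_nn(X, 5)
```

Either way, each point comes back as its own nearest neighbor with distance ~0, which is a quick sanity check that the search ran correctly.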