scverse/rapids_singlecell

Reproducibility

Closed this issue · 7 comments

grst commented

Describe the bug
Hi @Intron7,

many thanks for putting together this package! The speedup is really mindblowing for larger datasets and makes my day-to-day analyses a lot more fun!

I know that reproducibility for this kind of algorithms is hard, and there's probably no way to get reproducibility between different devices, driver versions etc. but I think it should be possible to run the same code in the same environment on the same machine and get exactly the same results for all algorithms.

I tested this for some example data and found the following

reproducible function
rsc.tl.harmony
✔️ rsc.tl.neighbors
rsc.tl.umap
✔️ rsc.tl.leiden

Are there maybe any additional seeds that need to be set to make this work consistently?

Steps/Code to reproduce bug

import rapids_singlecell as rsc
import scanpy as sc
import pooch
import numpy.testing as npt
import anndata as ad

# Get example data and build AnnData object with two full samples
# Source: https://scverse-tutorials.readthedocs.io/en/latest/notebooks/basic-scrna-tutorial.html
EXAMPLE_DATA = pooch.create(
    path=pooch.os_cache("scverse_tutorials"),
    base_url="doi:10.6084/m9.figshare.22716739.v1/",
)
EXAMPLE_DATA.load_registry_from_doi()
samples = {
    "s1d1": "s1d1_filtered_feature_bc_matrix.h5",
    "s1d3": "s1d3_filtered_feature_bc_matrix.h5",
}
adatas = {}

for sample_id, filename in samples.items():
    path = EXAMPLE_DATA.fetch(filename)
    sample_adata = sc.read_10x_h5(path)
    sample_adata.var_names_make_unique()
    adatas[sample_id] = sample_adata

adata = ad.concat(adatas, label="sample")
adata.obs_names_make_unique()

# basic preprocessing in scanpy
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.tl.pca(adata)
adata2 = adata.copy()

for ad in [adata, adata2]:
    rsc.tl.harmony_integrate(ad, key="sample")
    # don't use harmony -- we want to know if neighbors gives different result, even given the same PCA
    rsc.tl.neighbors(ad, use_rep="X_pca")
    rsc.tl.umap(ad)
    rsc.tl.leiden(ad)

npt.assert_array_equal(adata.obsm["X_pca_harmony"], adata2.obsm["X_pca_harmony"])
npt.assert_array_equal(adata.obsp["connectivities"].data, adata2.obsp["connectivities"].data)
npt.assert_array_equal(adata.obsm["X_umap"], adata2.obsm["X_umap"])
npt.assert_array_equal(adata.obs["leiden"].values, adata.obs["leiden"].values)

Environment details (please complete the following information):

  • Environment location: Bare-metal
  • Linux Distro/Architecture: Scientific Linux 7 (3.10.0-1160.11.1.el7.x86_64)
  • GPU Model/Driver: Tesla V100-SXM2-32GB, 470.129.06
  • CUDA: 11.4
  • Method of Rapids install: conda, tried both the 23.04 and 23.06 yml files provided in this repo.

Hi Gregor,

Glad to hear you like RSC. For UMAP I have bad news for you. This is what cuML says about it

random_state is the seed used by the random number generator during embedding initialization and during sampling used by the optimizer. Note: Unfortunately, achieving a high amount of parallelism during the optimization stage often comes at the expense of determinism, since many floating-point additions are being made in parallel without a deterministic ordering. This causes slightly different results across training sessions, even when the same seed is used for random number generation. Setting a random_state will enable consistency of trained embeddings, allowing for reproducible results to 3 digits of precision, but will do so at the expense of potentially slower training and increased memory usage.

For Harmony I'll take a look and see if I can improve reproducibility. I think for this one it might be possible.

Thank you for providing the test code.

Thank you for pointing lack of reproducibility. As it turns out some of the setup ups I used for the random_state was faulty. I'll fix this in the anndata_PR #60.

100% reproducibility will still not happen due to the async nature of GPU-computing and floats being floats.

With 32-bit floats i got the assertion error too:

Mismatched elements: 113906 / 856250 (13.3%)
Max absolute difference: 8.106232e-06
Max relative difference: 0.00205672

With 64-bit floats the error is:

Mismatched elements: 850179 / 856250 (99.3%)
Max absolute difference: 4.16378043e-12
Max relative difference: 1.41103855e-08

Now the question is how to move on from here. Would you like the option in harmony_integrate to use fp64?

grst commented

Setting a random_state will enable consistency of trained embeddings, allowing for reproducible results to 3 digits of precision, but will do so at the expense of potentially slower training and increased memory usage

This doesn't sound too bad if they are referring to three digits of precision of the output coordinates. Or am I misunderstanding this?

100% reproducibility will still not happen due to the async nature of GPU-computing and floats being floats.

The new results seem pretty similar indeed. I wonder if they still impact the neighborhood graph and the leiden clustering, as this is what matters to me (manually annotating clusters is even more painful if the cluster labels change all the time).
Since the max abs. difference is in the e-6 also for float32 I also wonder what happens if we just round the values to 5 digits?

So if you use leiden clustering I would in general advice to use rapids-23.04. In 23.06 onwards it's beyond broken #44.

Since the max abs. difference is in the e-6 also for float32 I also wonder what happens if we just round the values to 5 digits?

I will have a look at the Impacted this has on the neighbourhood search and clustering.

This doesn't sound too bad if they are referring to three digits of precision of the output coordinates. Or am I misunderstanding this?

I set the default random_state of the umap to 0. So this should already apply in theory. I can try changing it to 42 or something else.

grst commented

I set the default random_state of the umap to 0. So this should already apply in theory. I can try changing it to 42 or something else.

I see! I doubt changing the seed makes a difference. It's then either their bug or a bit unfortunate wording of the doc section you shared and it's just not possible.

Here's what it looks like in practice:
image
image

Probably not equal to 3 significant digits, but given that it's UMAP and the clustering is reproducible I don't even care that much.

Ok so with the next release of rapids-singlecell v0.9.0 the rsc.pp.harmony_integrate will be reproducible for neighbors and clustering. I switched the calculations to 64 bit.

Most calculations were anyway done with 64 bit floats so it won't affect performance and memory usage too much.

Hello,

I don't want to reopen this issue. I just wanted to double check if the current rapids version allows full reproducibility or not, because to my experience so far (speaking of rerunning the same code on the same system twice) is that I can't reproduce my UMAPs.

I also wanted to add in terms of harmony, the inconsistencies started after version 0.0.6, so I just install the 0.0.6 version for peace of mind in my other pipelines. Not sure if rapids could benefit from that as well.

Best,
Anastasia