slowkow/harmonypy

Results not reproducible

Closed this issue · 18 comments

It looks like the recent update to the KMeans method made the results not reproducible. Please see the comment here
9c081b8#commitcomment-98243211

I've tried to set random_state, but I still can't get exactly the same results from KMeans. The original kmeans2 method doesn't seem to have this irreproducibility issue. Maybe you could give users an option to run harmony with either method, depending on whether they need 100% reproducibility or need to compute more efficiently?

#20

Hi Hurley, thanks for opening the issue.

I agree that random_state should be settable by the users. That's an important feature to add.

However, I am surprised by your report that random_state does not control the reproducibility. Have you considered reporting this issue to the Kmeans developers? Or do you think this issue is caused by the code in harmonypy?

Could I please ask if you might consider sharing a small code example to demonstrate the issue? Maybe we'll learn something.

Otherwise, I also agree with you that we should consider using the fast Kmeans as a default option and then give the user the freedom to switch to a slower Kmeans if they want better control over reproducibility.

It's a known issue for KMeans. Here are some simple tests.

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from scipy.cluster.vq import kmeans2

np.random.seed(10)

df = pd.DataFrame(np.random.randn(10000, 30))

KMeans method

## Run #1
model = KMeans(n_clusters=20, init='k-means++',
               n_init=10, max_iter=25,
               random_state=10)
model.fit(df)

km_centroids_1, km_labels_1 = model.cluster_centers_, model.labels_

## Run #2
model2 = KMeans(n_clusters=20, init='k-means++',
                n_init=10, max_iter=25,
                random_state=10)
model2.fit(df)

km_centroids_2, km_labels_2 = model2.cluster_centers_, model2.labels_

# are they the same?
np.array_equal(km_centroids_1, km_centroids_2)

False

km_centroids_1[0] == km_centroids_2[0]

array([False, False, False, True, False, False, False, False, False,
True, False, False, True, False, False, False, False, False,
True, True, True, False, False, False, True, True, True,
False, False, False])

# some examples:
print(km_centroids_1[0,0])
print(km_centroids_2[0,0])
print(km_centroids_1[0,1])
print(km_centroids_2[0,1])
print(km_centroids_1[0,2])
print(km_centroids_2[0,2])

-0.13421938109443926
-0.1342193810944393
-0.5070563285868596
-0.5070563285868595
-0.15914531661365244
-0.15914531661365242
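For scale, the largest discrepancy can be measured directly (a minimal check, reusing the centroid arrays from the two runs above):

# maximum absolute difference between the two "identical" runs;
# for float64 data this lands at machine-epsilon scale (~1e-16)
print(np.max(np.abs(km_centroids_1 - km_centroids_2)))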

kmeans2 method

km_centroids_1, km_labels_1 = kmeans2(df, 10, minit='++', seed = 10)
km_centroids_2, km_labels_2 = kmeans2(df, 10, minit='++', seed = 10)

# are they the same?
np.array_equal(km_centroids_1, km_centroids_2)

True

km_centroids_1[0] == km_centroids_2[0]

array([ True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True])

# some examples:
print(km_centroids_1[0,0])
print(km_centroids_2[0,0])
print(km_centroids_1[0,1])
print(km_centroids_2[0,1])
print(km_centroids_1[0,2])
print(km_centroids_2[0,2])

0.3156416453178541
0.3156416453178541
-0.1583557075910181
-0.1583557075910181
0.6063544132513228
0.6063544132513228

I tried different precisions (float32, float64); somehow KMeans just fails on that last digit for some results. It seems like a small issue if you only look at the numbers, but once you use the harmony results to generate the downstream kNN graph and UMAP for single-cell data, the UMAP and Leiden clustering look very different between runs.

Here is the Kmeans random seed issue reported at scikit-learn: scikit-learn/scikit-learn#23886

@johnarevalo Hey John, see my previous comment for a link to an issue on the scikit-learn repo where we can see that @hurleyLi is technically correct. The Kmeans function has almost, but not exactly, zero fluctuations in the output when we use the same random seed. This is an issue inside the scikit-learn code, not harmonypy.

Anyway, a real world analysis involves multiple stochastic algorithms (e.g. Kmeans, UMAP, etc.), so I would not worry about tiny (1e-16) fluctuations in any analysis pipeline.


Totally agree. Either KMeans or kmeans2 will suffice for the analysis needs -- most of the time it doesn't matter. It's just that sometimes we're asked to 100% reproduce the UMAP figures or results coming out of the pipeline, so we have to use the kmeans2 method in that case.

@johnarevalo per your question in #26, if you set the relative tolerance of np.testing.assert_allclose() to 0, you'll see that it fails the test. But again, it only matters when you're looking for 100% reproducibility. All in all, I think it's important to keep an option for users who need 100% reproducibility.
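For concreteness, here is a minimal sketch of that tolerance point, using two of the centroid values printed earlier:

import numpy as np

# two values from the KMeans runs above that differ only in the last digit
a = np.array([-0.13421938109443926])
b = np.array([-0.1342193810944393])

np.testing.assert_allclose(a, b)  # passes: default rtol is 1e-7

try:
    # atol defaults to 0, so rtol=0 demands bit-for-bit equality
    np.testing.assert_allclose(a, b, rtol=0)
except AssertionError:
    print("fails with rtol=0")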

Thank you both for your explanations!

I think it is inaccurate to conclude that, after setting the seed, the corrected vectors are not reproducible. The default tolerance of assert_allclose is 1e-7, so from the harmonypy perspective, the results are reproducible IMO.

In this case, the major issue is UMAP's sensitivity to such tiny fluctuations. Have you considered exploring other implementations, or perhaps ugly hacks such as np.floor-ing values before feeding them to the umap library, or maybe using other 2D projections?
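For example, a rounding hack along these lines might stabilize the downstream kNN graph (just a sketch; Z_corr stands for the Harmony-corrected embedding, and the choice of decimals is arbitrary):

import numpy as np

# hypothetical workaround: truncate the corrected embedding to a fixed
# precision so 1e-16-level fluctuations cannot flip neighbor rankings
# before the kNN/UMAP step
Z_stable = np.round(Z_corr, decimals=10)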

All in all, I'd be hesitant to add kmeans2 as an option just to deal with numerical precisions at this level. This will only solve exact reproducibility for same environment. Across different platforms, architectures, libraries' versions, etc, it's very likely that you will face similar fluctuations.


Not sure why you're against being flexible and adding an alternative option for users who seek reproducibility... In our production pipeline, the compute environment (platform, libraries, etc.) is strictly controlled and versioned; so, no, I don't face similar fluctuations...

Anyway, I hope this post helps guide future users to figure out why their single-cell clustering and UMAP look different between runs. But PR #26 does provide a solution to fix this issue, for your consideration @slowkow. Please feel free to close the issue if there are no further comments. Thanks

By facing similar fluctuations, I didn't mean in this specific use case, but for any harmonypy user. My reasons for not including this param as-is are:

  • Setting kmeans_method="scipy_kmeans2" does not guarantee the exact floating-point reproducibility sought in this issue. As Hurley mentioned, it also requires that the environment be strictly controlled.
  • Such precision is not within the scope of the harmonypy library. There are other efforts (e.g. pycrlibm, rlibm) addressing this specific task.

I suggest modifying the PR so that, instead of the kmeans_method string, it reads an np.array of precomputed centroids. This allows using centroids from scipy's kmeans2, and it also adds support for other initialization methods without coupling them to the library.

Has the 'kmeans_method' param been added to the new version of harmonypy? If so, please tell me the version number, thank you.

Hi @JianjGit, why don't you try the latest version?

Hi everyone,

Thank you all for this very useful discussion; it really explains the UMAP fluctuations I was seeing despite setting the random state in harmonypy. I am also working with an extremely controlled environment, and for reproducibility purposes for our manuscript submission I would like my UMAPs to be identical if the pipeline is rerun. I am using the latest version of harmonypy, but I have struggled to find exactly how to use kmeans2. Could you please provide a working code example so I can try it on my end?

Cheers!
Anastasia

Check the test that uses a custom clustering function:

def test_cluster_fn():

Thank you! 🙏

@kousaa To save you some time: if you're looking for identical UMAPs like I was, the 9c081b8 commit made that impossible. It changes the clustering method from scipy.cluster.vq.kmeans2 to sklearn.cluster.KMeans. The author didn't want to include the original kmeans2 method as an alternative option, for whatever reason, so I had to make a customized version of the package that includes the original kmeans2 method as an option.
Also, in test_harmony.py, if you set the relative tolerance of np.testing.assert_allclose() to 0, you'll see that the program fails the test...
Good luck!
Hurley

The test is an example of how to use scipy.cluster.vq.kmeans2 instead of sklearn.cluster.KMeans:

centroid, label = kmeans2(data, K, minit='++', seed=0)

The assertion does not use assert_allclose but assert_equal:

np.testing.assert_equal(run(cluster_fn), run(cluster_fn))
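Putting it together, a self-contained sketch might look like this (the (data, K) signature and the cluster_fn parameter follow the test above; data_mat and meta_data are made-up inputs, so check them against your installed harmonypy version):

import numpy as np
import pandas as pd
from scipy.cluster.vq import kmeans2
from harmonypy import run_harmony

def cluster_fn(data, K):
    # a fixed seed makes kmeans2 bit-for-bit reproducible,
    # unlike sklearn's KMeans
    centroid, label = kmeans2(data, K, minit='++', seed=0)
    return centroid

# made-up example inputs: 500 cells x 20 PCs, two batches
data_mat = np.random.randn(500, 20)
meta_data = pd.DataFrame({'batch': np.repeat(['a', 'b'], 250)})

ho1 = run_harmony(data_mat, meta_data, ['batch'], cluster_fn=cluster_fn)
ho2 = run_harmony(data_mat, meta_data, ['batch'], cluster_fn=cluster_fn)
np.testing.assert_equal(ho1.Z_corr, ho2.Z_corr)  # exact match expected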

Thank you both for the help! I will try giving test_harmony.py a go, or alternatively I thought of just going back to a previous version before that fix. I am using harmonypy version 0.0.6 in another environment of mine and that remains reproducible. Just another option, I guess. Thanks again, Anastasia