YingfanWang/PaCMAP

Stochasticity found after setting a random seed

ian425 opened this issue · 9 comments

Hello, we are working with PaCMAP and found stochasticity when testing with one of our datasets. We used init = 'PCA' and kept apply_pca=True. The first time we ran the code we got 5 clusters (with a random seed of 20). We then ran the code again with the same random seed and without changing any parameter, and got 4 clusters.
Is this stochasticity expected even when a specific random state is set?

Thanks.

No, it is not expected. You should get exactly the same embedding if the random seed is specified. Could you provide more details about the experiment, including the version of PaCMAP you are using? I'll look into this right away.

Thank you. We are using the latest version of PaCMAP. Let me discuss with my team to see what more details I could mention.

Thank you for your patience. We are using pacmap==0.5.3.
The following are the parameters.
embedding = pacmap.PaCMAP(n_dims = 2, n_neighbors = 10, lr = 1, random_state = 20, apply_pca = True).fit_transform(X, init="pca")
The input X has dimensions of roughly 2000x4000.

I see. I will try to replicate the error using a randomly generated matrix of the same shape.

Hmmm, it's so weird. On my end, the output is deterministic. I used the following script for the test:

```python
import pacmap
import numpy as np
import matplotlib.pyplot as plt

# Initialize
pacmap.PaCMAP()

# print
print(pacmap.PaCMAP())
np.random.seed(0) # Removing this does not make a difference
sample_data = np.random.normal(size=(2000, 4000))
instance1 = pacmap.PaCMAP(n_dims = 2, n_neighbors = 10, lr = 1, random_state = 20, apply_pca = True)
instance1_out = instance1.fit_transform(sample_data, init="pca")
instance2 = pacmap.PaCMAP(n_dims = 2, n_neighbors = 10, lr = 1, random_state = 20, apply_pca = True)
instance2_out = instance2.fit_transform(sample_data)
print('Experiment finished successfully.')

print(instance1_out[:3, :3])
print(instance2_out[:3, :3])

# Compare the two embeddings
try:
    assert(np.sum(np.abs(instance1_out-instance2_out))<1e-8)
    print("The output is deterministic.")
except AssertionError:
    print("The output is not deterministic.")
# Compare the sampled further pairs (FP) and mid-near pairs (MN)
try:
    assert(np.sum(np.abs(instance1.pair_FP.astype(int)-instance2.pair_FP.astype(int)))<1e-8)
    assert(np.sum(np.abs(instance1.pair_MN.astype(int)-instance2.pair_MN.astype(int)))<1e-8)
except AssertionError:
    print('The pairs are not deterministic')
    # Report the first mismatching pair of each kind
    for i in range(5000):
        if np.sum(np.abs(instance1.pair_FP[i] - instance2.pair_FP[i])) > 1e-8:
            print("FP")
            print(i)
            print(instance1.pair_FP[i])
            print(instance2.pair_FP[i])
            break
    for i in range(5000):
        if np.sum(np.abs(instance1.pair_MN[i] - instance2.pair_MN[i])) > 1e-8:
            print('MN')
            print(i)
            print(instance1.pair_MN[i])
            print(instance2.pair_MN[i])
            break
```

I ran this script several times and it always gives me the same output. I guess the problem could be the versions of related packages. I'm using numpy==1.20.3 and numba==0.53.1. Could you try to run the script I provided and check the output? You should see something like

```
PaCMAP(random_state=0)
Experiment finished successfully.
[[-1.0776378 -0.87773395]
 [ 1.8569145 2.5031502 ]
 [ 3.54952 2.6133642 ]]
[[-1.0776378 -0.87773395]
 [ 1.8569145 2.5031502 ]
 [ 3.54952 2.6133642 ]]
The output is deterministic.
```
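Since differing package versions are one suspected cause, it can help to post the installed versions alongside the script output. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is made up for illustration, and the package list simply mirrors the packages named in this thread.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: installed version, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Packages mentioned in this thread; extend the list as needed.
    for name, version in installed_versions(["pacmap", "numpy", "numba"]).items():
        print(f"{name}=={version}" if version else f"{name}: not installed")
```

Running this in both environments and comparing the printed lines makes any version mismatch easy to spot.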

Hi, I ran the code you provided and got matching results for instance 1 and instance 2, so the output was deterministic. But I tried adding more instances, as shown here:

```python
import pacmap
import numpy as np
import matplotlib.pyplot as plt

pacmap.PaCMAP()
print(pacmap.PaCMAP())
np.random.seed(0) # Removing this does not make a difference
sample_data = np.random.normal(size=(2000, 4000))
instance1 = pacmap.PaCMAP(n_dims = 2, n_neighbors = 10, lr = 1, random_state =20, apply_pca = True)
instance1_out = instance1.fit_transform(sample_data, init="pca")
instance2 = pacmap.PaCMAP(n_dims = 2, n_neighbors = 10, lr = 1, random_state =20, apply_pca = True)
instance2_out = instance2.fit_transform(sample_data, init="pca")
instance3 = pacmap.PaCMAP(n_dims = 2, n_neighbors = 10, lr = 1, random_state =20, apply_pca = True)
instance3_out = instance2.fit_transform(sample_data, init="pca")
instance4 = pacmap.PaCMAP(n_dims = 2, n_neighbors = 10, lr = 1, random_state =20, apply_pca = True)
instance4_out = instance2.fit_transform(sample_data, init="pca")
print('Experiment finished successfully.')

print(instance1_out[:3, :3])
print(instance2_out[:3, :3])

try:
    assert(np.sum(np.abs(instance1_out-instance2_out))<1e-8)
    print("The output is deterministic.")
except AssertionError:
    print("The output is not deterministic.")
try:
    assert(np.sum(np.abs(instance1.pair_FP.astype(int)-instance2.pair_FP.astype(int)))<1e-8)
    assert(np.sum(np.abs(instance1.pair_MN.astype(int)-instance2.pair_MN.astype(int)))<1e-8)
except AssertionError:
    print('The pairs are not deterministic')

print(instance3_out[:3, :3])
print(instance4_out[:3, :3])

try:
    assert(np.sum(np.abs(instance3_out-instance4_out))<1e-8)
    print("The output is deterministic.")
except AssertionError:
    print("The output is not deterministic.")
try:
    assert(np.sum(np.abs(instance3.pair_FP.astype(int)-instance4.pair_FP.astype(int)))<1e-8)
    assert(np.sum(np.abs(instance3.pair_MN.astype(int)-instance4.pair_MN.astype(int)))<1e-8)
except AssertionError:
    print('The pairs are not deterministic')

for i in range(5000):
    if np.sum(np.abs(instance1.pair_FP[i] - instance2.pair_FP[i])) > 1e-8:
        print("FP")
        print(i)
        print(instance1.pair_FP[i])
        print(instance2.pair_FP[i])
        break
for i in range(5000):
    if np.sum(np.abs(instance1.pair_MN[i] - instance2.pair_MN[i])) > 1e-8:
        print('MN')
        print(i)
        print(instance1.pair_MN[i])
        print(instance2.pair_MN[i])
        break
```

And got a result of:

```
Experiment finished successfully.
[[-1.3624908 -0.8331049]
 [ 1.7766521 -1.0958917]
 [-2.8058937 2.7767875]]

[[-1.3624908 -0.8331049]
 [ 1.7766521 -1.0958917]
 [-2.8058937 2.7767875]]
The output is deterministic.

[[-1.0923495 2.3987863]
 [-0.521751 -2.8778036]
 [ 2.1016634 -2.5270624]]

[[-1.0923495 2.3987863]
 [-0.521751 -2.8778036]
 [ 2.1016634 -2.5270624]]
The output is deterministic.
```

As shown, instances 1 and 2 match each other, and instances 3 and 4 match each other, but instances 1 and 2 do not match instances 3 and 4. Shouldn't all of them be identical if they are given the same seed?

Seems like you are reusing instance 2 here for the output of instance 3 and 4, as shown in the line:

instance3_out = instance2.fit_transform(sample_data, init="pca")

The weird behavior here could be related to reusing some of the instances. I will try to run your code on my end to see the output behavior. Did you experience the problem you reported when you were trying to reuse some instances in your experiment?
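The thread does not establish why a refit on an already-fitted instance differs from the first fit. One possible explanation, offered purely as an assumption, is that the first fit leaves state on the object that changes the path taken by later fits. The toy class below is entirely hypothetical (it is not PaCMAP's code and uses only the standard library); it reproduces the observed pattern and shows a small helper, `run_fresh`, that avoids the issue by constructing a new estimator for every run.

```python
import random


class ToyEmbedder:
    """Hypothetical estimator that keeps state between fits; NOT PaCMAP's code."""

    def __init__(self, random_state):
        self.random_state = random_state
        self.cached_pairs = None  # filled in by the first fit

    def fit_transform(self, data):
        # The generator is reseeded on every call.
        rng = random.Random(self.random_state)
        if self.cached_pairs is None:
            # Only the first fit samples helper state, consuming random numbers.
            self.cached_pairs = [rng.randrange(len(data)) for _ in data]
        # Later fits skip the branch above, so the generator is in a
        # different state by the time it reaches this line.
        return [round(x + rng.uniform(-1, 1), 4) for x in data]


def run_fresh(make_estimator, data):
    """Build a new estimator for every run so no state carries over."""
    return make_estimator().fit_transform(data)


if __name__ == "__main__":
    data = [0.0, 1.0, 2.0]
    fresh_a = run_fresh(lambda: ToyEmbedder(20), data)
    fresh_b = run_fresh(lambda: ToyEmbedder(20), data)
    print("fresh runs identical:", fresh_a == fresh_b)

    reused = ToyEmbedder(20)
    first = reused.fit_transform(data)
    second = reused.fit_transform(data)
    third = reused.fit_transform(data)
    print("first fit matches fresh run:", first == fresh_a)
    print("refits match each other:", second == third)
    print("refit matches first fit:", second == first)
```

In this toy model, every refit is itself repeatable but differs from the first fit, which mirrors the instance 3 and 4 results reported above. Whatever the real cause turns out to be, creating a fresh instance per run is the safe way to compare outputs.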

Yes, thank you for correcting me. You are right: after correcting that and running it again, they all turned out to be deterministic.

```
Experiment finished successfully.
[[-1.3624908 -0.8331049]
 [ 1.7766521 -1.0958917]
 [-2.8058937 2.7767875]]

[[-1.3624908 -0.8331049]
 [ 1.7766521 -1.0958917]
 [-2.8058937 2.7767875]]

The output is deterministic.
[[-1.3624908 -0.8331049]
 [ 1.7766521 -1.0958917]
 [-2.8058937 2.7767875]]

[[-1.3624908 -0.8331049]
 [ 1.7766521 -1.0958917]
 [-2.8058937 2.7767875]]

The output is deterministic.
```

I see. Does this solve your problem completely? Can I close this issue?