YingfanWang/PaCMAP

PaCMAP is stochastic with sparse data even if random seed is set

Closed this issue · 7 comments

Hi all,

I noticed that the current version of PaCMAP seems to give stochastic results on sparse binary data even if the random seed is set. Below is an example with randomly generated data -- in the first case, the data is sparse and the result is stochastic on my machine; in the second case, the data is not sparse and the result is deterministic.

I have not given deeper thought to whether it is reasonable to run PaCMAP on sparse data, but it would still be nice if the result were deterministic when the random seed is set.

Could you have a look into this?
Many thanks!

import pacmap
import numpy as np

###----
### Stochastic when there's sparsity
###----
print('\n\nBinary data with high proportion of zeros')

# Create a binary dataset with low frequency of 1s
rng = np.random.default_rng(seed=42)
d = rng.binomial(n=1,p=0.01,size=(1000,100))

# Just in case - if any columns are all zeros, remove these
mask = d.sum(axis=0) > 0
d = d[:,mask]
print('Shape of data: \n{}'.format(d.shape))
print('Number of nonzero elements in each column: \n{}'.format(d.sum(axis=0)))
print('Snapshot of data: \n{}'.format(d[:10,:10]))

# Pacmap
e = pacmap.PaCMAP(n_components=2, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0, random_state=42)
t = e.fit_transform(d, init="pca")

e2 = pacmap.PaCMAP(n_components=2, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0, random_state=42)
t2 = e2.fit_transform(d, init="pca")

# Test
print(t[:3, :3])
print(t2[:3, :3])
test = np.sum(np.abs(t-t2)) < 1e-8
print('Difference between values of two runs is less than 1e-8? {}'.format(test))
test2 = (t == t2).all()
print('Values between two runs are equal? {}'.format(test2))

###----
### Deterministic when there's less sparsity
###----
print('\n\nBinary data with roughly equal proportion of zeros and ones')

# Create a binary dataset with roughly equal frequency of 0s and 1s
rng = np.random.default_rng(seed=42)
d = rng.binomial(n=1,p=0.5,size=(1000,100))

# Just in case - if any columns are all zeros, remove these
mask = d.sum(axis=0) > 0
d = d[:,mask]
print('Shape of data: \n{}'.format(d.shape))
print('Number of nonzero elements in each column: \n{}'.format(d.sum(axis=0)))
print('Snapshot of data: \n{}'.format(d[:10,:10]))

# Pacmap
e = pacmap.PaCMAP(n_components=2, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0, random_state=42)
t = e.fit_transform(d, init="pca")

e2 = pacmap.PaCMAP(n_components=2, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0, random_state=42)
t2 = e2.fit_transform(d, init="pca")

# Test
print(t[:3, :3])
print(t2[:3, :3])
test = np.sum(np.abs(t-t2)) < 1e-8
print('Difference between values of two runs is less than 1e-8? {}'.format(test))
test2 = (t == t2).all()
print('Values between two runs are equal? {}'.format(test2))

##
print('PaCMAP version: {}, numpy version: {}'.format(pacmap.__version__, np.__version__))

The output I get is:

Binary data with high proportion of zeros
Shape of data: 
(1000, 100)
Number of nonzero elements in each column: 
[12 16 11 11 11 14  7  6  8 11  8 10 11  9  8 13 10 11  5 10  9 11  9  9
 11 21 14 11 13 12 15  8 11  7 10 11  6  8  6 10  7  9  9 12  6 15  5  9
  8  7  9  8 11  9 15  5  6  8 11 14 12 13  7  9  8  6 12  8  6 12 12 14
 17  7 14 12 12  8 10 13  6 13 15 12  6  7 10 13  9 13  8  7  7 17  9 12
  9 10 15  7]
Snapshot of data: 
[[0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 1]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 1]
 [0 0 0 0 0 0 0 0 0 0]]
C:\Users\andres.tamm\Anaconda3\lib\site-packages\pacmap\pacmap.py:774: UserWarning: Warning: random state is set to 42
  warnings.warn(f'Warning: random state is set to {_RANDOM_STATE}')
C:\Users\andres.tamm\Anaconda3\lib\site-packages\pacmap\pacmap.py:774: UserWarning: Warning: random state is set to 42
  warnings.warn(f'Warning: random state is set to {_RANDOM_STATE}')
[[ 0.8841729   2.0089896 ]
 [-0.5350942   0.95636046]
 [ 1.7903003   2.135278  ]]
[[ 0.73095715 -2.7870746 ]
 [ 0.6337722  -0.6057222 ]
 [ 1.952701   -2.0199978 ]]
Difference between values of two runs is less than 1e-8? False
Values between two runs are equal? False


Binary data with roughly equal proportion of zeros and ones
Shape of data: 
(1000, 100)
Number of nonzero elements in each column: 
[506 519 499 509 499 526 520 492 500 511 520 512 510 526 532 489 503 496
 502 487 491 508 509 518 496 467 488 510 498 516 488 520 516 509 495 511
 485 498 528 520 514 472 481 522 488 513 491 520 499 507 468 513 515 510
 493 489 496 508 480 523 494 518 489 491 514 507 489 507 489 520 501 487
 502 505 498 476 503 517 507 474 502 498 513 492 518 493 502 491 511 494
 504 508 472 500 505 534 478 497 533 493]
Snapshot of data: 
[[1 0 1 1 0 1 1 1 0 0]
 [1 1 0 1 1 1 0 0 0 1]
 [1 1 1 0 0 0 0 1 0 1]
 [1 0 1 1 1 0 0 1 0 0]
 [0 1 1 0 1 1 0 1 0 1]
 [1 1 1 1 1 1 0 1 0 0]
 [0 1 1 1 1 1 1 0 1 0]
 [0 0 0 1 0 1 1 1 0 0]
 [1 0 1 1 0 0 1 0 1 1]
 [0 0 0 0 1 1 0 0 0 0]]
C:\Users\andres.tamm\Anaconda3\lib\site-packages\pacmap\pacmap.py:774: UserWarning: Warning: random state is set to 42
  warnings.warn(f'Warning: random state is set to {_RANDOM_STATE}')
C:\Users\andres.tamm\Anaconda3\lib\site-packages\pacmap\pacmap.py:774: UserWarning: Warning: random state is set to 42
  warnings.warn(f'Warning: random state is set to {_RANDOM_STATE}')
[[ 1.0503646 -2.2979746]
 [ 1.6191299  2.3795164]
 [-3.1472695 -1.3276936]]
[[ 1.0503646 -2.2979746]
 [ 1.6191299  2.3795164]
 [-3.1472695 -1.3276936]]
Difference between values of two runs is less than 1e-8? True
Values between two runs are equal? True
PaCMAP version: 0.6.3, numpy version: 1.21.1

Just an additional note: if I set the size of my random sparse dataset to (10000, 1000), I get a deterministic result. But if I then increase the sparsity by setting p=0.001, the result becomes stochastic again (i.e. setting d = rng.binomial(n=1,p=0.001,size=(10000,1000))).
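For reference, a minimal sketch of how those configurations could be compared side by side, reusing only parameters already shown in this issue (the is_deterministic helper is hypothetical, not part of PaCMAP):

import pacmap
import numpy as np

def is_deterministic(d, seed=42):
    # Fit PaCMAP twice with the same random_state and compare the embeddings
    runs = []
    for _ in range(2):
        e = pacmap.PaCMAP(n_components=2, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0, random_state=seed)
        runs.append(e.fit_transform(d, init="pca"))
    return np.allclose(runs[0], runs[1])

rng = np.random.default_rng(seed=42)
for size, p in [((1000, 100), 0.01), ((10000, 1000), 0.01), ((10000, 1000), 0.001)]:
    d = rng.binomial(n=1, p=p, size=size)
    print(size, p, 'deterministic:', is_deterministic(d))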

Thank you for bringing this up -- this issue is very well written! In the release notes for 0.5.0, we noted an edge case where users may get slightly different results even if the random_state parameter is set to the same value, due to a numba parallelization issue. However, that does not seem to be what is happening in your first snippet. I will try to replicate the problem locally and look into it.

Thank you! It would be interesting to know what the solution will be. I hope the issue is reproducible (and not a typo in my code). Please let me know if I can help in any way.

Out of curiosity, I also ran the first part of my code 10 times and computed the correlation between the embeddings from the two runs. They were usually similar, but sometimes quite different -- [0.68, 0.75, 0.85, -0.15, 0.69, 0.81, 0.71, 0.49, 0.73, 0.81].
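For reference, one way such a correlation could be computed -- a sketch only, assuming a Pearson correlation over the flattened 2-D embeddings, since the exact measure is not stated above:

import numpy as np

# t and t2 are the (n_samples, 2) embeddings from the two runs in the first snippet
corr = np.corrcoef(t.ravel(), t2.ravel())[0, 1]
print('Correlation between the two embeddings: {:.3f}'.format(corr))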

… due to some numba parallelization issue.

I think it would be useful to be able to turn this parallelisation off, so that the output is deterministic with the same random seed.
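For what it's worth, numba's runtime thread count can be limited with numba.set_num_threads (or the NUMBA_NUM_THREADS environment variable). Whether this alone makes PaCMAP deterministic has not been verified here, so the snippet below is only an experiment sketch:

import numba
numba.set_num_threads(1)  # restrict numba's parallel regions to a single thread

import pacmap
e = pacmap.PaCMAP(n_components=2, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0, random_state=42)
t = e.fit_transform(d, init="pca")  # d is the sparse dataset from the first snippet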

My sincere apologies for this long overdue reply. The issue is reproducible, and it is not caused by a typo in your code. In the presented case, many rows (in my run, 358 out of 1000) are completely identical (all zeros), which leads to unstable results when PaCMAP sorts the input to sample the neighbor pairs; np.argsort() seems to produce unstable orderings in this case. This issue is non-trivial, and changes at the algorithm level will be needed to handle it.
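To illustrate the sorting issue: when many rows are identical, their pairwise distances tie exactly, so the ranking among the tied rows is arbitrary, and any tiny numerical perturbation (for example from a different summation order in parallel code) can reshuffle the neighbor choices. A small, self-contained illustration -- not PaCMAP's actual neighbor-search code:

import numpy as np

rng = np.random.default_rng(0)

# Ten points, eight of which are identical all-zero rows, like the sparse data above
x = np.zeros((10, 3))
x[8] = [1.0, 0.0, 0.0]
x[9] = [0.0, 1.0, 0.0]

# Distances from point 0 to all points: the zero rows all tie at distance 0
dist = np.linalg.norm(x - x[0], axis=1)
print(np.argsort(dist))  # the order among the tied zero rows is not meaningful

# A perturbation far below any meaningful tolerance reshuffles the ranking
dist_perturbed = dist + rng.uniform(0, 1e-12, size=dist.shape)
print(np.argsort(dist_perturbed))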

Hi @hyhuang00, thank you for looking into this! I found a workaround that I believe is satisfactory for me -- I replaced all zeros with small random numbers (in [-2e-5, 2e-5]), which then led to stable sorting.
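For anyone landing here later, a minimal sketch of that workaround as described above (the noise range is the one mentioned; variable names are illustrative):

import numpy as np
import pacmap

rng = np.random.default_rng(seed=42)
d_jittered = d.astype(float)  # d is the sparse binary dataset from the first snippet
zero_mask = d_jittered == 0
# Replace exact zeros with tiny uniform noise so that previously identical rows become distinct
d_jittered[zero_mask] = rng.uniform(-2e-5, 2e-5, size=zero_mask.sum())

e = pacmap.PaCMAP(n_components=2, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0, random_state=42)
t = e.fit_transform(d_jittered, init="pca")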
Thanks!

Thank you for sharing your solution! Closing the issue for now.