YingfanWang/PaCMAP

`fit_transform` and `transform` on the same features don't return the same values


Hi, thanks for developing PaCMAP, lovely work!

I found that using transform after using fit_transform on the same set of features yields different results.

I ran the following example:

import pacmap
import numpy as np

np.random.seed(0)

init = "pca"  # results can be reproduced also with "random"

reducer = pacmap.PaCMAP(
    n_components=2, n_neighbors=10, MN_ratio=0.5, FP_ratio=2.0, save_tree=True
)

features = np.random.randn(100, 30)

reduced_features = reducer.fit_transform(features, init=init)
print(reduced_features[:10])

transformed_features = reducer.transform(features)
print(transformed_features[:10])

which prints:

[[ 0.7728913   3.785831  ]
 [-0.69379026  2.116452  ]
 [-1.7770871  -0.97542125]
 [ 2.5090704   1.8718773 ]
 [-0.06890291 -2.2959301 ]
 [ 1.9657456   1.1580495 ]
 [ 1.0486693  -1.4648851 ]
 [-1.4896832   1.7203271 ]
 [ 0.54106015  2.38868   ]
 [ 3.0175838  -1.9216222 ]]

[[-0.03516154  2.543376  ]
 [-0.467008    1.6641414 ]
 [-0.44973713 -1.535601  ]
 [ 1.0218439   1.5691875 ]
 [-0.30733356 -2.3227684 ]
 [ 0.8294033   1.0432268 ]
 [ 0.10503205 -0.8651409 ]
 [-0.63982046  0.59202313]
 [ 0.38573623  1.5135498 ]
 [ 2.0508025  -1.5033388 ]]

I would expect the same results, because fit_transform should be the combination of fit and transform (regardless of implementation details). This is what PCA in sklearn and UMAP do.
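To illustrate the determinism I'm expecting, here is a tiny PCA written in plain NumPy (just an illustration of the concept, not sklearn's actual implementation): because the projection learned by fit is a pure function of the training data, fit_transform and a subsequent transform on the same data are guaranteed to agree.

```python
import numpy as np

# Minimal PCA sketch to show why a deterministic reducer returns
# identical results from fit_transform and a later transform call.
class TinyPCA:
    def __init__(self, n_components):
        self.n_components = n_components

    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        # Right singular vectors of the centered data are the principal axes.
        _, _, vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = vt[: self.n_components]
        return self

    def transform(self, X):
        return (X - self.mean_) @ self.components_.T

    def fit_transform(self, X):
        return self.fit(X).transform(X)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 30))

pca = TinyPCA(n_components=2)
a = pca.fit_transform(X)
b = pca.transform(X)
assert np.allclose(a, b)  # identical: transform is a pure function of the fit
```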

Is this an intended feature? And if the answer is No, what should we do? One possible solution I found is

reducer = reducer.fit(features, init=init)

# Now the following lines return the same features.
reduced_features = reducer.transform(features)
transformed_features = reducer.transform(features)

But this only solves the problem at the implementation level, not at the conceptual level. Since the returned values from fit_transform and transform are different, I'm not sure I can trust the output of transform.

PS: this has nothing to do with the random seed; since I fixed the seed, I get the same results across runs.

Hi there! Thank you for using PaCMAP. The result is expected to be different: in PaCMAP, the transform() function treats the input as additional data points that are appended to the original data. In the current version, transform() tries to place each new input point near its nearest neighbors' low-dimensional embeddings. As a result, there is no guarantee that identical points will always be placed at the same location. This design choice allows the points to be differentiated. However, as we said in the README, this feature is not finalized, and we welcome any feedback on its design. Is there any reason you want two data points with the same value to be placed at the same location?
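To sketch what I mean (this is only an illustration in plain NumPy, not our actual implementation): new points get an initial position derived from their nearest training neighbors' embeddings, and the optimizer then moves them, so even a point identical to a training point need not land exactly on its twin.

```python
import numpy as np

# Illustrative sketch (NOT PaCMAP's actual code) of initializing new
# points from their nearest training neighbors' embeddings.
def place_new_points(X_train, Y_train, X_new, n_neighbors=10):
    # Brute-force nearest neighbors in the high-dimensional space.
    d = np.linalg.norm(X_new[:, None, :] - X_train[None, :, :], axis=-1)
    nn_idx = np.argsort(d, axis=1)[:, :n_neighbors]
    # Initialize each new point at the mean of its neighbors' embeddings.
    # PaCMAP would then refine these positions by optimization.
    return Y_train[nn_idx].mean(axis=1)

rng = np.random.default_rng(0)
X_train = rng.standard_normal((100, 30))
Y_train = rng.standard_normal((100, 2))  # stand-in for a fitted embedding
Y_new = place_new_points(X_train, Y_train, X_train[:5])
print(Y_new.shape)  # (5, 2)
```

Even here, a training point fed back in is averaged with its other neighbors, so its initialization already differs from its original coordinates.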

Thanks for your fast reply! I think conceptually it makes more sense that identical incoming points should be projected to the same place as the old points. Users who used PCA or UMAP before (like me) would expect this behavior.

My specific case was to write a test on our software to check if fit_transform and transform produce the same results.
Since the outcome of this test is false, I disabled certain reproducibility behavior in our software for PaCMAP.

Full disclosure, I haven't read the PaCMAP paper, so I'm not sure whether what I described is doable. If PaCMAP can't mirror sklearn's fit_transform and transform behavior, then I think it makes sense to place a big bold warning in both the README and the documentation.

Is there any reason you want two data points with the same value to be placed at the same location?

Why would you expect any other behaviour from a dimensionality reduction technique? Could you suggest a use case where you don't want this to happen, i.e. where this not happening is useful?


Thank you for your suggestion! A warning has been added to the method, and we will think about ways to improve the transform method.

@hyhuang00 Thanks for your effort, you can close this issue if you want.

Is there any reason you want two data points with the same value to be placed at the same location?

Why would you expect any other behaviour from a dimensionality reduction technique? Could you suggest a use case where you don't want this to happen, i.e. where this not happening is useful?

Ensuring that points with very similar values in the high-dimensional space land at close but distinct places in the low-dimensional space is useful for visualization. It helps the embedding avoid the so-called "crowding problem" during optimization, and it sometimes helps our users see that multiple points occupy the same region, forming a cluster. This might be less helpful when the embedding is used for other purposes. Perhaps we can add an option to allow different behavior.

Why would you expect any other behaviour from a dimensionality reduction technique? Could you suggest a use case where you don't want this to happen, i.e. where this not happening is useful?

Ensuring that points with very similar values in the high-dimensional space land at close but distinct places in the low-dimensional space is useful for visualization.

Very true, but 'very similar values' and 'the same value' are two different use cases.

TCWO commented

Hi there, I am trying to fit a model on a smaller set and then apply the transform to a bigger set, but I encountered this error, which I assume is about generating the neighbors. Can you let me know how I can handle it?

AssertionError Traceback (most recent call last)
/tmp/ipykernel_623958/2284593526.py in
----> 1 data_all_dr, t_all_dr = DimRed2(data_sampl, data_norm, method = dr, dims=dims)

/tmp/ipykernel_623958/736146467.py in DimRed2(df1, df2, method, dims, pca)
84
85 # Now, use the fitted model to transform a larger dataset (X_large)
---> 86 dr = embedding.transform(X2, init='pca', save_pairs=False)
87
88 end = time.time()

~/.local/lib/python3.8/site-packages/pacmap/pacmap.py in transform(self, X, basis, init, save_pairs)
932 self.apply_pca, self.verbose)
933 # Sample pairs
--> 934 self.pair_XP = generate_extra_pair_basis(basis, X,
935 self.n_neighbors,
936 self.tree,

~/.local/lib/python3.8/site-packages/pacmap/pacmap.py in generate_extra_pair_basis(basis, X, n_neighbors, tree, distance, verbose)
397 npr, dimp = X.shape
398
--> 399 assert (basis is not None or tree is not None), "If the annoyindex is not cached, the original dataset must be provided."
400
401 # Build the tree again if not cached

AssertionError: If the annoyindex is not cached, the original dataset must be provided.

And here is my function (X is the smaller set, X2 the bigger one):
elif method == 'PaCMAP':
    # Slightly different, since we need to convert the dataframe to an array
    # as input for the pacmap function
    start = time.time()
    X = np.asarray(data)
    X = X.reshape(X.shape[0], -1)
    X2 = np.asarray(data2)
    X2 = X2.reshape(X2.shape[0], -1)
    # Setting n_neighbors to None leads to a default choice
    embedding = pacmap.PaCMAP(n_components=dims, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0)
    # Fit the data (the index of transformed data corresponds to the index of the original data)
    #embedding.fit(X, init="pca")
    #dr = embedding.transform(X2)

    # Fit and transform using a smaller dataset (X_small)
    embedding_small = embedding.fit_transform(X, init='pca', save_pairs=True)

    # Now, use the fitted model to transform a larger dataset (X_large)
    dr = embedding.transform(X2, init='pca', save_pairs=False)

    end = time.time()
    t = end - start

Hello, thank you for PaCMAP, beautiful work.

I second this question. I am reaching:

AssertionError: If the annoyindex is not cached, the original dataset must be provided.

when I call the transform method on a new dataset after the model has already been fit on a previous one. It is desirable to be able to transform new data into an existing embedding space.
Can you provide some guidance on this?

EDIT: this was due to the fact that I had not specified save_tree=True. Might be good to spell that out a bit more clearly in the documentation! Thank you :)