YingfanWang/PaCMAP

Storing PaCMAP on DB?

Opened this issue · 5 comments

Hello.
I'm trying to store a fitted PaCMAP model in a DB for later transformations. I tried to pickle it, but that fails because the tree is an annoy.annoy object.
I also tried saving the annoy.annoy object with embedding.tree.save('./annoy_object.ann'). This works, but I cannot load it back, since creating a new PaCMAP instance does not initialize the annoy.annoy tree.
Is there a way to save/load a PaCMAP object or its tree? My main objective is to store it in a DB, so I can transform new incoming data in my clustering pipeline.

Thanks for your attention.

Have you tried to directly load the annoy instance? It could be done using something like this:

embedding = pacmap.PaCMAP() # initialize/load the saved pacmap instance
embedding.tree = load_annoy_tree() # your function that loads the annoy instance

Hello! I did try what you suggested, and even filled in the other attributes that the method requires to run:

u = AnnoyIndex(0)
u.load('test.ann')
embedding.tree  = u
embedding.xmin = emb_model.xmin
embedding.xmax = emb_model.xmax
embedding.xmean = emb_model.xmean
embedding.tsvd_transformer = emb_model.tsvd_transformer
embedding.pair_FP = emb_model.pair_FP
embedding.pair_MN = emb_model.pair_MN
embedding.pair_neighbors = emb_model.pair_neighbors
embedding.n_neighbors = emb_model.n_neighbors
embedding.transform(feature_matrix_c)

But I still get:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
/tmp/ipykernel_26194/902635302.py in <module>
----> 1 embedding.transform(feature_matrix_c)

/opt/conda/lib/python3.9/site-packages/pacmap/pacmap.py in transform(self, X, basis, init, save_pairs)
    932                                      self.apply_pca, self.verbose)
    933         # Sample pairs
--> 934         self.pair_XP = generate_extra_pair_basis(basis, X,
    935                                                  self.n_neighbors,
    936                                                  self.tree,

/opt/conda/lib/python3.9/site-packages/pacmap/pacmap.py in generate_extra_pair_basis(basis, X, n_neighbors, tree, distance, verbose)
    417 
    418     for i in range(npr):
--> 419         nbrs[i, :], knn_distances[i, :] = tree.get_nns_by_vector(
    420             X[i, :], n_neighbors_extra, include_distances=True)
    421 

IndexError: Vector has wrong length (expected 0, got 17)

The problem seems to be in how you initialize the AnnoyIndex. Your data appears to have 17 dimensions, so to load the index you should create it with u = AnnoyIndex(17) instead of u = AnnoyIndex(0).

For some reason I cannot load the saved index with AnnoyIndex(17). I have to load it with AnnoyIndex(18), but then the transform function crashes. I don't know whether this is a PaCMAP problem or an Annoy problem.
But it would be nice to have a PaCMAP function to correctly save and load its models.

I see. We will work on that feature.
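Until such a feature exists, one possible workaround for the "send it to a DB" part is to store the bytes of the saved .ann file as a BLOB, together with the dimension and metric needed to reload it. The sketch below uses SQLite from the standard library; the table name, column names, and the helpers store_index and restore_index are hypothetical, and the demo writes placeholder bytes instead of a real index so that it runs without Annoy installed.

```python
import os
import sqlite3
import tempfile


def store_index(conn, name, ann_path, dim, metric):
    """Store the raw bytes of a saved index file plus reload metadata."""
    with open(ann_path, "rb") as fh:
        blob = fh.read()
    conn.execute(
        "CREATE TABLE IF NOT EXISTS annoy_index "
        "(name TEXT PRIMARY KEY, dim INTEGER, metric TEXT, data BLOB)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO annoy_index VALUES (?, ?, ?, ?)",
        (name, dim, metric, blob),
    )
    conn.commit()


def restore_index(conn, name, out_path):
    """Write the stored bytes back to a file and return (dim, metric)."""
    row = conn.execute(
        "SELECT dim, metric, data FROM annoy_index WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        raise KeyError(name)
    dim, metric, blob = row
    with open(out_path, "wb") as fh:
        fh.write(blob)
    return dim, metric


# Demo with placeholder bytes standing in for a real saved index file.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "annoy_object.ann")
with open(src, "wb") as fh:
    fh.write(b"\x00\x01placeholder index bytes")

conn = sqlite3.connect(":memory:")
store_index(conn, "pacmap_tree", src, dim=17, metric="euclidean")

dst = os.path.join(workdir, "restored.ann")
dim, metric = restore_index(conn, "pacmap_tree", dst)
```

After restoring, the file at dst could be loaded with AnnoyIndex(dim, metric).load(dst) and assigned to embedding.tree as suggested earlier in the thread. The remaining PaCMAP attributes would still have to be saved and restored separately.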