init with user specified matrix

Question

init with user specified matrix

Opened this issue 9 months ago · 6 comments

LJK1991 commented 9 months ago

Hello,

I have a question regardin the initializing with a user specified matrix.
Currently i am creating a pipeline that takes multiple single cell samples and i am correting for batches using the R batchelor which uses multibatchPCA and MNN for batch correction.
Currently i am still using a single dataset to setup the pipeline. However, i run into trouble trying to initialize with my PCA matrix

first i load my dataset (logcounts)

X = pd.read_csv("/home/l.kuijpers/data/public/10.1101-2020.01.26.919753/bone_marrow/results/dimensional_reduction/hvgs_filt_sce_features3000.txt.gz",sep="\t", index_col=0,header=0).to_numpy()
print(X)
print(X.shape)

[[4.71391225 4.47825845 4.31347248 ... 1.07156968 0.         1.22630936]
 [5.04615844 2.31952633 1.81465286 ... 0.         0.         0.        ]
 [3.96012334 4.78718275 4.84273991 ... 5.99543869 6.08122726 5.58161319]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.97886969 0.         1.17560436 ... 1.67962092 0.         1.22630936]
 [0.         1.22067729 0.         ... 0.         0.         0.        ]]
(3000, 8349)

and subsequently my PCA matrix

Y = pd.read_csv("/home/l.kuijpers/data/public/10.1101-2020.01.26.919753/bone_marrow/results/dimensional_reduction/pca_features3000_pcs30.txt.gz",sep="\t",header=0, index_col=0)
print(Y.shape)
print(Y.head())

(8349, 30)
        PC1       PC2       PC3       PC4       PC5       PC6       PC7  \
1  7.369592 -1.360963  1.097998  2.538751 -0.803748 -1.658145  1.625979   
2  5.079902 -0.444026 -1.050102 -3.063973  6.900803 -2.164398 -1.502465   
3 -4.539927 -2.935868  0.547350 -2.989443 -0.463145  2.130490 -0.775152   
4 -3.899764 -4.301646  0.405952 -3.153753 -0.564598  2.496788 -0.302337   
5 -3.137664 -4.409472 -0.364785 -2.772346 -0.838818  3.240335 -0.750826   

        PC8       PC9      PC10  ...      PC21      PC22      PC23      PC24  \
1  1.191162  1.149435 -3.081490  ...  1.097954 -0.373919 -0.461403  1.691415   
2  0.028241  0.983676 -1.439774  ...  2.216994 -0.158708 -2.091726  0.317186   
3 -2.985047 -0.613448  1.119282  ...  1.208274 -1.175501  1.174555  0.303052   
4 -4.086857 -1.073136  1.894842  ... -0.243761  0.028752 -0.658735 -0.579332   
5 -3.523591 -1.629615  1.326169  ...  1.418103 -0.182936 -0.260362 -0.149169   

       PC25      PC26      PC27      PC28      PC29      PC30  
1  0.439184 -0.530643  1.338581  0.625986 -0.038665 -1.683088  
2 -1.285308  3.414698 -0.559210  0.472800 -1.886184  0.357121  
3 -0.811011 -0.012366 -2.686822  0.714846 -1.455445  1.618813  
4  0.321042 -0.567756 -0.647596  0.452455  0.327157 -0.813338  
5 -0.127874  0.177051 -0.347021  0.495207  0.066801  0.100596

i then setup the pacmap object and attempt to use pacmap.fit_transform

embedding = pacmap.PaCMAP(n_components=2, n_neighbors=10, MN_ratio=0.5, FP_ratio=2.0, random_state=42) 
X_transformed = embedding.fit_transform(X, init=Y)

But get this error

The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()

I have tried to make it a numpy array using .to_numpy but then get the following error.

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Maybe i am just doing something stupid
Could you give advice on how to do something like this?

Thanks in advance.

Answer 1 · 2023-12-03T22:06:54.000Z

Hi there, thank you for your interest in PaCMAP, and apologies for my late reply.
I've noticed a few potential problems in the scripts and outputs you presented:

It seems like you are trying to run the embedding with a pandas DataFrame. At this moment, PaCMAP only accepts numpy.ndarray as the input, so you need to convert the array via .to_numpy() for both X and Y, just as you've done. The to_numpy error is weird to me. Would you be able to elaborate it? It is important to keep the brackets -- otherwise it may get interpreted as a functor.
You need to ensure that the matrix has the correct shape. Suppose you have a dataset of 8349 cells and 3000 features, then X should be a matrix of the shape (8349, 3000) instead of (3000, 8349), so you will need to use X = X.T after converting to numpy.ndarray to perform a transpose.
The initialization matrix should also has the same shape as the desired output. In this case, assume you would like the output to be 2D, then you will need to truncate the Y matrix via Y = Y[:, :2], to keep the first two dimensions after the PCA.

Best,
Haiyang

Answer 2 · 2023-12-05T15:29:39.000Z

Hey Haiyang,
Thanks for your reply, no worries about the speed.

I will first elaborate my goal maybe a bit clearer which you might be able to adress immediately. I have multiple single cell datasets for which i perform batch correction with multibatchPCA and MNN. I want to visualize these datasets together with PaCMAP. If i've understood the paper correctly you can do this. however i want to provide the correct multibatchPCA as input and not let it perform its own PCA.

This is what i currently have tried:

My dataframe (this is a test extracted from the anndata, but i treat the large dataset exactly the same way)


print(x.T.head())

x = x.T.to_numpy()
print(x)

cell         PBMC#AAACCTGAGGAACTGC-3  PBMC#AAACCTGAGGAACTGC-6  \
WASH7P                           0.0                      0.0   
MIR6859-1                        0.0                      0.0   
MIR1302-2HG                      0.0                      0.0   
MIR1302-2                        0.0                      0.0   
FAM138A                          0.0                      0.0   

cell         PBMC#AAACCTGAGGAACTGC-7  PBMC#AAACCTGCATGCCCGA-0  \
WASH7P                           0.0                      0.0   
MIR6859-1                        0.0                      0.0   
MIR1302-2HG                      0.0                      0.0   
MIR1302-2                        0.0                      0.0   
FAM138A                          0.0                      0.0   

cell         PBMC#AAACCTGCATGCCCGA-6  PBMC#AAACCTGCATGCCCGA-7  \
WASH7P                           0.0                      0.0   
MIR6859-1                        0.0                      0.0   
MIR1302-2HG                      0.0                      0.0   
MIR1302-2                        0.0                      0.0   
FAM138A                          0.0                      0.0   

cell         PBMC#AAACCTGGTGTGACGA-1  PBMC#AAACCTGGTGTGACGA-2  \
WASH7P                           0.0                      0.0   
MIR6859-1                        0.0                      0.0   
MIR1302-2HG                      0.0                      0.0   
...
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]```

the PCA matrix
```Y = pd.read_csv("/home/l.kuijpers/data/public/10.1101-2020.01.26.919753/bone_marrow/results/dimensional_reduction/UMAP/batch_correction_sample/pca_features3000_pcs30.txt.gz",sep="\t",header=0, index_col=0)

print(Y.iloc[:5,:5])

Y_two = Y.to_numpy()

print(Y_two)

                           PC1    PC2    PC3    PC4    PC5
cell                                                      
PBMC#AAACCTGAGGAACTGC-0  6.017 -3.399 -2.185  2.842  0.399
PBMC#AAACCTGAGGAACTGC-3  4.048 -3.922 -1.826  2.715 -0.371
PBMC#AAACCTGAGGAACTGC-6  5.812 -3.574 -1.920  3.184 -0.353
PBMC#AAACCTGAGGAACTGC-7  5.342 -3.927 -0.973  2.925  0.017
PBMC#AAACCTGCATGCCCGA-0 -1.679 -2.984 -1.909 -3.848 -0.261
[[ 6.017e+00 -3.399e+00 -2.185e+00 ... -6.090e-01 -1.295e+00  5.620e-01]
 [ 4.048e+00 -3.922e+00 -1.826e+00 ...  8.000e-03 -1.580e+00 -6.000e-03]
 [ 5.812e+00 -3.574e+00 -1.920e+00 ... -4.050e-01 -1.122e+00  8.250e-01]
 ...
 [ 3.620e-01 -1.218e+00 -3.789e+00 ...  1.265e+00 -6.690e-01 -9.620e-01]
 [ 7.320e-01 -1.441e+00 -2.313e+00 ...  1.762e+00 -2.730e-01 -6.000e-01]
 [ 1.241e+00 -2.100e+00 -4.307e+00 ...  1.085e+00  2.370e-01 -4.360e-01]]```

creating the embedding
`embedding = pacmap.PaCMAP(n_components=2, n_neighbors=10, MN_ratio=0.5, FP_ratio=2.0, random_state=42, apply_pca=False)`

then initialization
This works, and i get a nice graph
`X_transformed = embedding.fit_transform(x)`

This does not, the error is the same (or similar) when i use a pandas.dataframe
```XY_transformed = embedding.fit_transform(x, init=Y_two)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/PaCMAP_testing.ipynb Cell 8 line 3
  PaCMAP_testing.ipynb#X11sdnNjb2RlLXJlbW90ZQ%3D%3D?line=0) X_transformed = embedding.fit_transform(x)
----> [3](vscode-notebook-cell://ssh-remote%2B172.25.84.250/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/PaCMAP_testing.ipynb#X11sdnNjb2RlLXJlbW90ZQ%3D%3D?line=2) XY_transformed = embedding.fit_transform(x, init=Y_two)
      [5](vscode-notebook-cell://ssh-remote%2B172.25.84.250/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/PaCMAP_testing.ipynb#X11sdnNjb2RlLXJlbW90ZQ%3D%3D?line=4) XY_transformed = embedding.fit_transform(x, init=Y)

File [~/miniconda3/envs/Renv/lib/python3.11/site-packages/pacmap/pacmap.py:943](https://vscode-remote+ssh-002dremote-002b172-002e25-002e84-002e250.vscode-resource.vscode-cdn.net/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/~/miniconda3/envs/Renv/lib/python3.11/site-packages/pacmap/pacmap.py:943), in PaCMAP.fit_transform(self, X, init, save_pairs)
    [925](https://vscode-remote+ssh-002dremote-002b172-002e25-002e84-002e250.vscode-resource.vscode-cdn.net/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/~/miniconda3/envs/Renv/lib/python3.11/site-packages/pacmap/pacmap.py:925) def fit_transform(self, X, init=None, save_pairs=True):
    [926](https://vscode-remote+ssh-002dremote-002b172-002e25-002e84-002e250.vscode-resource.vscode-cdn.net/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/~/miniconda3/envs/Renv/lib/python3.11/site-packages/pacmap/pacmap.py:926)     '''Projects a high dimensional dataset into a low-dimensional embedding and return the embedding.
    [927](https://vscode-remote+ssh-002dremote-002b172-002e25-002e84-002e250.vscode-resource.vscode-cdn.net/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/~/miniconda3/envs/Renv/lib/python3.11/site-packages/pacmap/pacmap.py:927) 
    [928](https://vscode-remote+ssh-002dremote-002b172-002e25-002e84-002e250.vscode-resource.vscode-cdn.net/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/~/miniconda3/envs/Renv/lib/python3.11/site-packages/pacmap/pacmap.py:928)     Parameters
   (...)
    [940](https://vscode-remote+ssh-002dremote-002b172-002e25-002e84-002e250.vscode-resource.vscode-cdn.net/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/~/miniconda3/envs/Renv/lib/python3.11/site-packages/pacmap/pacmap.py:940)         Whether to save the pairs that are sampled from the dataset. Useful for reproducing results.
    [941](https://vscode-remote+ssh-002dremote-002b172-002e25-002e84-002e250.vscode-resource.vscode-cdn.net/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/~/miniconda3/envs/Renv/lib/python3.11/site-packages/pacmap/pacmap.py:941)     '''
--> [943](https://vscode-remote+ssh-002dremote-002b172-002e25-002e84-002e250.vscode-resource.vscode-cdn.net/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/~/miniconda3/envs/Renv/lib/python3.11/site-packages/pacmap/pacmap.py:943)     self.fit(X, init, save_pairs)
    [944](https://vscode-remote+ssh-002dremote-002b172-002e25-002e84-002e250.vscode-resource.vscode-cdn.net/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/~/miniconda3/envs/Renv/lib/python3.11/site-packages/pacmap/pacmap.py:944)     if self.intermediate:
    [945](https://vscode-remote+ssh-002dremote-002b172-002e25-002e84-002e250.vscode-resource.vscode-cdn.net/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/~/miniconda3/envs/Renv/lib/python3.11/site-packages/pacmap/pacmap.py:945)         return self.intermediate_states

File [~/miniconda3/envs/Renv/lib/python3.11/site-packages/pacmap/pacmap.py:906](https://vscode-remote+ssh-002dremote-002b172-002e25-002e84-002e250.vscode-resource.vscode-cdn.net/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/~/miniconda3/envs/Renv/lib/python3.11/site-packages/pacmap/pacmap.py:906), in PaCMAP.fit(self, X, init, save_pairs)
    [904](https://vscode-remote+ssh-002dremote-002b172-002e25-002e84-002e250.vscode-resource.vscode-cdn.net/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/~/miniconda3/envs/Renv/lib/python3.11/site-packages/pacmap/pacmap.py:904) self.num_dimensions = X.shape[1]
    [905](https://vscode-remote+ssh-002dremote-002b172-002e25-002e84-002e250.vscode-resource.vscode-cdn.net/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/~/miniconda3/envs/Renv/lib/python3.11/site-packages/pacmap/pacmap.py:905) # Initialize and Optimize the embedding
--> [906](https://vscode-remote+ssh-002dremote-002b172-002e25-002e84-002e250.vscode-resource.vscode-cdn.net/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/~/miniconda3/envs/Renv/lib/python3.11/site-packages/pacmap/pacmap.py:906) self.embedding_, self.intermediate_states, self.pair_neighbors, self.pair_MN, self.pair_FP = pacmap(
    [907](https://vscode-remote+ssh-002dremote-002b172-002e25-002e84-002e250.vscode-resource.vscode-cdn.net/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/~/miniconda3/envs/Renv/lib/python3.11/site-packages/pacmap/pacmap.py:907)     X,
    [908](https://vscode-remote+ssh-002dremote-002b172-002e25-002e84-002e250.vscode-resource.vscode-cdn.net/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/~/miniconda3/envs/Renv/lib/python3.11/site-packages/pacmap/pacmap.py:908)     self.n_components,
...
--> [539](https://vscode-remote+ssh-002dremote-002b172-002e25-002e84-002e250.vscode-resource.vscode-cdn.net/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/~/miniconda3/envs/Renv/lib/python3.11/site-packages/pacmap/pacmap.py:539) if (Yinit is None or Yinit == "pca"):
    [540](https://vscode-remote+ssh-002dremote-002b172-002e25-002e84-002e250.vscode-resource.vscode-cdn.net/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/~/miniconda3/envs/Renv/lib/python3.11/site-packages/pacmap/pacmap.py:540)     if pca_solution:
    [541](https://vscode-remote+ssh-002dremote-002b172-002e25-002e84-002e250.vscode-resource.vscode-cdn.net/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/~/miniconda3/envs/Renv/lib/python3.11/site-packages/pacmap/pacmap.py:541)         Y = 0.01 * X[:, :n_dims]

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Thanks for the transposing tip, I saw it being done in the tutorials but wasnt sure wether i needed to.
Hope this gives you the info you want.

Kind regards,
Lucas Kuijpers

Answer 3 · 2023-12-07T21:40:23.000Z

I see what you are trying to do here! To avoid PaCMAP to perform PCA during construction of its pairs, you just need to set apply_pca=False during the initialization. To use user-supplied matrix for initialization is a bit tricky now, as the implementation contains a bug. I will try to fix this as soon as possible. At the meantime, you're welcomed to use 'random' initialization or the 'pca' initialization. They are not the optimal solution you would like to see, but since PaCMAP is mostly robust, the result should not be too bad.

Answer 4 · 2023-12-13T23:52:28.000Z

Temporarily solved this issue via a hotfix. You will need to clone this repository, switch to the hotfix branch, and then reinstall PaCMAP via the following command

git clone https://github.com/YingfanWang/PaCMAP.git
git checkout hotfix  # Switch to the hotfix branch
pip uninstall pacmap  # remove the old install
cd PaCMAP
pip install ./

Answer 5 · 2023-12-14T06:33:33.000Z

thanks for the update, i will try and let you know if it worked.

Just to be sure, The approach i was taking is the right one if you want to visualize a dataset that required batch correction?

Answer 6 · 2023-12-15T14:13:24.000Z

Hey Yingfan,

the old error dissapeared, however i am running into an even worse one.

>>> pacMap = embedding.fit_transform(expr_mtx, init=pca_mtx)
Segmentation fault (core dumped)

To allow you to recreate the error i have uploaded my script and data to Dropbox.

There are a few questions i have besides this;

according to your demo you can specify your own nn distance matrix of shape (n, n_neighbors_extra).
However when i am doing that like this

sc.pp.neighbors(adata,n_neighbors=args["knn"], n_pcs=adata.obsm["X_pca"].shape[1])
dist_mtx = adata.obsp["distances"]

embedding = pacmap.PaCMAP(n_components=2, n_neighbors=args["knn"], MN_ratio=args["MN_ratio"], FP_ratio=args["FP_ratio"], random_state=42, apply_pca=False, pair_neighbors = dist_mtx)

I get the error that it is not the correct shape.

I have tried to run the script without init on the embedding.fit_transform() which works and i get a graph as results.
When running with init = pca_mtx i get the segmentation default error.

Finally i do not understand exactly what you mean with making the init matrix only 2 dimensions. I would like to use all 30 PC's (that have been batch corrected) as initialization.
Maybe i am missing the point completely though.

Thanks for the continued help!
Hope to hear from you.

P.S. I am also not a Bioinformatician or mathematician/programmer and therefor make silly mistakes.