init with user specified matrix
Opened this issue · 6 comments
Hello,
I have a question regardin the initializing with a user specified matrix.
Currently i am creating a pipeline that takes multiple single cell samples and i am correting for batches using the R batchelor which uses multibatchPCA and MNN for batch correction.
Currently i am still using a single dataset to setup the pipeline. However, i run into trouble trying to initialize with my PCA matrix
first i load my dataset (logcounts)
X = pd.read_csv("/home/l.kuijpers/data/public/10.1101-2020.01.26.919753/bone_marrow/results/dimensional_reduction/hvgs_filt_sce_features3000.txt.gz",sep="\t", index_col=0,header=0).to_numpy()
print(X)
print(X.shape)
[[4.71391225 4.47825845 4.31347248 ... 1.07156968 0. 1.22630936]
[5.04615844 2.31952633 1.81465286 ... 0. 0. 0. ]
[3.96012334 4.78718275 4.84273991 ... 5.99543869 6.08122726 5.58161319]
...
[0. 0. 0. ... 0. 0. 0. ]
[0.97886969 0. 1.17560436 ... 1.67962092 0. 1.22630936]
[0. 1.22067729 0. ... 0. 0. 0. ]]
(3000, 8349)
and subsequently my PCA matrix
Y = pd.read_csv("/home/l.kuijpers/data/public/10.1101-2020.01.26.919753/bone_marrow/results/dimensional_reduction/pca_features3000_pcs30.txt.gz",sep="\t",header=0, index_col=0)
print(Y.shape)
print(Y.head())
(8349, 30)
PC1 PC2 PC3 PC4 PC5 PC6 PC7 \
1 7.369592 -1.360963 1.097998 2.538751 -0.803748 -1.658145 1.625979
2 5.079902 -0.444026 -1.050102 -3.063973 6.900803 -2.164398 -1.502465
3 -4.539927 -2.935868 0.547350 -2.989443 -0.463145 2.130490 -0.775152
4 -3.899764 -4.301646 0.405952 -3.153753 -0.564598 2.496788 -0.302337
5 -3.137664 -4.409472 -0.364785 -2.772346 -0.838818 3.240335 -0.750826
PC8 PC9 PC10 ... PC21 PC22 PC23 PC24 \
1 1.191162 1.149435 -3.081490 ... 1.097954 -0.373919 -0.461403 1.691415
2 0.028241 0.983676 -1.439774 ... 2.216994 -0.158708 -2.091726 0.317186
3 -2.985047 -0.613448 1.119282 ... 1.208274 -1.175501 1.174555 0.303052
4 -4.086857 -1.073136 1.894842 ... -0.243761 0.028752 -0.658735 -0.579332
5 -3.523591 -1.629615 1.326169 ... 1.418103 -0.182936 -0.260362 -0.149169
PC25 PC26 PC27 PC28 PC29 PC30
1 0.439184 -0.530643 1.338581 0.625986 -0.038665 -1.683088
2 -1.285308 3.414698 -0.559210 0.472800 -1.886184 0.357121
3 -0.811011 -0.012366 -2.686822 0.714846 -1.455445 1.618813
4 0.321042 -0.567756 -0.647596 0.452455 0.327157 -0.813338
5 -0.127874 0.177051 -0.347021 0.495207 0.066801 0.100596
i then setup the pacmap object and attempt to use pacmap.fit_transform
embedding = pacmap.PaCMAP(n_components=2, n_neighbors=10, MN_ratio=0.5, FP_ratio=2.0, random_state=42)
X_transformed = embedding.fit_transform(X, init=Y)
But get this error
The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
I have tried to make it a numpy array using .to_numpy
but then get the following error.
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Maybe i am just doing something stupid
Could you give advice on how to do something like this?
Thanks in advance.
Hi there, thank you for your interest in PaCMAP, and apologies for my late reply.
I've noticed a few potential problems in the scripts and outputs you presented:
- It seems like you are trying to run the embedding with a pandas DataFrame. At this moment, PaCMAP only accepts
numpy.ndarray
as the input, so you need to convert the array via.to_numpy()
for bothX
andY
, just as you've done. Theto_numpy
error is weird to me. Would you be able to elaborate it? It is important to keep the brackets -- otherwise it may get interpreted as a functor. - You need to ensure that the matrix has the correct shape. Suppose you have a dataset of 8349 cells and 3000 features, then X should be a matrix of the shape (8349, 3000) instead of (3000, 8349), so you will need to use
X = X.T
after converting tonumpy.ndarray
to perform a transpose. - The initialization matrix should also has the same shape as the desired output. In this case, assume you would like the output to be 2D, then you will need to truncate the
Y
matrix viaY = Y[:, :2]
, to keep the first two dimensions after the PCA.
Best,
Haiyang
Hey Haiyang,
Thanks for your reply, no worries about the speed.
I will first elaborate my goal maybe a bit clearer which you might be able to adress immediately. I have multiple single cell datasets for which i perform batch correction with multibatchPCA and MNN. I want to visualize these datasets together with PaCMAP. If i've understood the paper correctly you can do this. however i want to provide the correct multibatchPCA as input and not let it perform its own PCA.
This is what i currently have tried:
My dataframe (this is a test extracted from the anndata, but i treat the large dataset exactly the same way)
print(x.T.head())
x = x.T.to_numpy()
print(x)
cell PBMC#AAACCTGAGGAACTGC-3 PBMC#AAACCTGAGGAACTGC-6 \
WASH7P 0.0 0.0
MIR6859-1 0.0 0.0
MIR1302-2HG 0.0 0.0
MIR1302-2 0.0 0.0
FAM138A 0.0 0.0
cell PBMC#AAACCTGAGGAACTGC-7 PBMC#AAACCTGCATGCCCGA-0 \
WASH7P 0.0 0.0
MIR6859-1 0.0 0.0
MIR1302-2HG 0.0 0.0
MIR1302-2 0.0 0.0
FAM138A 0.0 0.0
cell PBMC#AAACCTGCATGCCCGA-6 PBMC#AAACCTGCATGCCCGA-7 \
WASH7P 0.0 0.0
MIR6859-1 0.0 0.0
MIR1302-2HG 0.0 0.0
MIR1302-2 0.0 0.0
FAM138A 0.0 0.0
cell PBMC#AAACCTGGTGTGACGA-1 PBMC#AAACCTGGTGTGACGA-2 \
WASH7P 0.0 0.0
MIR6859-1 0.0 0.0
MIR1302-2HG 0.0 0.0
...
...
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]]```
the PCA matrix
```Y = pd.read_csv("/home/l.kuijpers/data/public/10.1101-2020.01.26.919753/bone_marrow/results/dimensional_reduction/UMAP/batch_correction_sample/pca_features3000_pcs30.txt.gz",sep="\t",header=0, index_col=0)
print(Y.iloc[:5,:5])
Y_two = Y.to_numpy()
print(Y_two)
PC1 PC2 PC3 PC4 PC5
cell
PBMC#AAACCTGAGGAACTGC-0 6.017 -3.399 -2.185 2.842 0.399
PBMC#AAACCTGAGGAACTGC-3 4.048 -3.922 -1.826 2.715 -0.371
PBMC#AAACCTGAGGAACTGC-6 5.812 -3.574 -1.920 3.184 -0.353
PBMC#AAACCTGAGGAACTGC-7 5.342 -3.927 -0.973 2.925 0.017
PBMC#AAACCTGCATGCCCGA-0 -1.679 -2.984 -1.909 -3.848 -0.261
[[ 6.017e+00 -3.399e+00 -2.185e+00 ... -6.090e-01 -1.295e+00 5.620e-01]
[ 4.048e+00 -3.922e+00 -1.826e+00 ... 8.000e-03 -1.580e+00 -6.000e-03]
[ 5.812e+00 -3.574e+00 -1.920e+00 ... -4.050e-01 -1.122e+00 8.250e-01]
...
[ 3.620e-01 -1.218e+00 -3.789e+00 ... 1.265e+00 -6.690e-01 -9.620e-01]
[ 7.320e-01 -1.441e+00 -2.313e+00 ... 1.762e+00 -2.730e-01 -6.000e-01]
[ 1.241e+00 -2.100e+00 -4.307e+00 ... 1.085e+00 2.370e-01 -4.360e-01]]```
creating the embedding
`embedding = pacmap.PaCMAP(n_components=2, n_neighbors=10, MN_ratio=0.5, FP_ratio=2.0, random_state=42, apply_pca=False)`
then initialization
This works, and i get a nice graph
`X_transformed = embedding.fit_transform(x)`
This does not, the error is the same (or similar) when i use a pandas.dataframe
```XY_transformed = embedding.fit_transform(x, init=Y_two)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/PaCMAP_testing.ipynb Cell 8 line 3
PaCMAP_testing.ipynb#X11sdnNjb2RlLXJlbW90ZQ%3D%3D?line=0) X_transformed = embedding.fit_transform(x)
----> [3](vscode-notebook-cell://ssh-remote%2B172.25.84.250/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/PaCMAP_testing.ipynb#X11sdnNjb2RlLXJlbW90ZQ%3D%3D?line=2) XY_transformed = embedding.fit_transform(x, init=Y_two)
[5](vscode-notebook-cell://ssh-remote%2B172.25.84.250/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/PaCMAP_testing.ipynb#X11sdnNjb2RlLXJlbW90ZQ%3D%3D?line=4) XY_transformed = embedding.fit_transform(x, init=Y)
File [~/miniconda3/envs/Renv/lib/python3.11/site-packages/pacmap/pacmap.py:943](https://vscode-remote+ssh-002dremote-002b172-002e25-002e84-002e250.vscode-resource.vscode-cdn.net/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/~/miniconda3/envs/Renv/lib/python3.11/site-packages/pacmap/pacmap.py:943), in PaCMAP.fit_transform(self, X, init, save_pairs)
[925](https://vscode-remote+ssh-002dremote-002b172-002e25-002e84-002e250.vscode-resource.vscode-cdn.net/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/~/miniconda3/envs/Renv/lib/python3.11/site-packages/pacmap/pacmap.py:925) def fit_transform(self, X, init=None, save_pairs=True):
[926](https://vscode-remote+ssh-002dremote-002b172-002e25-002e84-002e250.vscode-resource.vscode-cdn.net/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/~/miniconda3/envs/Renv/lib/python3.11/site-packages/pacmap/pacmap.py:926) '''Projects a high dimensional dataset into a low-dimensional embedding and return the embedding.
[927](https://vscode-remote+ssh-002dremote-002b172-002e25-002e84-002e250.vscode-resource.vscode-cdn.net/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/~/miniconda3/envs/Renv/lib/python3.11/site-packages/pacmap/pacmap.py:927)
[928](https://vscode-remote+ssh-002dremote-002b172-002e25-002e84-002e250.vscode-resource.vscode-cdn.net/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/~/miniconda3/envs/Renv/lib/python3.11/site-packages/pacmap/pacmap.py:928) Parameters
(...)
[940](https://vscode-remote+ssh-002dremote-002b172-002e25-002e84-002e250.vscode-resource.vscode-cdn.net/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/~/miniconda3/envs/Renv/lib/python3.11/site-packages/pacmap/pacmap.py:940) Whether to save the pairs that are sampled from the dataset. Useful for reproducing results.
[941](https://vscode-remote+ssh-002dremote-002b172-002e25-002e84-002e250.vscode-resource.vscode-cdn.net/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/~/miniconda3/envs/Renv/lib/python3.11/site-packages/pacmap/pacmap.py:941) '''
--> [943](https://vscode-remote+ssh-002dremote-002b172-002e25-002e84-002e250.vscode-resource.vscode-cdn.net/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/~/miniconda3/envs/Renv/lib/python3.11/site-packages/pacmap/pacmap.py:943) self.fit(X, init, save_pairs)
[944](https://vscode-remote+ssh-002dremote-002b172-002e25-002e84-002e250.vscode-resource.vscode-cdn.net/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/~/miniconda3/envs/Renv/lib/python3.11/site-packages/pacmap/pacmap.py:944) if self.intermediate:
[945](https://vscode-remote+ssh-002dremote-002b172-002e25-002e84-002e250.vscode-resource.vscode-cdn.net/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/~/miniconda3/envs/Renv/lib/python3.11/site-packages/pacmap/pacmap.py:945) return self.intermediate_states
File [~/miniconda3/envs/Renv/lib/python3.11/site-packages/pacmap/pacmap.py:906](https://vscode-remote+ssh-002dremote-002b172-002e25-002e84-002e250.vscode-resource.vscode-cdn.net/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/~/miniconda3/envs/Renv/lib/python3.11/site-packages/pacmap/pacmap.py:906), in PaCMAP.fit(self, X, init, save_pairs)
[904](https://vscode-remote+ssh-002dremote-002b172-002e25-002e84-002e250.vscode-resource.vscode-cdn.net/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/~/miniconda3/envs/Renv/lib/python3.11/site-packages/pacmap/pacmap.py:904) self.num_dimensions = X.shape[1]
[905](https://vscode-remote+ssh-002dremote-002b172-002e25-002e84-002e250.vscode-resource.vscode-cdn.net/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/~/miniconda3/envs/Renv/lib/python3.11/site-packages/pacmap/pacmap.py:905) # Initialize and Optimize the embedding
--> [906](https://vscode-remote+ssh-002dremote-002b172-002e25-002e84-002e250.vscode-resource.vscode-cdn.net/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/~/miniconda3/envs/Renv/lib/python3.11/site-packages/pacmap/pacmap.py:906) self.embedding_, self.intermediate_states, self.pair_neighbors, self.pair_MN, self.pair_FP = pacmap(
[907](https://vscode-remote+ssh-002dremote-002b172-002e25-002e84-002e250.vscode-resource.vscode-cdn.net/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/~/miniconda3/envs/Renv/lib/python3.11/site-packages/pacmap/pacmap.py:907) X,
[908](https://vscode-remote+ssh-002dremote-002b172-002e25-002e84-002e250.vscode-resource.vscode-cdn.net/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/~/miniconda3/envs/Renv/lib/python3.11/site-packages/pacmap/pacmap.py:908) self.n_components,
...
--> [539](https://vscode-remote+ssh-002dremote-002b172-002e25-002e84-002e250.vscode-resource.vscode-cdn.net/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/~/miniconda3/envs/Renv/lib/python3.11/site-packages/pacmap/pacmap.py:539) if (Yinit is None or Yinit == "pca"):
[540](https://vscode-remote+ssh-002dremote-002b172-002e25-002e84-002e250.vscode-resource.vscode-cdn.net/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/~/miniconda3/envs/Renv/lib/python3.11/site-packages/pacmap/pacmap.py:540) if pca_solution:
[541](https://vscode-remote+ssh-002dremote-002b172-002e25-002e84-002e250.vscode-resource.vscode-cdn.net/home/l.kuijpers/scripts/single_cell_snakemake/dimensional_reduction/~/miniconda3/envs/Renv/lib/python3.11/site-packages/pacmap/pacmap.py:541) Y = 0.01 * X[:, :n_dims]
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Thanks for the transposing tip, I saw it being done in the tutorials but wasnt sure wether i needed to.
Hope this gives you the info you want.
Kind regards,
Lucas Kuijpers
I see what you are trying to do here! To avoid PaCMAP to perform PCA during construction of its pairs, you just need to set apply_pca=False
during the initialization. To use user-supplied matrix for initialization is a bit tricky now, as the implementation contains a bug. I will try to fix this as soon as possible. At the meantime, you're welcomed to use 'random' initialization or the 'pca' initialization. They are not the optimal solution you would like to see, but since PaCMAP is mostly robust, the result should not be too bad.
Temporarily solved this issue via a hotfix. You will need to clone this repository, switch to the hotfix
branch, and then reinstall PaCMAP via the following command
git clone https://github.com/YingfanWang/PaCMAP.git
git checkout hotfix # Switch to the hotfix branch
pip uninstall pacmap # remove the old install
cd PaCMAP
pip install ./
thanks for the update, i will try and let you know if it worked.
Just to be sure, The approach i was taking is the right one if you want to visualize a dataset that required batch correction?
Hey Yingfan,
the old error dissapeared, however i am running into an even worse one.
>>> pacMap = embedding.fit_transform(expr_mtx, init=pca_mtx)
Segmentation fault (core dumped)
To allow you to recreate the error i have uploaded my script and data to Dropbox.
There are a few questions i have besides this;
- according to your demo you can specify your own nn distance matrix of shape (n, n_neighbors_extra).
However when i am doing that like this
sc.pp.neighbors(adata,n_neighbors=args["knn"], n_pcs=adata.obsm["X_pca"].shape[1])
dist_mtx = adata.obsp["distances"]
embedding = pacmap.PaCMAP(n_components=2, n_neighbors=args["knn"], MN_ratio=args["MN_ratio"], FP_ratio=args["FP_ratio"], random_state=42, apply_pca=False, pair_neighbors = dist_mtx)
I get the error that it is not the correct shape.
- I have tried to run the script without init on the
embedding.fit_transform()
which works and i get a graph as results.
When running withinit = pca_mtx
i get the segmentation default error.
Finally i do not understand exactly what you mean with making the init matrix only 2 dimensions. I would like to use all 30 PC's (that have been batch corrected) as initialization.
Maybe i am missing the point completely though.
Thanks for the continued help!
Hope to hear from you.
P.S. I am also not a Bioinformatician or mathematician/programmer and therefor make silly mistakes.