Only one color in umap after adata.obs.sort_index(inplace=True)
malonzm1 opened this issue · 11 comments
Please make sure these conditions are met
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of anndata.
- (optional) I have confirmed this bug exists on the master branch of anndata.
Report
After I run adata.obs.sort_index(inplace=True)
then,
sc.pp.highly_variable_genes(
adata,
n_top_genes=2500,
subset=True,
layer="counts",
flavor="seurat_v3",
)
sc.tl.pca(adata)
sc.pp.neighbors(adata, n_pcs=30, n_neighbors=20)
sc.tl.umap(adata, min_dist=0.3)
sc.pl.umap(
adata,
color=["gse"],
frameon=False,
save='_10x_gse.png'
)
I get the following umap which only has a single color:
Versions
session_info 1.0.0
-----
argcomplete NA
asttokens NA
backcall 0.2.0
colorama 0.4.6
cython_runtime NA
dateutil 2.8.2
decorator 5.1.1
exceptiongroup 1.1.3
executing 1.2.0
gmpy2 2.1.2
h5py 3.10.0
jedi 0.19.1
mpmath 1.3.0
natsort 8.4.0
numexpr 2.8.7
numpy 1.24.4
opt_einsum v3.3.0
packaging 23.2
pandas 2.1.1
parso 0.8.3
pexpect 4.8.0
pickleshare 0.7.5
prompt_toolkit 3.0.39
ptyprocess 0.7.0
pure_eval 0.2.2
pygments 2.16.1
pytz 2023.3
scipy 1.11.3
six 1.16.0
stack_data 0.6.2
sympy 1.12
torch 2.0.0
tqdm 4.66.1
traitlets 5.11.2
typing_extensions NA
wcwidth 0.2.8
zoneinfo NA
-----
Python 3.9.18 | packaged by conda-forge | (main, Aug 30 2023, 03:49:32) [GCC 12.3.0]
Linux-3.10.0-1160.108.1.el7.x86_64-x86_64-with-glibc2.17
-----
Session information updated at 2024-02-28 09:04
Thanks for opening the report.
I don't think that you are actually only getting one color in your UMAP. I think that's the colors are now basically random because the obs
dataframe is no longer aligned to the rest of the object. Your code is sorting obs
, but leaving the rest of the object in the original order.
I think you should be doing something more like:
adata = adata[adata.obs_names.sort_values(), :].copy()
Does this resolve your issue?
Thanks. When I use adata = adata[adata.obs_names.sort_values(), :].copy()
and I try to save it with adata.write_h5ad
I get the following error:
ValueError: Output dtype not compatible with inputs.
Could you share the full traceback? Can you also share environment info with anndata imported?
/scratch/work/malonzm1/.conda_envs/scvi-env/lib/python3.9/site-packages/anndata/_core/anndata.py:1906: UserWarning: Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
utils.warn_names_duplicates("obs")
/scratch/work/malonzm1/.conda_envs/scvi-env/lib/python3.9/site-packages/anndata/utils.py:260: UserWarning: Suffix used (-[0-9]+) to deduplicate index values may make index values difficult to interpret. There values with a similar suffixes in the index. Consider using a different delimiter by passing `join={delimiter}`Example key collisions generated by the make_index_unique algorithm: ['AAACGGGGTTCATGGT-1-1', 'AAAGCAAAGATGCCTT-1-1', 'AAAGCAAAGATGCCTT-1-2', 'AAAGCAAAGATGCCTT-1-3', 'AACACGTCAGCCTTGG-1-1']
warnings.warn(
Traceback (most recent call last):
File "/scratch/cs/csb/users/malonzm1/scripts/pan-autoimmune/scripts/scvi/combine_10x_0.py", line 17, in <module>
adata = adata[adata.obs_names.sort_values(), :].copy()
File "/scratch/work/malonzm1/.conda_envs/scvi-env/lib/python3.9/site-packages/anndata/_core/anndata.py", line 1592, in copy
X=_subset(self._adata_ref.X, (self._oidx, self._vidx)).copy()
File "/scratch/work/malonzm1/.conda_envs/scvi-env/lib/python3.9/functools.py", line 888, in wrapper
return dispatch(args[0].__class__)(*args, **kw)
File "/scratch/work/malonzm1/.conda_envs/scvi-env/lib/python3.9/site-packages/anndata/_core/index.py", line 177, in _subset_spmatrix
return a[subset_idx]
File "/scratch/work/malonzm1/.conda_envs/scvi-env/lib/python3.9/site-packages/scipy/sparse/_index.py", line 78, in __getitem__
return self._get_arrayXslice(row, col)
File "/scratch/work/malonzm1/.conda_envs/scvi-env/lib/python3.9/site-packages/scipy/sparse/_csr.py", line 330, in _get_arrayXslice
return self._major_index_fancy(row)._get_submatrix(minor=col)
File "/scratch/work/malonzm1/.conda_envs/scvi-env/lib/python3.9/site-packages/scipy/sparse/_compressed.py", line 708, in _major_index_fancy
csr_row_index(M, indices, self.indptr, self.indices, self.data,
ValueError: Output dtype not compatible with inputs.
srun: error: fn3: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=29260470.0
session-info
session_info 1.0.0
-----
argcomplete NA
asttokens NA
backcall 0.2.0
colorama 0.4.6
cython_runtime NA
dateutil 2.8.2
decorator 5.1.1
exceptiongroup 1.1.3
executing 1.2.0
gmpy2 2.1.2
h5py 3.10.0
jedi 0.19.1
mpmath 1.3.0
natsort 8.4.0
numexpr 2.8.7
numpy 1.24.4
opt_einsum v3.3.0
packaging 23.2
pandas 2.1.1
parso 0.8.3
pexpect 4.8.0
pickleshare 0.7.5
prompt_toolkit 3.0.39
ptyprocess 0.7.0
pure_eval 0.2.2
pygments 2.16.1
pytz 2023.3
scipy 1.11.3
six 1.16.0
stack_data 0.6.2
sympy 1.12
torch 2.0.0
tqdm 4.66.1
traitlets 5.11.2
typing_extensions NA
wcwidth 0.2.8
zoneinfo NA
-----
Python 3.9.18 | packaged by conda-forge | (main, Aug 30 2023, 03:49:32) [GCC 12.3.0]
Linux-3.10.0-1160.108.1.el7.x86_64-x86_64-with-glibc2.17
-----
Session information updated at 2024-03-05 04:17
When reporting your environment with session-info
, anndata
needs to be imported for it and its dependencies to show up. E.g. you need to run something like:
import anndata, session_info; session_info.show(html=False, dependencies=True)
I think this is a related to:
Which you can work around by making sure each sparse matrix has an indptr
and indices
of the same dtype. E.g.
X_int64_idx = adata.X.copy()
X_int64_idx.indptr = X_int64_idx.indptr.astype(np.int64)
X_int64_idx.indices = X_int64_idx.indices.astype(np.int64)
adata.X = X_int64_idx
Any idea how you ended up with those arrays having a different dtype
? What size is this array?
I did run import anndata, session_info; session_info.show(html=False, dependencies=True)
.
No idea. The array is 3428130 × 36601.
I did run import anndata, session_info; session_info.show(html=False, dependencies=True).
That's odd. It should be showing the version of anndata
in the session info report. Could be worth reporting this to session-info
.
Once you get over np.iinfo(np.int32).max
(2,147,483,647) non-zero entries in your matrix, theintptr
and indices
of your array start needing to be 64 bit to accommodate all the values. Let me know if the work around I proposed fixed this issue for you.
Thanks @ivirshup. I tried
X_int64_idx = adata[adata.obs_names.sort_values(), :].copy()
X_int64_idx.indptr = X_int64_idx.indptr.astype(np.int64)
X_int64_idx.indices = X_int64_idx.indices.astype(np.int64)
adata = X_int64_idx
adata.write_h5ad(filename='temp.h5ad')
which generated the same error.
Can you share the code and possibly h5ad
? Letting me know which version of anndata
you are using would be helpful too.
Also you have to upcast the indices
and indptr
before doing the indexing. As you can see in the first line of your traceback (assuming this is the same)
Traceback (most recent call last):
File "/scratch/cs/csb/users/malonzm1/scripts/pan-autoimmune/scripts/scvi/combine_10x_0.py", line 17, in <module>
adata = adata[adata.obs_names.sort_values(), :].copy()
The error is coming from the subset operation, not the indexing operation.
So, the code should be more like:
# Upcast indices and indptr
X_int64_idx = adata.X.copy()
X_int64_idx.indptr = X_int64_idx.indptr.astype(np.int64)
X_int64_idx.indices = X_int64_idx.indices.astype(np.int64)
adata.X = X_int64_idx
# Re-order
adata = adata[adata.obs_names.sort_values(), :].copy()
# Write
adata.write_h5ad(filename='temp.h5ad')
It worked. Thanks!