scverse/anndata

Only one color in umap after adata.obs.sort_index(inplace=True)

malonzm1 opened this issue · 11 comments

Please make sure these conditions are met

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of anndata.
  • (optional) I have confirmed this bug exists on the master branch of anndata.

Report

After I run adata.obs.sort_index(inplace=True) then,

sc.pp.highly_variable_genes(
    adata,
    n_top_genes=2500,
    subset=True,
    layer="counts",
    flavor="seurat_v3",
)
sc.tl.pca(adata)
sc.pp.neighbors(adata, n_pcs=30, n_neighbors=20)
sc.tl.umap(adata, min_dist=0.3)
sc.pl.umap(
    adata,
    color=["gse"],
    frameon=False,
    save='_10x_gse.png'
)

I get the following umap which only has a single color:
umap_10x_gse

Versions

session_info        1.0.0
-----
argcomplete         NA
asttokens           NA
backcall            0.2.0
colorama            0.4.6
cython_runtime      NA
dateutil            2.8.2
decorator           5.1.1
exceptiongroup      1.1.3
executing           1.2.0
gmpy2               2.1.2
h5py                3.10.0
jedi                0.19.1
mpmath              1.3.0
natsort             8.4.0
numexpr             2.8.7
numpy               1.24.4
opt_einsum          v3.3.0
packaging           23.2
pandas              2.1.1
parso               0.8.3
pexpect             4.8.0
pickleshare         0.7.5
prompt_toolkit      3.0.39
ptyprocess          0.7.0
pure_eval           0.2.2
pygments            2.16.1
pytz                2023.3
scipy               1.11.3
six                 1.16.0
stack_data          0.6.2
sympy               1.12
torch               2.0.0
tqdm                4.66.1
traitlets           5.11.2
typing_extensions   NA
wcwidth             0.2.8
zoneinfo            NA
-----
Python 3.9.18 | packaged by conda-forge | (main, Aug 30 2023, 03:49:32) [GCC 12.3.0]
Linux-3.10.0-1160.108.1.el7.x86_64-x86_64-with-glibc2.17
-----
Session information updated at 2024-02-28 09:04

Thanks for opening the report.

I don't think that you are actually only getting one color in your UMAP. I think that's the colors are now basically random because the obs dataframe is no longer aligned to the rest of the object. Your code is sorting obs, but leaving the rest of the object in the original order.

I think you should be doing something more like:

adata = adata[adata.obs_names.sort_values(), :].copy()

Does this resolve your issue?

Thanks. When I use adata = adata[adata.obs_names.sort_values(), :].copy() and I try to save it with adata.write_h5ad I get the following error:

ValueError: Output dtype not compatible with inputs.

Could you share the full traceback? Can you also share environment info with anndata imported?

/scratch/work/malonzm1/.conda_envs/scvi-env/lib/python3.9/site-packages/anndata/_core/anndata.py:1906: UserWarning: Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
  utils.warn_names_duplicates("obs")
/scratch/work/malonzm1/.conda_envs/scvi-env/lib/python3.9/site-packages/anndata/utils.py:260: UserWarning: Suffix used (-[0-9]+) to deduplicate index values may make index values difficult to interpret. There values with a similar suffixes in the index. Consider using a different delimiter by passing `join={delimiter}`Example key collisions generated by the make_index_unique algorithm: ['AAACGGGGTTCATGGT-1-1', 'AAAGCAAAGATGCCTT-1-1', 'AAAGCAAAGATGCCTT-1-2', 'AAAGCAAAGATGCCTT-1-3', 'AACACGTCAGCCTTGG-1-1']
  warnings.warn(
Traceback (most recent call last):
  File "/scratch/cs/csb/users/malonzm1/scripts/pan-autoimmune/scripts/scvi/combine_10x_0.py", line 17, in <module>
    adata = adata[adata.obs_names.sort_values(), :].copy()
  File "/scratch/work/malonzm1/.conda_envs/scvi-env/lib/python3.9/site-packages/anndata/_core/anndata.py", line 1592, in copy
    X=_subset(self._adata_ref.X, (self._oidx, self._vidx)).copy()
  File "/scratch/work/malonzm1/.conda_envs/scvi-env/lib/python3.9/functools.py", line 888, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/scratch/work/malonzm1/.conda_envs/scvi-env/lib/python3.9/site-packages/anndata/_core/index.py", line 177, in _subset_spmatrix
    return a[subset_idx]
  File "/scratch/work/malonzm1/.conda_envs/scvi-env/lib/python3.9/site-packages/scipy/sparse/_index.py", line 78, in __getitem__
    return self._get_arrayXslice(row, col)
  File "/scratch/work/malonzm1/.conda_envs/scvi-env/lib/python3.9/site-packages/scipy/sparse/_csr.py", line 330, in _get_arrayXslice
    return self._major_index_fancy(row)._get_submatrix(minor=col)
  File "/scratch/work/malonzm1/.conda_envs/scvi-env/lib/python3.9/site-packages/scipy/sparse/_compressed.py", line 708, in _major_index_fancy
    csr_row_index(M, indices, self.indptr, self.indices, self.data,
ValueError: Output dtype not compatible with inputs.
srun: error: fn3: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=29260470.0
session-info
session_info        1.0.0
-----
argcomplete         NA
asttokens           NA
backcall            0.2.0
colorama            0.4.6
cython_runtime      NA
dateutil            2.8.2
decorator           5.1.1
exceptiongroup      1.1.3
executing           1.2.0
gmpy2               2.1.2
h5py                3.10.0
jedi                0.19.1
mpmath              1.3.0
natsort             8.4.0
numexpr             2.8.7
numpy               1.24.4
opt_einsum          v3.3.0
packaging           23.2
pandas              2.1.1
parso               0.8.3
pexpect             4.8.0
pickleshare         0.7.5
prompt_toolkit      3.0.39
ptyprocess          0.7.0
pure_eval           0.2.2
pygments            2.16.1
pytz                2023.3
scipy               1.11.3
six                 1.16.0
stack_data          0.6.2
sympy               1.12
torch               2.0.0
tqdm                4.66.1
traitlets           5.11.2
typing_extensions   NA
wcwidth             0.2.8
zoneinfo            NA
-----
Python 3.9.18 | packaged by conda-forge | (main, Aug 30 2023, 03:49:32) [GCC 12.3.0]
Linux-3.10.0-1160.108.1.el7.x86_64-x86_64-with-glibc2.17
-----
Session information updated at 2024-03-05 04:17

When reporting your environment with session-info, anndata needs to be imported for it and its dependencies to show up. E.g. you need to run something like:

import anndata, session_info; session_info.show(html=False, dependencies=True)

I think this is a related to:

Which you can work around by making sure each sparse matrix has an indptr and indices of the same dtype. E.g.

X_int64_idx = adata.X.copy()
X_int64_idx.indptr  = X_int64_idx.indptr.astype(np.int64)
X_int64_idx.indices = X_int64_idx.indices.astype(np.int64)
adata.X = X_int64_idx

Any idea how you ended up with those arrays having a different dtype? What size is this array?

I did run import anndata, session_info; session_info.show(html=False, dependencies=True).

No idea. The array is 3428130 × 36601.

I did run import anndata, session_info; session_info.show(html=False, dependencies=True).

That's odd. It should be showing the version of anndata in the session info report. Could be worth reporting this to session-info.


Once you get over np.iinfo(np.int32).max (2,147,483,647) non-zero entries in your matrix, theintptr and indices of your array start needing to be 64 bit to accommodate all the values. Let me know if the work around I proposed fixed this issue for you.

Thanks @ivirshup. I tried

X_int64_idx = adata[adata.obs_names.sort_values(), :].copy()
X_int64_idx.indptr  = X_int64_idx.indptr.astype(np.int64)
X_int64_idx.indices = X_int64_idx.indices.astype(np.int64)
adata = X_int64_idx
adata.write_h5ad(filename='temp.h5ad')

which generated the same error.

Can you share the code and possibly h5ad? Letting me know which version of anndata you are using would be helpful too.

Also you have to upcast the indices and indptr before doing the indexing. As you can see in the first line of your traceback (assuming this is the same)

Traceback (most recent call last):
  File "/scratch/cs/csb/users/malonzm1/scripts/pan-autoimmune/scripts/scvi/combine_10x_0.py", line 17, in <module>
    adata = adata[adata.obs_names.sort_values(), :].copy()

The error is coming from the subset operation, not the indexing operation.

So, the code should be more like:

# Upcast indices and indptr
X_int64_idx = adata.X.copy()
X_int64_idx.indptr  = X_int64_idx.indptr.astype(np.int64)
X_int64_idx.indices = X_int64_idx.indices.astype(np.int64)
adata.X = X_int64_idx

# Re-order
adata = adata[adata.obs_names.sort_values(), :].copy()

# Write
adata.write_h5ad(filename='temp.h5ad')

It worked. Thanks!