Only one color in umap after adata.obs.sort_index(inplace=True)

Question

Only one color in umap after adata.obs.sort_index(inplace=True)

malonzm1 opened this issue 7 months ago · 11 comments

Please make sure these conditions are met

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of anndata.
(optional) I have confirmed this bug exists on the master branch of anndata.

Report

After I run adata.obs.sort_index(inplace=True) then,

sc.pp.highly_variable_genes(
    adata,
    n_top_genes=2500,
    subset=True,
    layer="counts",
    flavor="seurat_v3",
)
sc.tl.pca(adata)
sc.pp.neighbors(adata, n_pcs=30, n_neighbors=20)
sc.tl.umap(adata, min_dist=0.3)
sc.pl.umap(
    adata,
    color=["gse"],
    frameon=False,
    save='_10x_gse.png'
)

I get the following umap which only has a single color:

Versions

session_info        1.0.0
-----
argcomplete         NA
asttokens           NA
backcall            0.2.0
colorama            0.4.6
cython_runtime      NA
dateutil            2.8.2
decorator           5.1.1
exceptiongroup      1.1.3
executing           1.2.0
gmpy2               2.1.2
h5py                3.10.0
jedi                0.19.1
mpmath              1.3.0
natsort             8.4.0
numexpr             2.8.7
numpy               1.24.4
opt_einsum          v3.3.0
packaging           23.2
pandas              2.1.1
parso               0.8.3
pexpect             4.8.0
pickleshare         0.7.5
prompt_toolkit      3.0.39
ptyprocess          0.7.0
pure_eval           0.2.2
pygments            2.16.1
pytz                2023.3
scipy               1.11.3
six                 1.16.0
stack_data          0.6.2
sympy               1.12
torch               2.0.0
tqdm                4.66.1
traitlets           5.11.2
typing_extensions   NA
wcwidth             0.2.8
zoneinfo            NA
-----
Python 3.9.18 | packaged by conda-forge | (main, Aug 30 2023, 03:49:32) [GCC 12.3.0]
Linux-3.10.0-1160.108.1.el7.x86_64-x86_64-with-glibc2.17
-----
Session information updated at 2024-02-28 09:04

Answer 1 · 2024-03-01T15:26:38.000Z

Thanks for opening the report.

I don't think that you are actually only getting one color in your UMAP. I think that's the colors are now basically random because the obs dataframe is no longer aligned to the rest of the object. Your code is sorting obs, but leaving the rest of the object in the original order.

I think you should be doing something more like:

adata = adata[adata.obs_names.sort_values(), :].copy()

Does this resolve your issue?

Answer 2 · 2024-03-02T04:13:59.000Z

Thanks. When I use adata = adata[adata.obs_names.sort_values(), :].copy() and I try to save it with adata.write_h5ad I get the following error:

ValueError: Output dtype not compatible with inputs.

Answer 3 · 2024-03-04T10:58:06.000Z

Could you share the full traceback? Can you also share environment info with anndata imported?

Answer 4 · 2024-03-05T02:19:42.000Z

/scratch/work/malonzm1/.conda_envs/scvi-env/lib/python3.9/site-packages/anndata/_core/anndata.py:1906: UserWarning: Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
  utils.warn_names_duplicates("obs")
/scratch/work/malonzm1/.conda_envs/scvi-env/lib/python3.9/site-packages/anndata/utils.py:260: UserWarning: Suffix used (-[0-9]+) to deduplicate index values may make index values difficult to interpret. There values with a similar suffixes in the index. Consider using a different delimiter by passing `join={delimiter}`Example key collisions generated by the make_index_unique algorithm: ['AAACGGGGTTCATGGT-1-1', 'AAAGCAAAGATGCCTT-1-1', 'AAAGCAAAGATGCCTT-1-2', 'AAAGCAAAGATGCCTT-1-3', 'AACACGTCAGCCTTGG-1-1']
  warnings.warn(
Traceback (most recent call last):
  File "/scratch/cs/csb/users/malonzm1/scripts/pan-autoimmune/scripts/scvi/combine_10x_0.py", line 17, in <module>
    adata = adata[adata.obs_names.sort_values(), :].copy()
  File "/scratch/work/malonzm1/.conda_envs/scvi-env/lib/python3.9/site-packages/anndata/_core/anndata.py", line 1592, in copy
    X=_subset(self._adata_ref.X, (self._oidx, self._vidx)).copy()
  File "/scratch/work/malonzm1/.conda_envs/scvi-env/lib/python3.9/functools.py", line 888, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/scratch/work/malonzm1/.conda_envs/scvi-env/lib/python3.9/site-packages/anndata/_core/index.py", line 177, in _subset_spmatrix
    return a[subset_idx]
  File "/scratch/work/malonzm1/.conda_envs/scvi-env/lib/python3.9/site-packages/scipy/sparse/_index.py", line 78, in __getitem__
    return self._get_arrayXslice(row, col)
  File "/scratch/work/malonzm1/.conda_envs/scvi-env/lib/python3.9/site-packages/scipy/sparse/_csr.py", line 330, in _get_arrayXslice
    return self._major_index_fancy(row)._get_submatrix(minor=col)
  File "/scratch/work/malonzm1/.conda_envs/scvi-env/lib/python3.9/site-packages/scipy/sparse/_compressed.py", line 708, in _major_index_fancy
    csr_row_index(M, indices, self.indptr, self.indices, self.data,
ValueError: Output dtype not compatible with inputs.
srun: error: fn3: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=29260470.0

session-info

session_info        1.0.0
-----
argcomplete         NA
asttokens           NA
backcall            0.2.0
colorama            0.4.6
cython_runtime      NA
dateutil            2.8.2
decorator           5.1.1
exceptiongroup      1.1.3
executing           1.2.0
gmpy2               2.1.2
h5py                3.10.0
jedi                0.19.1
mpmath              1.3.0
natsort             8.4.0
numexpr             2.8.7
numpy               1.24.4
opt_einsum          v3.3.0
packaging           23.2
pandas              2.1.1
parso               0.8.3
pexpect             4.8.0
pickleshare         0.7.5
prompt_toolkit      3.0.39
ptyprocess          0.7.0
pure_eval           0.2.2
pygments            2.16.1
pytz                2023.3
scipy               1.11.3
six                 1.16.0
stack_data          0.6.2
sympy               1.12
torch               2.0.0
tqdm                4.66.1
traitlets           5.11.2
typing_extensions   NA
wcwidth             0.2.8
zoneinfo            NA
-----
Python 3.9.18 | packaged by conda-forge | (main, Aug 30 2023, 03:49:32) [GCC 12.3.0]
Linux-3.10.0-1160.108.1.el7.x86_64-x86_64-with-glibc2.17
-----
Session information updated at 2024-03-05 04:17

Answer 5 · 2024-03-05T10:22:12.000Z

When reporting your environment with session-info, anndata needs to be imported for it and its dependencies to show up. E.g. you need to run something like:

import anndata, session_info; session_info.show(html=False, dependencies=True)

I think this is a related to:

scipy/scipy#20182

Which you can work around by making sure each sparse matrix has an indptr and indices of the same dtype. E.g.

X_int64_idx = adata.X.copy()
X_int64_idx.indptr  = X_int64_idx.indptr.astype(np.int64)
X_int64_idx.indices = X_int64_idx.indices.astype(np.int64)
adata.X = X_int64_idx

Any idea how you ended up with those arrays having a different dtype? What size is this array?

Answer 6 · 2024-03-05T10:27:22.000Z

I did run import anndata, session_info; session_info.show(html=False, dependencies=True).

No idea. The array is 3428130 × 36601.

Answer 7 · 2024-03-05T10:50:20.000Z

I did run import anndata, session_info; session_info.show(html=False, dependencies=True).

That's odd. It should be showing the version of anndata in the session info report. Could be worth reporting this to session-info.

Once you get over np.iinfo(np.int32).max (2,147,483,647) non-zero entries in your matrix, theintptr and indices of your array start needing to be 64 bit to accommodate all the values. Let me know if the work around I proposed fixed this issue for you.

Answer 8 · 2024-03-06T07:15:51.000Z

Thanks @ivirshup. I tried

X_int64_idx = adata[adata.obs_names.sort_values(), :].copy()
X_int64_idx.indptr  = X_int64_idx.indptr.astype(np.int64)
X_int64_idx.indices = X_int64_idx.indices.astype(np.int64)
adata = X_int64_idx
adata.write_h5ad(filename='temp.h5ad')

which generated the same error.

Answer 9 · 2024-03-06T13:07:41.000Z

Can you share the code and possibly h5ad? Letting me know which version of anndata you are using would be helpful too.

Answer 10 · 2024-03-06T13:16:36.000Z

Also you have to upcast the indices and indptr before doing the indexing. As you can see in the first line of your traceback (assuming this is the same)

Traceback (most recent call last):
  File "/scratch/cs/csb/users/malonzm1/scripts/pan-autoimmune/scripts/scvi/combine_10x_0.py", line 17, in <module>
    adata = adata[adata.obs_names.sort_values(), :].copy()

The error is coming from the subset operation, not the indexing operation.

So, the code should be more like:

# Upcast indices and indptr
X_int64_idx = adata.X.copy()
X_int64_idx.indptr  = X_int64_idx.indptr.astype(np.int64)
X_int64_idx.indices = X_int64_idx.indices.astype(np.int64)
adata.X = X_int64_idx

# Re-order
adata = adata[adata.obs_names.sort_values(), :].copy()

# Write
adata.write_h5ad(filename='temp.h5ad')

Answer 11 · 2024-03-08T07:31:22.000Z

It worked. Thanks!