PMBio/MuDataSeurat

`mu.read` ValueError

mdmanurung opened this issue · 2 comments

Dear author,

I encountered the following issue upon reading a mudata object that was converted from Seurat:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [6], in <module>
----> 1 mdata2 = mu.read("data/processed/mudata.h5mu/ADT")

File /exports/para-lipg-hpc/mdmanurung/conda/envs/scanpy/lib/python3.8/site-packages/mudata/_core/io.py:409, in read(filename, **kwargs)
    406     return read_h5mu(filepath, **kwargs)
    407 elif m[3] == "":
    408     # .h5mu/<modality>
--> 409     return read_h5ad(filepath, m[2], **kwargs)
    410 elif m[2] == "mod":
    411     # .h5mu/mod/<modality>
    412     return read_h5ad(filepath, m[3], **kwargs)

File /exports/para-lipg-hpc/mdmanurung/conda/envs/scanpy/lib/python3.8/site-packages/mudata/_core/io.py:372, in read_h5ad(filename, mod, backed)
    370 with h5py.File(filename, hdf5_mode) as f_root:
    371     f = f_root["mod"][mod]
--> 372     return _read_h5mu_mod(f, manager, backed)

File /exports/para-lipg-hpc/mdmanurung/conda/envs/scanpy/lib/python3.8/site-packages/mudata/_core/io.py:320, in _read_h5mu_mod(g, manager, backed)
    318     elif k != "raw":
    319         d[k] = read_attribute(g[k])
--> 320 ad = AnnData(**d)
    321 if manager is not None:
    322     ad.file = AnnDataFileManager(ad, os.path.basename(g.name), manager)

File /exports/para-lipg-hpc/mdmanurung/conda/envs/scanpy/lib/python3.8/site-packages/anndata/_core/anndata.py:308, in AnnData.__init__(self, X, obs, var, uns, obsm, varm, layers, raw, dtype, shape, filename, filemode, asview, obsp, varp, oidx, vidx)
    306     self._init_as_view(X, oidx, vidx)
    307 else:
--> 308     self._init_as_actual(
    309         X=X,
    310         obs=obs,
    311         var=var,
    312         uns=uns,
    313         obsm=obsm,
    314         varm=varm,
    315         raw=raw,
    316         layers=layers,
    317         dtype=dtype,
    318         shape=shape,
    319         obsp=obsp,
    320         varp=varp,
    321         filename=filename,
    322         filemode=filemode,
    323     )

File /exports/para-lipg-hpc/mdmanurung/conda/envs/scanpy/lib/python3.8/site-packages/anndata/_core/anndata.py:526, in AnnData._init_as_actual(self, X, obs, var, uns, obsm, varm, varp, obsp, raw, layers, dtype, shape, filename, filemode)
    523 # Backwards compat for connectivities matrices in uns["neighbors"]
    524 _move_adj_mtx({"uns": self._uns, "obsp": self._obsp})
--> 526 self._check_dimensions()
    527 self._check_uniqueness()
    529 if self.filename:

File /exports/para-lipg-hpc/mdmanurung/conda/envs/scanpy/lib/python3.8/site-packages/anndata/_core/anndata.py:1837, in AnnData._check_dimensions(self, key)
   1835     key = {key}
   1836 if "obs" in key and len(self._obs) != self._n_obs:
-> 1837     raise ValueError(
   1838         "Observations annot. `obs` must have number of rows of `X`"
   1839         f" ({self._n_obs}), but has {self._obs.shape[0]} rows."
   1840     )
   1841 if "var" in key and len(self._var) != self._n_vars:
   1842     raise ValueError(
   1843         "Variables annot. `var` must have number of columns of `X`"
   1844         f" ({self._n_vars}), but has {self._var.shape[0]} rows."
   1845     )

ValueError: Observations annot. `obs` must have number of rows of `X` (163), but has 62773 rows.

I then tried to load each modality one by one. I could load my RNA data, but not my ADT. My ADT data has 163 features in it. For both modalities, I have 62773 observations.

Considering that, I am a bit confused by the error. Why would obs of my ADT data expect 163 rows, which should be the number of features?

Thanks for taking the time.

Regards,
Mikhael

Why would obs of my ADT data expect 163 rows, which should be the number of features?

This is due to how R vs Numpy store matrices. R stores them in column-major order, whereas Numpy by default stores them in row-major order. The internal memory representation is preserved when writing to HDF5, so matrices end up transposed, unless we transpose them ourselves prior to writing.

Please try this patch:

index 9565f12..7d30222 100644
--- a/R/WriteH5MU.R
+++ b/R/WriteH5MU.R
@@ -83,7 +83,7 @@ WriteH5ADHelper <- function(object, assay, root, global = FALSE) {
               write_sparse_matrix(data_group, x_data, sparse_type)
             } else {
               # dense matrix
-              raw_group$create_dataset("X", x_data)
+              raw_group$create_dataset("X", t(x_data))
             }
             # .raw has to contain .var as well
             raw_var_group <- raw_group$create_group("var")

If this doesn't work, a sample of a Seurat object that has this problem would help.

I've tested the patch but I still see the same error. Strangely, the error disappears upon forcing my ADT matrix to a sparse matrix using Seurat::as.sparse. Now I can load my mudata file.