PMBio/MuDataSeurat

String categories written by MuDataSeurat are read in as bytes by anndata

ivirshup opened this issue · 0 comments

Using the same setup in #5, with the fix that closed it:

library(Seurat)
library(SeuratData)
library(MuDataSeurat)  # provides WriteH5AD

suppressWarnings(SeuratData::InstallData("pbmc3k", force.reinstall = FALSE))
suppressWarnings(data("pbmc3k"))
seuratObj <- suppressWarnings(pbmc3k)

WriteH5AD(seuratObj, "mudata_seurat.h5ad")

Then, reading the file in Python:

import anndata as ad

a = ad.read_h5ad("./mudata_seurat.h5ad")
a.obs
              orig.ident  nCount_RNA  nFeature_RNA seurat_annotations
AAACATACAACCAC  b'pbmc3k'      2419.0           779    b'Memory CD4 T'
AAACATTGAGCTAC  b'pbmc3k'      4903.0          1352               b'B'
AAACATTGATCAGC  b'pbmc3k'      3147.0          1129    b'Memory CD4 T'
AAACCGTGCTTCCG  b'pbmc3k'      2639.0           960      b'CD14+ Mono'
AAACCGTGTATGCG  b'pbmc3k'       980.0           521              b'NK'
...                   ...         ...           ...                ...
TTTCGAACTCTCAT  b'pbmc3k'      3459.0          1153      b'CD14+ Mono'
TTTCTACTGAGGCA  b'pbmc3k'      3443.0          1224               b'B'
TTTCTACTTCCTCG  b'pbmc3k'      1684.0           622               b'B'
TTTGCATGAGAGGC  b'pbmc3k'      1022.0           452               b'B'
TTTGCATGCCTCAC  b'pbmc3k'      1984.0           723     b'Naive CD4 T'

The categorical columns (`orig.ident`, `seurat_annotations`) should be read in as strings, not bytes. While you're at it, I would also suggest writing the more recent dataframe and categorical format, where everything is more self-contained and annotated.
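Until the writer is fixed, a reader-side workaround could look something like the sketch below. It is only an illustration: `decode_bytes_columns` is a hypothetical helper (not part of anndata or MuDataSeurat), it assumes the byte strings are UTF-8, and the small `obs` frame is a stand-in for `a.obs` with a few rows copied from the output above.

```python
import pandas as pd


def decode_bytes_columns(df: pd.DataFrame, encoding: str = "utf-8") -> pd.DataFrame:
    """Return a copy of df with bytes categories/values decoded to str.

    Hypothetical helper for illustration; assumes the bytes are valid `encoding`.
    """
    out = df.copy()
    for col in out.columns:
        s = out[col]
        if isinstance(s.dtype, pd.CategoricalDtype):
            cats = s.cat.categories
            # Only touch categoricals whose categories are actually bytes.
            if any(isinstance(c, bytes) for c in cats):
                out[col] = s.cat.rename_categories(
                    [c.decode(encoding) if isinstance(c, bytes) else c for c in cats]
                )
        elif s.dtype == object:
            # Plain object columns may also hold bytes values.
            out[col] = s.map(
                lambda v: v.decode(encoding) if isinstance(v, bytes) else v
            )
    return out


# Stand-in for a.obs, mimicking the bytes categories shown above.
obs = pd.DataFrame(
    {
        "orig.ident": pd.Categorical([b"pbmc3k", b"pbmc3k", b"pbmc3k"]),
        "nCount_RNA": [2419.0, 4903.0, 3147.0],
        "seurat_annotations": pd.Categorical([b"Memory CD4 T", b"B", b"Memory CD4 T"]),
    },
    index=["AAACATACAACCAC", "AAACATTGAGCTAC", "AAACATTGATCAGC"],
)

fixed = decode_bytes_columns(obs)
```

With a real file this would be applied as `a.obs = decode_bytes_columns(a.obs)` after `ad.read_h5ad(...)`. It only papers over the problem on the Python side; the file on disk still stores the categories in a form that is read back as bytes.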