scverse/scanpy

"invalid shape in fixed-type tuple" when saving adata after ranking genes

dylkot opened this issue · 9 comments

I'm using Scanpy with the following software versions:

python==3.7
scanpy==1.4.4
numpy==1.17.2
anndata==0.6.22.post1

on Ubuntu 18.04. I am able to save my AnnData object just fine with

sc.write(results_file, adata)

and to load it again with

adata = sc.read(results_file)

however if I save it after I run the command

sc.tl.rank_genes_groups(adata, 'louvain12_lab', method='wilcoxon')

the AnnData object will save but when I try to reload it, I get an error message:

ValueError                                Traceback (most recent call last)
<ipython-input-141-159082f1696f> in <module>
      1 results_file = os.path.join(adir, '{project}.count_{count}.gene_{gene}.mito_{mito}.HVGs_{nhvgs}.TPT.{log}.scale.TEST.h5ad'.format(project=project_name, count=count_thresh, gene=gene_thresh, mito=mitothresh, nhvgs=nhvgs, log=logstatus))
      2 print(results_file)
----> 3 adata = sc.read(results_file)

/opt/miniconda3/envs/py37/lib/python3.7/site-packages/scanpy/readwrite.py in read(filename, backed, sheet, ext, delimiter, first_column_names, backup_url, cache, **kwargs)
     95             filename, backed=backed, sheet=sheet, ext=ext,
     96             delimiter=delimiter, first_column_names=first_column_names,
---> 97             backup_url=backup_url, cache=cache, **kwargs,
     98         )
     99     # generate filename and read to dict

/opt/miniconda3/envs/py37/lib/python3.7/site-packages/scanpy/readwrite.py in _read(filename, backed, sheet, ext, delimiter, first_column_names, backup_url, cache, suppress_cache_warning, **kwargs)
    497     if ext in {'h5', 'h5ad'}:
    498         if sheet is None:
--> 499             return read_h5ad(filename, backed=backed)
    500         else:
    501             logg.debug(f'reading sheet {sheet} from file {filename}')

/opt/miniconda3/envs/py37/lib/python3.7/site-packages/anndata/readwrite/read.py in read_h5ad(filename, backed, chunk_size)
    445     else:
    446         # load everything into memory
--> 447         constructor_args = _read_args_from_h5ad(filename=filename, chunk_size=chunk_size)
    448         X = constructor_args[0]
    449         dtype = None

/opt/miniconda3/envs/py37/lib/python3.7/site-packages/anndata/readwrite/read.py in _read_args_from_h5ad(adata, filename, mode, chunk_size)
    484             d[key] = None
    485         else:
--> 486             _read_key_value_from_h5(f, d, key, chunk_size=chunk_size)
    487     # backwards compat: save X with the correct name
    488     if 'X' not in d:

/opt/miniconda3/envs/py37/lib/python3.7/site-packages/anndata/readwrite/read.py in _read_key_value_from_h5(f, d, key, key_write, chunk_size)
    508         d[key_write] = OrderedDict() if key == 'uns' else {}
    509         for k in f[key].keys():
--> 510             _read_key_value_from_h5(f, d[key_write], key + '/' + k, k, chunk_size)
    511         return
    512 

/opt/miniconda3/envs/py37/lib/python3.7/site-packages/anndata/readwrite/read.py in _read_key_value_from_h5(f, d, key, key_write, chunk_size)
    508         d[key_write] = OrderedDict() if key == 'uns' else {}
    509         for k in f[key].keys():
--> 510             _read_key_value_from_h5(f, d[key_write], key + '/' + k, k, chunk_size)
    511         return
    512 

/opt/miniconda3/envs/py37/lib/python3.7/site-packages/anndata/readwrite/read.py in _read_key_value_from_h5(f, d, key, key_write, chunk_size)
    542         return key, value
    543 
--> 544     key, value = postprocess_reading(key, value)
    545     d[key_write] = value
    546     return

/opt/miniconda3/envs/py37/lib/python3.7/site-packages/anndata/readwrite/read.py in postprocess_reading(key, value)
    539             new_dtype = [((dt[0], 'U{}'.format(int(int(dt[1][2:])/4)))
    540                           if dt[1][1] == 'S' else dt) for dt in value.dtype.descr]
--> 541             value = value.astype(new_dtype)
    542         return key, value
    543 

ValueError: invalid shape in fixed-type tuple.

Any idea what is going on or what I can do to make it past this error? It only started happening after I updated my operating system to Ubuntu 18 and my Python to 3.7 and reinstalled scanpy from conda.

Hi Dylan,

This is an issue with the new h5py package, which @ivirshup already fixed on master (928d475). For now, you can downgrade your h5py package to 2.9.0 using pip install h5py==2.9.0 as a workaround.

@gokceneraslan - thanks for the fast response. This broke our (cellxgene) travis pipeline as well. Do you have any info on eta for a fix/workaround other than pinning the module version? TY!

I'm having the same error with h5py==2.9.0. Cellxgene doesn't seem to be working with the object that I created the object with scanpy 1.4.3+116.g0075c62. I can however load it again with that version. But when I downgrade to 1.3.7 (recommendation from @mbuttner who had the same cellxgene issue) I can no longer load the object and get the above error.

Back in the 1.4.3 dev version scanpy it no longer writes the object after loading, and gives me the following error:

In [23]: adata.write("cellxgene.h5ad")                                                                                                                                                         
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-23-33b15d710f71> in <module>
----> 1 adata.write("cellxgene.h5ad")

~/new_anndata/anndata/anndata/core/anndata.py in write_h5ad(self, filename, compression, compression_opts, force_dense)
   2222             compression=compression,
   2223             compression_opts=compression_opts,
-> 2224             force_dense=force_dense,
   2225         )
   2226 

~/new_anndata/anndata/anndata/readwrite/h5ad.py in write_h5ad(filepath, adata, force_dense, dataset_kwargs, **kwargs)
     90         write_attribute(f, "varp", adata.varp, dataset_kwargs)
     91         write_attribute(f, "layers", adata.layers, dataset_kwargs)
---> 92         write_attribute(f, "uns", adata.uns, dataset_kwargs)
     93         write_attribute(f, "raw", adata.raw, dataset_kwargs)
     94     if adata.isbacked:

~/new_anndata/anndata/anndata/readwrite/h5ad.py in write_attribute(f, key, value, dataset_kwargs)
    103     if key in f:
    104         del f[key]
--> 105     _write_method(type(value))(f, key, value, dataset_kwargs)
    106 
    107 

~/new_anndata/anndata/anndata/readwrite/h5ad.py in write_mapping(f, key, value, dataset_kwargs)
    203 def write_mapping(f, key, value, dataset_kwargs=MappingProxyType({})):
    204     for sub_key, sub_value in value.items():
--> 205         write_attribute(f, f"{key}/{sub_key}", sub_value, dataset_kwargs)
    206 
    207 

~/new_anndata/anndata/anndata/readwrite/h5ad.py in write_attribute(f, key, value, dataset_kwargs)
    103     if key in f:
    104         del f[key]
--> 105     _write_method(type(value))(f, key, value, dataset_kwargs)
    106 
    107 

~/new_anndata/anndata/anndata/readwrite/h5ad.py in write_mapping(f, key, value, dataset_kwargs)
    203 def write_mapping(f, key, value, dataset_kwargs=MappingProxyType({})):
    204     for sub_key, sub_value in value.items():
--> 205         write_attribute(f, f"{key}/{sub_key}", sub_value, dataset_kwargs)
    206 
    207 

~/new_anndata/anndata/anndata/readwrite/h5ad.py in write_attribute(f, key, value, dataset_kwargs)
    103     if key in f:
    104         del f[key]
--> 105     _write_method(type(value))(f, key, value, dataset_kwargs)
    106 
    107 

~/new_anndata/anndata/anndata/readwrite/h5ad.py in write_array(f, key, value, dataset_kwargs)
    152     elif value.dtype.names is not None:
    153         value = _to_hdf5_vlen_strings(value)
--> 154     f.create_dataset(key, data=value, **dataset_kwargs)
    155 
    156 

~/new_anndata/anndata/anndata/h5py/h5sparse.py in create_dataset(self, name, data, chunk_size, **kwargs)
    151         if not isinstance(data, SparseDataset) and not ss.issparse(data):
    152             return self.h5py_group.create_dataset(
--> 153                 name=name, data=data, **kwargs
    154             )
    155         if self.force_dense:

~/anaconda3/envs/sc-tutorial/lib/python3.6/site-packages/h5py/_hl/group.py in create_dataset(self, name, shape, dtype, data, **kwds)
    134 
    135         with phil:
--> 136             dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
    137             dset = dataset.Dataset(dsid)
    138             if name is not None:

~/anaconda3/envs/sc-tutorial/lib/python3.6/site-packages/h5py/_hl/dataset.py in make_new_dset(parent, shape, dtype, data, chunks, compression, shuffle, fletcher32, maxshape, compression_opts, fillvalue, scaleoffset, track_times, external, track_order, dcpl)
    116         else:
    117             dtype = numpy.dtype(dtype)
--> 118         tid = h5t.py_create(dtype, logical=1)
    119 
    120     # Legacy

h5py/h5t.pyx in h5py.h5t.py_create()

h5py/h5t.pyx in h5py.h5t.py_create()

h5py/h5t.pyx in h5py.h5t.py_create()

h5py/h5t.pyx in h5py.h5t._c_compound()

h5py/h5t.pyx in h5py.h5t.py_create()

h5py/h5t.pyx in h5py.h5t.py_create()

TypeError: Object dtype dtype('O') has no native HDF5 equivalent

Everything however seems to work fine when I throw out the rank_genes_groups results from adata.uns

Edit: actually cellxgene still isn't working, but I could at least save again.

@LuckyMD

I can replicate that with:

import scanpy as sc
pbmc = sc.datasets.pbmc68k_reduced()
pbmc.write("tmp.h5ad")
fromdisk = sc.read("tmp.h5ad")  # Do we read okay
fromdisk.write(pbmc)  # Can we round trip

Some context around this, and my current thinking on a solution:

  • h5py doesn't do fixed length unicode strings
  • h5py does do variable length unicode strings, pretty much anywhere
  • zarr doesn't do variable length strings in structured arrays
  • We probably don't actually want to use fixed length unicode strings much. Bytestrings, more likely.
  • We can probably just add another element type to allow special handling for these. I think it'd be fine to not do np.str_ type arrays.

Sorry for the lack of minimal reproducible example... and thanks for creating one :).

We probably don't actually want to use fixed length unicode strings much. Bytestrings, more likely.

Where might bytestrings be useful? If you say text, I have a very strong opposing opinion as I’m a survivor of Python 2 and don’t want to see an UnicodeDecodeError in my life again 😉

It is, too bad numpy has no good variable-length string array type.

When would bytes make sense? Bytes just mean “data, but I don’t know its structure or am about to write it to disk”