"invalid shape in fixed-type tuple" when saving adata after ranking genes

Question

"invalid shape in fixed-type tuple" when saving adata after ranking genes

dylkot opened this issue 5 years ago · 9 comments

I'm using Scanpy with the following software versions:

python==3.7
scanpy==1.4.4
numpy==1.17.2
anndata==0.6.22.post1

on Ubuntu 18.04. I am able to save my AnnData object just fine with

sc.write(results_file, adata)

and to load it again with

adata = sc.read(results_file)

however if I save it after I run the command

sc.tl.rank_genes_groups(adata, 'louvain12_lab', method='wilcoxon')

the AnnData object will save but when I try to reload it, I get an error message:

ValueError                                Traceback (most recent call last)
<ipython-input-141-159082f1696f> in <module>
      1 results_file = os.path.join(adir, '{project}.count_{count}.gene_{gene}.mito_{mito}.HVGs_{nhvgs}.TPT.{log}.scale.TEST.h5ad'.format(project=project_name, count=count_thresh, gene=gene_thresh, mito=mitothresh, nhvgs=nhvgs, log=logstatus))
      2 print(results_file)
----> 3 adata = sc.read(results_file)

/opt/miniconda3/envs/py37/lib/python3.7/site-packages/scanpy/readwrite.py in read(filename, backed, sheet, ext, delimiter, first_column_names, backup_url, cache, **kwargs)
     95             filename, backed=backed, sheet=sheet, ext=ext,
     96             delimiter=delimiter, first_column_names=first_column_names,
---> 97             backup_url=backup_url, cache=cache, **kwargs,
     98         )
     99     # generate filename and read to dict

/opt/miniconda3/envs/py37/lib/python3.7/site-packages/scanpy/readwrite.py in _read(filename, backed, sheet, ext, delimiter, first_column_names, backup_url, cache, suppress_cache_warning, **kwargs)
    497     if ext in {'h5', 'h5ad'}:
    498         if sheet is None:
--> 499             return read_h5ad(filename, backed=backed)
    500         else:
    501             logg.debug(f'reading sheet {sheet} from file {filename}')

/opt/miniconda3/envs/py37/lib/python3.7/site-packages/anndata/readwrite/read.py in read_h5ad(filename, backed, chunk_size)
    445     else:
    446         # load everything into memory
--> 447         constructor_args = _read_args_from_h5ad(filename=filename, chunk_size=chunk_size)
    448         X = constructor_args[0]
    449         dtype = None

/opt/miniconda3/envs/py37/lib/python3.7/site-packages/anndata/readwrite/read.py in _read_args_from_h5ad(adata, filename, mode, chunk_size)
    484             d[key] = None
    485         else:
--> 486             _read_key_value_from_h5(f, d, key, chunk_size=chunk_size)
    487     # backwards compat: save X with the correct name
    488     if 'X' not in d:

/opt/miniconda3/envs/py37/lib/python3.7/site-packages/anndata/readwrite/read.py in _read_key_value_from_h5(f, d, key, key_write, chunk_size)
    508         d[key_write] = OrderedDict() if key == 'uns' else {}
    509         for k in f[key].keys():
--> 510             _read_key_value_from_h5(f, d[key_write], key + '/' + k, k, chunk_size)
    511         return
    512 

/opt/miniconda3/envs/py37/lib/python3.7/site-packages/anndata/readwrite/read.py in _read_key_value_from_h5(f, d, key, key_write, chunk_size)
    508         d[key_write] = OrderedDict() if key == 'uns' else {}
    509         for k in f[key].keys():
--> 510             _read_key_value_from_h5(f, d[key_write], key + '/' + k, k, chunk_size)
    511         return
    512 

/opt/miniconda3/envs/py37/lib/python3.7/site-packages/anndata/readwrite/read.py in _read_key_value_from_h5(f, d, key, key_write, chunk_size)
    542         return key, value
    543 
--> 544     key, value = postprocess_reading(key, value)
    545     d[key_write] = value
    546     return

/opt/miniconda3/envs/py37/lib/python3.7/site-packages/anndata/readwrite/read.py in postprocess_reading(key, value)
    539             new_dtype = [((dt[0], 'U{}'.format(int(int(dt[1][2:])/4)))
    540                           if dt[1][1] == 'S' else dt) for dt in value.dtype.descr]
--> 541             value = value.astype(new_dtype)
    542         return key, value
    543 

ValueError: invalid shape in fixed-type tuple.

Any idea what is going on or what I can do to make it past this error? It only started happening after I updated my operating system to Ubuntu 18 and my Python to 3.7 and reinstalled scanpy from conda.

Answer 1 · 2019-09-11T19:26:50.000Z

Hi Dylan,

This is an issue with the new h5py package, which @ivirshup already fixed on master (928d475). For now, you can downgrade your h5py package to 2.9.0 using pip install h5py==2.9.0 as a workaround.

Answer 2 · 2019-09-11T19:28:25.000Z

@gokceneraslan - thanks for the fast response. This broke our (cellxgene) travis pipeline as well. Do you have any info on eta for a fix/workaround other than pinning the module version? TY!

Answer 3 · 2019-09-12T04:28:06.000Z

Just a heads up, there is a remaining issue on anndata master where reading older files with h5py 2.10.0 results in bytestring indexes.

…

On Sep 12, 2019, at 05:28, Bruce Martin ***@***.***> wrote: @gokceneraslan - thanks for the fast response. This broke our (cellxgene) travis pipeline as well. Do you have any info on eta for a fix/workaround other than pinning the module version? TY! — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

Answer 4 · 2019-10-22T13:43:09.000Z

I'm having the same error with h5py==2.9.0. Cellxgene doesn't seem to be working with the object that I created the object with scanpy 1.4.3+116.g0075c62. I can however load it again with that version. But when I downgrade to 1.3.7 (recommendation from @mbuttner who had the same cellxgene issue) I can no longer load the object and get the above error.

Back in the 1.4.3 dev version scanpy it no longer writes the object after loading, and gives me the following error:

In [23]: adata.write("cellxgene.h5ad")                                                                                                                                                         
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-23-33b15d710f71> in <module>
----> 1 adata.write("cellxgene.h5ad")

~/new_anndata/anndata/anndata/core/anndata.py in write_h5ad(self, filename, compression, compression_opts, force_dense)
   2222             compression=compression,
   2223             compression_opts=compression_opts,
-> 2224             force_dense=force_dense,
   2225         )
   2226 

~/new_anndata/anndata/anndata/readwrite/h5ad.py in write_h5ad(filepath, adata, force_dense, dataset_kwargs, **kwargs)
     90         write_attribute(f, "varp", adata.varp, dataset_kwargs)
     91         write_attribute(f, "layers", adata.layers, dataset_kwargs)
---> 92         write_attribute(f, "uns", adata.uns, dataset_kwargs)
     93         write_attribute(f, "raw", adata.raw, dataset_kwargs)
     94     if adata.isbacked:

~/new_anndata/anndata/anndata/readwrite/h5ad.py in write_attribute(f, key, value, dataset_kwargs)
    103     if key in f:
    104         del f[key]
--> 105     _write_method(type(value))(f, key, value, dataset_kwargs)
    106 
    107 

~/new_anndata/anndata/anndata/readwrite/h5ad.py in write_mapping(f, key, value, dataset_kwargs)
    203 def write_mapping(f, key, value, dataset_kwargs=MappingProxyType({})):
    204     for sub_key, sub_value in value.items():
--> 205         write_attribute(f, f"{key}/{sub_key}", sub_value, dataset_kwargs)
    206 
    207 

~/new_anndata/anndata/anndata/readwrite/h5ad.py in write_attribute(f, key, value, dataset_kwargs)
    103     if key in f:
    104         del f[key]
--> 105     _write_method(type(value))(f, key, value, dataset_kwargs)
    106 
    107 

~/new_anndata/anndata/anndata/readwrite/h5ad.py in write_mapping(f, key, value, dataset_kwargs)
    203 def write_mapping(f, key, value, dataset_kwargs=MappingProxyType({})):
    204     for sub_key, sub_value in value.items():
--> 205         write_attribute(f, f"{key}/{sub_key}", sub_value, dataset_kwargs)
    206 
    207 

~/new_anndata/anndata/anndata/readwrite/h5ad.py in write_attribute(f, key, value, dataset_kwargs)
    103     if key in f:
    104         del f[key]
--> 105     _write_method(type(value))(f, key, value, dataset_kwargs)
    106 
    107 

~/new_anndata/anndata/anndata/readwrite/h5ad.py in write_array(f, key, value, dataset_kwargs)
    152     elif value.dtype.names is not None:
    153         value = _to_hdf5_vlen_strings(value)
--> 154     f.create_dataset(key, data=value, **dataset_kwargs)
    155 
    156 

~/new_anndata/anndata/anndata/h5py/h5sparse.py in create_dataset(self, name, data, chunk_size, **kwargs)
    151         if not isinstance(data, SparseDataset) and not ss.issparse(data):
    152             return self.h5py_group.create_dataset(
--> 153                 name=name, data=data, **kwargs
    154             )
    155         if self.force_dense:

~/anaconda3/envs/sc-tutorial/lib/python3.6/site-packages/h5py/_hl/group.py in create_dataset(self, name, shape, dtype, data, **kwds)
    134 
    135         with phil:
--> 136             dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
    137             dset = dataset.Dataset(dsid)
    138             if name is not None:

~/anaconda3/envs/sc-tutorial/lib/python3.6/site-packages/h5py/_hl/dataset.py in make_new_dset(parent, shape, dtype, data, chunks, compression, shuffle, fletcher32, maxshape, compression_opts, fillvalue, scaleoffset, track_times, external, track_order, dcpl)
    116         else:
    117             dtype = numpy.dtype(dtype)
--> 118         tid = h5t.py_create(dtype, logical=1)
    119 
    120     # Legacy

h5py/h5t.pyx in h5py.h5t.py_create()

h5py/h5t.pyx in h5py.h5t.py_create()

h5py/h5t.pyx in h5py.h5t.py_create()

h5py/h5t.pyx in h5py.h5t._c_compound()

h5py/h5t.pyx in h5py.h5t.py_create()

h5py/h5t.pyx in h5py.h5t.py_create()

TypeError: Object dtype dtype('O') has no native HDF5 equivalent

Everything however seems to work fine when I throw out the rank_genes_groups results from adata.uns

Edit: actually cellxgene still isn't working, but I could at least save again.

Answer 5 · 2019-10-23T09:03:52.000Z

@LuckyMD

I can replicate that with:

import scanpy as sc
pbmc = sc.datasets.pbmc68k_reduced()
pbmc.write("tmp.h5ad")
fromdisk = sc.read("tmp.h5ad")  # Do we read okay
fromdisk.write(pbmc)  # Can we round trip

Some context around this, and my current thinking on a solution:

h5py doesn't do fixed length unicode strings
h5py does do variable length unicode strings, pretty much anywhere
zarr doesn't do variable length strings in structured arrays
We probably don't actually want to use fixed length unicode strings much. Bytestrings, more likely.
We can probably just add another element type to allow special handling for these. I think it'd be fine to not do np.str_ type arrays.

Answer 6 · 2019-10-23T09:15:06.000Z

Sorry for the lack of minimal reproducible example... and thanks for creating one :).

Answer 7 · 2019-10-23T11:53:57.000Z

We probably don't actually want to use fixed length unicode strings much. Bytestrings, more likely.

Where might bytestrings be useful? If you say text, I have a very strong opposing opinion as I’m a survivor of Python 2 and don’t want to see an UnicodeDecodeError in my life again 😉

Answer 8 · 2019-10-23T13:53:09.000Z

I think fixed length bytestrings would be useful when the data isn’t actually text. I think the assumption of a fixed length with text data, especially if it might have Unicode characters, is just asking for trouble.

Answer 9 · 2019-10-24T13:01:43.000Z

It is, too bad numpy has no good variable-length string array type.

When would bytes make sense? Bytes just mean “data, but I don’t know its structure or am about to write it to disk”