"invalid shape in fixed-type tuple" when saving adata after ranking genes
dylkot opened this issue · 9 comments
I'm using Scanpy with the following software versions:
python==3.7
scanpy==1.4.4
numpy==1.17.2
anndata==0.6.22.post1
on Ubuntu 18.04. I am able to save my AnnData object just fine with
sc.write(results_file, adata)
and to load it again with
adata = sc.read(results_file)
however if I save it after I run the command
sc.tl.rank_genes_groups(adata, 'louvain12_lab', method='wilcoxon')
the AnnData object will save but when I try to reload it, I get an error message:
ValueError Traceback (most recent call last)
<ipython-input-141-159082f1696f> in <module>
1 results_file = os.path.join(adir, '{project}.count_{count}.gene_{gene}.mito_{mito}.HVGs_{nhvgs}.TPT.{log}.scale.TEST.h5ad'.format(project=project_name, count=count_thresh, gene=gene_thresh, mito=mitothresh, nhvgs=nhvgs, log=logstatus))
2 print(results_file)
----> 3 adata = sc.read(results_file)
/opt/miniconda3/envs/py37/lib/python3.7/site-packages/scanpy/readwrite.py in read(filename, backed, sheet, ext, delimiter, first_column_names, backup_url, cache, **kwargs)
95 filename, backed=backed, sheet=sheet, ext=ext,
96 delimiter=delimiter, first_column_names=first_column_names,
---> 97 backup_url=backup_url, cache=cache, **kwargs,
98 )
99 # generate filename and read to dict
/opt/miniconda3/envs/py37/lib/python3.7/site-packages/scanpy/readwrite.py in _read(filename, backed, sheet, ext, delimiter, first_column_names, backup_url, cache, suppress_cache_warning, **kwargs)
497 if ext in {'h5', 'h5ad'}:
498 if sheet is None:
--> 499 return read_h5ad(filename, backed=backed)
500 else:
501 logg.debug(f'reading sheet {sheet} from file {filename}')
/opt/miniconda3/envs/py37/lib/python3.7/site-packages/anndata/readwrite/read.py in read_h5ad(filename, backed, chunk_size)
445 else:
446 # load everything into memory
--> 447 constructor_args = _read_args_from_h5ad(filename=filename, chunk_size=chunk_size)
448 X = constructor_args[0]
449 dtype = None
/opt/miniconda3/envs/py37/lib/python3.7/site-packages/anndata/readwrite/read.py in _read_args_from_h5ad(adata, filename, mode, chunk_size)
484 d[key] = None
485 else:
--> 486 _read_key_value_from_h5(f, d, key, chunk_size=chunk_size)
487 # backwards compat: save X with the correct name
488 if 'X' not in d:
/opt/miniconda3/envs/py37/lib/python3.7/site-packages/anndata/readwrite/read.py in _read_key_value_from_h5(f, d, key, key_write, chunk_size)
508 d[key_write] = OrderedDict() if key == 'uns' else {}
509 for k in f[key].keys():
--> 510 _read_key_value_from_h5(f, d[key_write], key + '/' + k, k, chunk_size)
511 return
512
/opt/miniconda3/envs/py37/lib/python3.7/site-packages/anndata/readwrite/read.py in _read_key_value_from_h5(f, d, key, key_write, chunk_size)
508 d[key_write] = OrderedDict() if key == 'uns' else {}
509 for k in f[key].keys():
--> 510 _read_key_value_from_h5(f, d[key_write], key + '/' + k, k, chunk_size)
511 return
512
/opt/miniconda3/envs/py37/lib/python3.7/site-packages/anndata/readwrite/read.py in _read_key_value_from_h5(f, d, key, key_write, chunk_size)
542 return key, value
543
--> 544 key, value = postprocess_reading(key, value)
545 d[key_write] = value
546 return
/opt/miniconda3/envs/py37/lib/python3.7/site-packages/anndata/readwrite/read.py in postprocess_reading(key, value)
539 new_dtype = [((dt[0], 'U{}'.format(int(int(dt[1][2:])/4)))
540 if dt[1][1] == 'S' else dt) for dt in value.dtype.descr]
--> 541 value = value.astype(new_dtype)
542 return key, value
543
ValueError: invalid shape in fixed-type tuple.
Any idea what is going on or what I can do to make it past this error? It only started happening after I updated my operating system to Ubuntu 18 and my Python to 3.7 and reinstalled scanpy from conda.
@gokceneraslan - thanks for the fast response. This broke our (cellxgene) travis pipeline as well. Do you have any info on eta for a fix/workaround other than pinning the module version? TY!
I'm having the same error with h5py==2.9.0
. Cellxgene doesn't seem to be working with the object that I created the object with scanpy 1.4.3+116.g0075c62
. I can however load it again with that version. But when I downgrade to 1.3.7 (recommendation from @mbuttner who had the same cellxgene issue) I can no longer load the object and get the above error.
Back in the 1.4.3 dev version scanpy it no longer writes the object after loading, and gives me the following error:
In [23]: adata.write("cellxgene.h5ad")
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-23-33b15d710f71> in <module>
----> 1 adata.write("cellxgene.h5ad")
~/new_anndata/anndata/anndata/core/anndata.py in write_h5ad(self, filename, compression, compression_opts, force_dense)
2222 compression=compression,
2223 compression_opts=compression_opts,
-> 2224 force_dense=force_dense,
2225 )
2226
~/new_anndata/anndata/anndata/readwrite/h5ad.py in write_h5ad(filepath, adata, force_dense, dataset_kwargs, **kwargs)
90 write_attribute(f, "varp", adata.varp, dataset_kwargs)
91 write_attribute(f, "layers", adata.layers, dataset_kwargs)
---> 92 write_attribute(f, "uns", adata.uns, dataset_kwargs)
93 write_attribute(f, "raw", adata.raw, dataset_kwargs)
94 if adata.isbacked:
~/new_anndata/anndata/anndata/readwrite/h5ad.py in write_attribute(f, key, value, dataset_kwargs)
103 if key in f:
104 del f[key]
--> 105 _write_method(type(value))(f, key, value, dataset_kwargs)
106
107
~/new_anndata/anndata/anndata/readwrite/h5ad.py in write_mapping(f, key, value, dataset_kwargs)
203 def write_mapping(f, key, value, dataset_kwargs=MappingProxyType({})):
204 for sub_key, sub_value in value.items():
--> 205 write_attribute(f, f"{key}/{sub_key}", sub_value, dataset_kwargs)
206
207
~/new_anndata/anndata/anndata/readwrite/h5ad.py in write_attribute(f, key, value, dataset_kwargs)
103 if key in f:
104 del f[key]
--> 105 _write_method(type(value))(f, key, value, dataset_kwargs)
106
107
~/new_anndata/anndata/anndata/readwrite/h5ad.py in write_mapping(f, key, value, dataset_kwargs)
203 def write_mapping(f, key, value, dataset_kwargs=MappingProxyType({})):
204 for sub_key, sub_value in value.items():
--> 205 write_attribute(f, f"{key}/{sub_key}", sub_value, dataset_kwargs)
206
207
~/new_anndata/anndata/anndata/readwrite/h5ad.py in write_attribute(f, key, value, dataset_kwargs)
103 if key in f:
104 del f[key]
--> 105 _write_method(type(value))(f, key, value, dataset_kwargs)
106
107
~/new_anndata/anndata/anndata/readwrite/h5ad.py in write_array(f, key, value, dataset_kwargs)
152 elif value.dtype.names is not None:
153 value = _to_hdf5_vlen_strings(value)
--> 154 f.create_dataset(key, data=value, **dataset_kwargs)
155
156
~/new_anndata/anndata/anndata/h5py/h5sparse.py in create_dataset(self, name, data, chunk_size, **kwargs)
151 if not isinstance(data, SparseDataset) and not ss.issparse(data):
152 return self.h5py_group.create_dataset(
--> 153 name=name, data=data, **kwargs
154 )
155 if self.force_dense:
~/anaconda3/envs/sc-tutorial/lib/python3.6/site-packages/h5py/_hl/group.py in create_dataset(self, name, shape, dtype, data, **kwds)
134
135 with phil:
--> 136 dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
137 dset = dataset.Dataset(dsid)
138 if name is not None:
~/anaconda3/envs/sc-tutorial/lib/python3.6/site-packages/h5py/_hl/dataset.py in make_new_dset(parent, shape, dtype, data, chunks, compression, shuffle, fletcher32, maxshape, compression_opts, fillvalue, scaleoffset, track_times, external, track_order, dcpl)
116 else:
117 dtype = numpy.dtype(dtype)
--> 118 tid = h5t.py_create(dtype, logical=1)
119
120 # Legacy
h5py/h5t.pyx in h5py.h5t.py_create()
h5py/h5t.pyx in h5py.h5t.py_create()
h5py/h5t.pyx in h5py.h5t.py_create()
h5py/h5t.pyx in h5py.h5t._c_compound()
h5py/h5t.pyx in h5py.h5t.py_create()
h5py/h5t.pyx in h5py.h5t.py_create()
TypeError: Object dtype dtype('O') has no native HDF5 equivalent
Everything however seems to work fine when I throw out the rank_genes_groups results from adata.uns
Edit: actually cellxgene still isn't working, but I could at least save again.
I can replicate that with:
import scanpy as sc
pbmc = sc.datasets.pbmc68k_reduced()
pbmc.write("tmp.h5ad")
fromdisk = sc.read("tmp.h5ad") # Do we read okay
fromdisk.write(pbmc) # Can we round trip
Some context around this, and my current thinking on a solution:
- h5py doesn't do fixed length unicode strings
- h5py does do variable length unicode strings, pretty much anywhere
- zarr doesn't do variable length strings in structured arrays
- We probably don't actually want to use fixed length unicode strings much. Bytestrings, more likely.
- We can probably just add another element type to allow special handling for these. I think it'd be fine to not do
np.str_
type arrays.
Sorry for the lack of minimal reproducible example... and thanks for creating one :).
We probably don't actually want to use fixed length unicode strings much. Bytestrings, more likely.
Where might bytestrings be useful? If you say text, I have a very strong opposing opinion as I’m a survivor of Python 2 and don’t want to see an UnicodeDecodeError in my life again 😉
It is, too bad numpy has no good variable-length string array type.
When would bytes make sense? Bytes just mean “data, but I don’t know its structure or am about to write it to disk”