Older versions of anndata throw unintuitive errors when trying to read newer formats
Hi,
I wanted to get help on an error reading h5ads created by the 0.8.0rc version of anndata. In my experience, h5ads that are created using 0.8.0rc1 cannot be opened using older anndata versions.
How to reproduce
- In an environment with 0.8.0rc1 installed:
    import scanpy as sc
    adata = sc.datasets.pbmc3k()
    adata.write_h5ad("adata.0.8.h5ad")
- In an environment with 0.7.* installed (tested with 0.7.6 and 0.7.8):
    from anndata import read_h5ad
    ad = read_h5ad("adata.0.8.h5ad")
You get the following error:
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/opt/conda/envs/py37good/lib/python3.7/site-packages/anndata/_io/utils.py in func_wrapper(elem, *args, **kwargs)
    176         try:
--> 177             return func(elem, *args, **kwargs)
    178         except Exception as e:

/opt/conda/envs/py37good/lib/python3.7/site-packages/anndata/_io/h5ad.py in read_group(group)
    526     if encoding_type:
--> 527         EncodingVersions[encoding_type].check(
    528             group.name, group.attrs["encoding-version"]

/opt/conda/envs/py37good/lib/python3.7/enum.py in __getitem__(cls, name)
    356     def __getitem__(cls, name):
--> 357         return cls._member_map_[name]
    358

KeyError: 'dict'

During handling of the above exception, another exception occurred:

AnnDataReadError                          Traceback (most recent call last)
~/tmp/ipykernel_17002/906833588.py in <module>
----> 1 ad = read_h5ad("adata.0.8.h5ad")

/opt/conda/envs/py37good/lib/python3.7/site-packages/anndata/_io/h5ad.py in read_h5ad(filename, backed, as_sparse, as_sparse_fmt, chunk_size)
    419                 d[k] = read_dataframe(f[k])
    420             else:  # Base case
--> 421                 d[k] = read_attribute(f[k])
    422
    423         d["raw"] = _read_raw(f, as_sparse, rdasp)

/opt/conda/envs/py37good/lib/python3.7/functools.py in wrapper(*args, **kw)
    838                             '1 positional argument')
    839
--> 840         return dispatch(args[0].__class__)(*args, **kw)
    841
    842     funcname = getattr(func, '__name__', 'singledispatch function')

/opt/conda/envs/py37good/lib/python3.7/site-packages/anndata/_io/utils.py in func_wrapper(elem, *args, **kwargs)
    182             parent = _get_parent(elem)
    183             raise AnnDataReadError(
--> 184                 f"Above error raised while reading key {elem.name!r} of "
    185                 f"type {type(elem)} from {parent}."
    186             )

AnnDataReadError: Above error raised while reading key '/layers' of type <class 'h5py._hl.group.Group'> from /.
I'm using h5py==3.6.0. Let me know if you need me to list anything else about my environment.
Thanks!
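For readers tracing the error: the inner `KeyError: 'dict'` in the traceback comes from looking up the stored encoding type by name on an `Enum`, where `'dict'` is not a member in the 0.7.x reader. The sketch below reproduces that failure mode with a hypothetical stand-in enum (`KnownEncodings`, its members, and `UnsupportedEncodingError` are illustrative names, not anndata's actual `EncodingVersions`) and shows one way such a lookup could surface a clearer message.

```python
from enum import Enum


class KnownEncodings(Enum):
    # Hypothetical stand-in for the encodings an older reader understands.
    csr_matrix = "sparse matrix"
    dataframe = "data frame"


class UnsupportedEncodingError(Exception):
    """Raised when a file uses an encoding type this reader does not know."""


def lookup_encoding(encoding_type: str) -> KnownEncodings:
    try:
        # Enum name lookup raises a bare KeyError for unknown names,
        # which is what surfaces as KeyError: 'dict' in the traceback.
        return KnownEncodings[encoding_type]
    except KeyError:
        raise UnsupportedEncodingError(
            f"Unknown encoding type {encoding_type!r}; the file may have been "
            "written by a newer version of the library."
        ) from None


try:
    lookup_encoding("dict")
except UnsupportedEncodingError as err:
    message = str(err)
    print(message)
```

Catching the `KeyError` at the lookup site keeps the message close to the actual cause, instead of leaving the user to connect an enum error to a file format change.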
Hey, this is expected. What you're looking for would be forward compatibility.
Sometimes we update the format of AnnData objects stored on disk. We can't really make older versions of the library know how to deal with this. We've actually added some internal features in the new version which should make having some form of forward compatibility easier in the future (even if it's just writing older versions of the schema).
Is there a reason you'd need to keep using older versions of the library once this is released?
Worst case we could make another release in the 0.7.x series with smaller forward compatible changes, but I'd need to know it's needed first.
Thanks for your thoughtful response. This is indeed a big concern for us. We have a substantial amount of infrastructure that's using h5ads and we can't always upgrade everything in tandem. In addition, changes to the h5ad file format can break external tools, eg R code that is reading from these files using R hdf5 libraries.
I understand it's useful to make changes to the h5ad file format periodically to make it better, but I'd suggest a few things to make sure doing so doesn't break the whole ecosystem:
- Embed a file format version that would be surfaced during any reading errors -- it should be possible to warn users that they're using an outdated anndata version.
- Only make breaking changes in major version upgrades (I suppose 0.8 would be a major version).
- Carefully document any potentially breaking changes to the file format in the version notes. While the current version documentation indicates some file format changes, it's hard to see how the above error about layers relates to the version notes about file formats. I would explicitly state in the IO Specification section of the release notes that files written by anndata>=0.8.0 won't be readable by anndata<0.8.0. (Right now it says "Internal handling of IO has been overhauled." which suggests the file format is consistent while read/write logic has changed.)
- Write out a full document spec, eg what h5 slots have what in them (I know this is a heavy lift).
Moving forward, I'd recommend:
- Adding an explicit file format version to h5ad
- Cutting a 0.7.9 release that's backwards compatible but also capable of reading the version string and generating errors when new file formats are being read.
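On the reader side, the two recommendations above could look roughly like the following. This is a minimal sketch under stated assumptions, not anndata's implementation: `FORMAT_VERSION_KEY`, `MAX_SUPPORTED`, `FormatTooNewError`, and `check_format_version` are hypothetical names, and the file's root attributes are modeled as a plain dict.

```python
# Hypothetical attribute name and supported version; assumptions for illustration only.
FORMAT_VERSION_KEY = "format-version"
MAX_SUPPORTED = (0, 1, 0)


class FormatTooNewError(Exception):
    """The file declares a format version newer than this reader supports."""


def parse_version(text: str) -> tuple:
    # "0.2.0" -> (0, 2, 0); assumes purely numeric dotted versions.
    return tuple(int(part) for part in text.split("."))


def check_format_version(root_attrs: dict) -> tuple:
    """Return the declared version, or raise a clear error if it is too new."""
    declared = root_attrs.get(FORMAT_VERSION_KEY)
    if declared is None:
        # Files written before the version string existed carry no marker.
        return (0, 0, 0)
    version = parse_version(declared)
    if version > MAX_SUPPORTED:
        supported = ".".join(map(str, MAX_SUPPORTED))
        raise FormatTooNewError(
            f"File format version {declared} is newer than the newest version "
            f"this reader supports ({supported}). Try upgrading the reader."
        )
    return version
```

Running the check before any element is read means the user sees a single version message rather than an error about whichever key happens to fail first.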
Thanks for all the information!
A number of the issues you raise are actually topics we're trying to address right now (and this release provides some solutions for), but it's very useful to get feedback on our approach.
> In addition, changes to the h5ad file format can break external tools, eg R code that is reading from these files using R hdf5 libraries.
Very aware of this. We're going for a fairly long release candidate cycle (one month at least) to make sure downstream packages have time to fix compatibility, or at least pin dependencies or error gracefully.
Moving forward, we're looking at having a selected set of tools to run integration tests against, but this will take some time and resources.
> - Embed a file format version that would be surfaced during any reading errors -- it should be possible to warn users that they're using an outdated anndata version.
The file format version is something that's new this version!
How and when to warn users is an interesting question, though. This version throws a warning for files written by very old anndata versions, where we still have to "just know" how each element should be read in. But how old does a file need to be before we warn, and how loudly?
We can be more explicit about this; see #699.
> - Write out a full document spec, eg what h5 slots have what in them (I know this is a heavy lift).
I am interested in having something more formal here. Possibly a bike shed schema?
At the moment we have the on-disk format page in the docs. This does have information about every current encoding type (and has been updated for this release), but I haven't figured out a good way to present past encoding information. Recommendations welcome!
> Cutting a 0.7.9 release
Will look into this. Another thing that has changed during this release cycle is that we set up a system for feature and bug-fix branches, so there may be unforeseen difficulties in doing anything other than a commit off the last release.
Thanks, @ivirshup for your detailed response!
I should have started with: I know you're mostly single-handedly holding down the fort on anndata and we greatly appreciate your continued development here.
I don't think we have a lot of spare capacity right now to help with implementation but would be happy to provide feedback on planning and PRs.
Hey there, I recently noticed this issue. It's unfortunate, and I think it will stand in the way of wider AnnData adoption if not properly addressed, which would be a shame. I really like the AnnData abstraction and I'd like to see it stick around. Let me know if there's anything I can do. I'm a software engineer and I have some bandwidth to help contribute.
@ivirshup - you wrote in an earlier comment:
> The file format version is something that's new this version!
Other than introspecting on the encoding versions in the on-disk file format, is there a file format version that can be inspected for anndata 0.8? We want to be able to enforce anndata 0.8 for datasets being submitted to the cellxgene data portal in a future release.
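The introspection mentioned here can be done by reading the `encoding-type` and `encoding-version` attributes straight from the HDF5 file. A sketch, assuming (from the attribute names in the traceback above and my reading of the on-disk format docs, not confirmed in this thread) that files written by 0.8 carry these attributes on the root group and on each element. `collect_encodings` and `summarize_h5ad` are hypothetical helper names, and `h5py` must be installed for the second one.

```python
def collect_encodings(node, path="/"):
    """Map each group/dataset path to its (encoding-type, encoding-version).

    `node` may be an h5py.File or h5py.Group; any object with `.attrs`
    (and `.items()` for groups) works. Missing attributes are reported as None.
    """
    found = {
        path: (node.attrs.get("encoding-type"), node.attrs.get("encoding-version"))
    }
    if hasattr(node, "items"):  # groups have children, datasets do not
        for name, child in node.items():
            child_path = path.rstrip("/") + "/" + name
            found.update(collect_encodings(child, child_path))
    return found


def summarize_h5ad(filename):
    import h5py  # imported lazily so collect_encodings has no hard dependency

    with h5py.File(filename, "r") as f:
        return collect_encodings(f)
```

Calling `summarize_h5ad("adata.0.8.h5ad")` would return a dict keyed by HDF5 path. A submission check could then reject files whose root entry is `(None, None)`, on the assumption that older writers did not set these attributes on the root group.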
@ivirshup, one more consideration is adopting "encoder" and "encoder-version" that mudata has.
This issue has been automatically marked as stale because it has not had recent activity.
Please add a comment if you want to keep the issue open. Thank you for your contributions!
I believe this has now been addressed for future versions of anndata through our encoding mechanism, so will close this.
Not only future ones, #734 ended up in 0.9.0