Older versions of anndata throw unintuitive errors when trying to read newer formats

Question

Closed this issue a year ago · 11 comments

Hi,

I wanted to get help on an error reading h5ads created by the 0.8.0rc version of anndata. In my experience, h5ads that are created using 0.8.0rc1 cannot be opened using older anndata versions.

How to reproduce

In an environment with 0.8.0rc1 installed:

import scanpy as sc
adata = sc.datasets.pbmc3k()
adata.write_h5ad("adata.0.8.h5ad")

In an environment with 0.7.* installed (tested with 0.7.6 and 0.7.8):

from anndata import read_h5ad
ad = read_h5ad("adata.0.8.h5ad")

You get the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/opt/conda/envs/py37good/lib/python3.7/site-packages/anndata/_io/utils.py in func_wrapper(elem, *args, **kwargs)
    176         try:
--> 177             return func(elem, *args, **kwargs)
    178         except Exception as e:

/opt/conda/envs/py37good/lib/python3.7/site-packages/anndata/_io/h5ad.py in read_group(group)
    526     if encoding_type:
--> 527         EncodingVersions[encoding_type].check(
    528             group.name, group.attrs["encoding-version"]

/opt/conda/envs/py37good/lib/python3.7/enum.py in __getitem__(cls, name)
    356     def __getitem__(cls, name):
--> 357         return cls._member_map_[name]
    358 

KeyError: 'dict'

During handling of the above exception, another exception occurred:

AnnDataReadError                          Traceback (most recent call last)
~/tmp/ipykernel_17002/906833588.py in <module>
----> 1 ad = read_h5ad("adata.0.8.h5ad")

/opt/conda/envs/py37good/lib/python3.7/site-packages/anndata/_io/h5ad.py in read_h5ad(filename, backed, as_sparse, as_sparse_fmt, chunk_size)
    419                 d[k] = read_dataframe(f[k])
    420             else:  # Base case
--> 421                 d[k] = read_attribute(f[k])
    422 
    423         d["raw"] = _read_raw(f, as_sparse, rdasp)

/opt/conda/envs/py37good/lib/python3.7/functools.py in wrapper(*args, **kw)
    838                             '1 positional argument')
    839 
--> 840         return dispatch(args[0].__class__)(*args, **kw)
    841 
    842     funcname = getattr(func, '__name__', 'singledispatch function')

/opt/conda/envs/py37good/lib/python3.7/site-packages/anndata/_io/utils.py in func_wrapper(elem, *args, **kwargs)
    182                 parent = _get_parent(elem)
    183                 raise AnnDataReadError(
--> 184                     f"Above error raised while reading key {elem.name!r} of "
    185                     f"type {type(elem)} from {parent}."
    186                 )

AnnDataReadError: Above error raised while reading key '/layers' of type <class 'h5py._hl.group.Group'> from /.

I'm using h5py==3.6.0. Let me know if you need me to list anything else about my environment.

Thanks!

Answer 1 · 2022-02-01T17:10:45.000Z

Hey, this is expected. What you're looking for would be forward compatibility.

Sometimes we update the format of AnnData objects stored on disk. We can't really make older versions of the library know how to deal with this. We've actually added some internal features in the new version which should make some form of forward compatibility easier in the future (even if it's just writing older versions of the schema).

Is there a reason you'd need to keep using older versions of the library once this is released?

Worst case we could make another release in the 0.7.x series with smaller forward compatible changes, but I'd need to know it's needed first.

Answer 2 · 2022-02-01T18:42:31.000Z

Thanks for your thoughtful response. This is indeed a big concern for us. We have a substantial amount of infrastructure that's using h5ads and we can't always upgrade everything in tandem. In addition, changes to the h5ad file format can break external tools, eg R code that is reading from these files using R hdf5 libraries.

I understand it's useful to make changes to the h5ad file format periodically to make it better, but I'd suggest a few things to make sure doing so doesn't break the whole ecosystem:

- Embed a file format version that would be surfaced during any reading errors -- it should be possible to warn users that they're using an outdated anndata version.
- Only make breaking changes in major version upgrades (I suppose 0.8 would be a major version).
- Carefully document any potentially breaking changes to the file format in the version notes. While the current version documentation indicates some file format changes, it's hard to see how the above error about layers relates to the version notes about file formats. I would explicitly state in the IO Specification section of the release notes that files written by anndata>=0.8.0 won't be readable by anndata<0.8.0. (Right now it says "Internal handling of IO has been overhauled." which suggests the file format is consistent while read/write logic has changed.)
- Write out a full document spec, eg what h5 slots have what in them (I know this is a heavy lift).

Moving forward, I'd recommend:

- Adding an explicit file format version to h5ad
- Cutting a 0.7.9 release that's backwards compatible but also capable of reading the version string and generating errors when new file formats are being read.
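
To make the second recommendation concrete, here is a rough sketch of what such a check might look like. It is an illustration under assumptions, not anndata's actual code: the attribute names `encoding-type` and `encoding-version` are the ones the traceback above reads from groups, but `KNOWN_ROOT_ENCODINGS`, `UnsupportedFormatError`, `check_root_encoding`, and `check_h5ad` are hypothetical names, and whether the root group carries these attributes should be confirmed against the on-disk format docs.

```python
from typing import Mapping

# Hypothetical table of root encodings this reader understands.
# This is an assumption for illustration, not anndata's real registry.
KNOWN_ROOT_ENCODINGS = {"anndata": {"0.1.0"}}


class UnsupportedFormatError(RuntimeError):
    """Raised when a file declares an encoding this reader does not know."""


def check_root_encoding(attrs: Mapping, known=KNOWN_ROOT_ENCODINGS) -> None:
    """Fail early, with a readable message, on an unknown file format."""
    encoding_type = attrs.get("encoding-type")
    if encoding_type is None:
        # No root metadata: treat as a legacy file and let the reader proceed.
        return
    version = attrs.get("encoding-version")
    if version not in known.get(encoding_type, set()):
        raise UnsupportedFormatError(
            f"File declares {encoding_type!r} encoding version {version!r}, "
            "which this reader does not support. "
            "It was probably written by a newer anndata; try upgrading."
        )


def check_h5ad(path: str) -> None:
    """Open the file with h5py and validate its root attributes."""
    import h5py  # imported lazily so the pure check above has no dependencies

    with h5py.File(path, "r") as f:
        check_root_encoding(f.attrs)
```

An older reader would ship with an empty table (`known={}`), so any file that declares a root encoding would produce this message instead of `KeyError: 'dict'`.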

cc @gdesmarais-ctx

Answer 3 · 2022-02-02T17:53:39.000Z

Thanks for all the information!

A number of the issues you raise are actually topics we're trying to address right now (and this release provides some solutions for), but it's very useful to get feedback on our approach.

> In addition, changes to the h5ad file format can break external tools, eg R code that is reading from these files using R hdf5 libraries.

Very aware of this. We're going for a fairly long release candidate cycle (1 month at least) to make sure downstream packages have time to fix compatibility, or at least pin dependencies or error gracefully.

Moving forward, we're looking at having a selected set of tools to run integration tests against, but this will take some time and resources.

> Embed a file format version that would be surfaced during any reading errors -- it should be possible to warn users that they're using an outdated anndata version.

The file format version is something that's new this version!

How and when to warn users is an interesting issue though. This version throws a warning for very old anndata versions where we still have to "just know" how each element should be read in. But how old does a file need to be before we warn (and how loudly)?

We can be more explicit about this: #699

> Write out a full document spec, eg what h5 slots have what in them (I know this is a heavy lift).

I am interested in having something more formal here. Possibly a bike shed schema?

At the moment we have the on-disk format page in the docs. This does have information about every current encoding type (and has been updated for this release), but I haven't figured out a good way to present past encoding information. Recommendations welcome!

> Cutting a 0.7.9 release

Will look into this. Another thing that has changed during this release cycle is that we've set up a system for maintaining feature and bug fix branches. So there may be unforeseen difficulties in doing anything other than a commit off the last release.

Answer 4 · 2022-02-02T17:59:54.000Z

Thanks, @ivirshup, for your detailed response!

I should have started with: I know you're mostly single-handedly holding down the fort on anndata and we greatly appreciate your continued development here.

I don't think we have a lot of spare capacity right now to help with implementation but would be happy to provide feedback on planning and PRs.

cc @ryan-williams

Answer 5 · 2022-08-10T21:04:48.000Z

Hey there, I recently noticed this issue. It's unfortunate, and I think it will stand in the way of wider AnnData adoption if not properly addressed, which would be a shame. I really like the AnnData abstraction and I'd like to see it stick around. Let me know if there's anything I can do. I'm a software engineer and I have some bandwidth to help contribute.

Answer 6 · 2022-08-10T22:54:29.000Z

@ivirshup - you wrote in an earlier comment:

> The file format version is something that's new this version!

Other than introspecting on the encoding versions in the on-disk file format, is there a file format version that can be inspected for anndata 0.8? We want to be able to enforce anndata 0.8 for datasets being submitted to the cellxgene data portal in a future release.
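
For context, the introspection mentioned above might look something like the following sketch. It assumes each element stores `encoding-type` and `encoding-version` in its HDF5 attributes, as the traceback at the top of this issue suggests. `collect_encodings` and `collect_from_file` are hypothetical helper names, not anndata functions, and the walker relies on h5py groups exposing `.items()` while datasets do not.

```python
from typing import Dict, Optional, Tuple

Encoding = Tuple[Optional[str], Optional[str]]


def _encoding_of(node) -> Encoding:
    """Read the (encoding-type, encoding-version) pair from a node's attrs."""
    return (node.attrs.get("encoding-type"), node.attrs.get("encoding-version"))


def collect_encodings(group, path: str = "/") -> Dict[str, Encoding]:
    """Map every element path under `group` to its declared encoding."""
    found = {path: _encoding_of(group)}
    for name, child in group.items():
        child_path = path.rstrip("/") + "/" + name
        if hasattr(child, "items"):
            # Group-like node: recurse into its children.
            found.update(collect_encodings(child, child_path))
        else:
            # Dataset-like node: record its own attributes only.
            found[child_path] = _encoding_of(child)
    return found


def collect_from_file(filename: str) -> Dict[str, Encoding]:
    """Convenience wrapper that opens an h5ad file with h5py."""
    import h5py  # imported lazily; the walker itself only needs duck typing

    with h5py.File(filename, "r") as f:
        return collect_encodings(f)
```

Elements with no encoding metadata come back as `(None, None)`, which is one way to tell older files apart, but this only reports per-element encodings and is not a single file-level version.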

Answer 7 · 2022-08-11T10:49:28.000Z

It’s addressed here: #734

@ivirshup, could you please respond to me there?

Answer 8 · 2022-09-07T12:39:18.000Z

@ivirshup, one more consideration is adopting "encoder" and "encoder-version" that mudata has.
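
As a rough illustration of that idea, the writer could stamp its own name and version on the root group, and a reader could report them in error messages. Only the attribute names `encoder` and `encoder-version` come from the comment above. The helpers `stamp_encoder` and `describe_writer` are hypothetical, and the sketch works on any mapping, such as the `attrs` of an open h5py file.

```python
from typing import Mapping, MutableMapping, Optional


def stamp_encoder(attrs: MutableMapping, encoder: str, version: str) -> None:
    """Record which library, at which version, wrote the file."""
    attrs["encoder"] = encoder
    attrs["encoder-version"] = version


def describe_writer(attrs: Mapping) -> Optional[str]:
    """Return e.g. 'anndata 0.8.0', or None if the file was not stamped."""
    encoder = attrs.get("encoder")
    if encoder is None:
        return None
    return f"{encoder} {attrs.get('encoder-version', 'unknown version')}"


# Usage with a plain dict standing in for the root attributes of a file.
root_attrs = {}
stamp_encoder(root_attrs, "anndata", "0.8.0")
print(describe_writer(root_attrs))
```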

Answer 9 · 2023-06-21T02:23:15.000Z

This issue has been automatically marked as stale because it has not had recent activity.
Please add a comment if you want to keep the issue open. Thank you for your contributions!

Answer 10 · 2023-06-21T15:24:10.000Z

I believe this has now been addressed for future versions of anndata through our encoding mechanism, so will close this.

Answer 11 · 2023-06-22T08:26:53.000Z

Not only future ones: #734 ended up in 0.9.0.