scverse/anndata

Using `Pint` for units

ilan-gold opened this issue · 9 comments

Please describe your wishes and possible alternatives to achieve the desired result.

Continuing the discussion from #1481, we are considering Pint as both an in-memory data structure and something serializable.

It seems there are a few things that would have to be worked out:

  1. Is there a global "AnnData registry" so that objects are interoperable with one another? Could a context manager, e.g., from settings, help override behavior as needed?
  2. What would the relationship (if any) of this be to SpatialData?
  3. What should the behavior of operations like concat be when they involve more than one object?

cc @mruffalo @ivirshup @flying-sheep

I would like to add that I could see this as an enhancement for spatialdata as well; the two need not be opposed. I think it's super cool to be able to see units on arrays, and super useful for obvious readability reasons.

In terms of the issues at play, I don't think merging two objects with Pint-typed arrays (e.g., in obsm) should do anything special, i.e., no harmonization of coordinates across different omics experiments. I am not even sure the registries need to be compatible. To this end, I think it could be worth doing some sort of versioning of the registries as well (even though they are instantiated at runtime). The only reason I could see them changing is if we introduce something like contexts for conversions across domains, although it seems unlikely we would use this.

This is also a use-case for ehrapy, where we're constantly dealing with various units for measurements.

Thanks very much @ilan-gold!

I don't have much of an opinion about SpatialData yet, nor am I too familiar with its transformation methods and coordinate systems. If it isn't already using Pint, I'd expect that to be a significant improvement there also.

For your other items:

1: It's probably no surprise, but I advocate for the approach I took in #1481, in which all AnnData objects instantiated by the package use the same UnitRegistry object stored in some easily-accessible place. Users could adjust that registry's defaults, add new units, whatever they'd like. In my opinion this is the best way to ensure that users see the addition of physical units as an enhancement, making it convenient to handle physical quantities properly -- rather than an obstacle, especially when loading multiple AnnData objects from disk for integrative analysis.

3: anndata.concat already concatenates matrices in .obsm when present in multiple objects, with potentially-meaningless results for things like PCA or UMAP coordinates. If this is to be reconsidered, I strongly believe this should be done independently of adding support for physical units.

I consider this code snippet, included in #1481 and duplicated here for convenience, to be purely an improvement:

In [1]: import numpy as np

In [2]: from pint import UnitRegistry

In [3]: reg = UnitRegistry()

In [4]: coords_1 = np.array([[1, 2], [2, 1]]) * reg.mm

In [5]: coords_1
Out[5]: 
array([[1, 2],
       [2, 1]]) <Unit('millimeter')>

In [6]: coords_2 = np.array([[0, 3], [4, 0]]) * reg.um

In [7]: coords_2
Out[7]: 
array([[0, 3],
       [4, 0]]) <Unit('micrometer')>

In [8]: np.vstack([coords_1, coords_2])
Out[8]: 
array([[1.   , 2.   ],
       [2.   , 1.   ],
       [0.   , 0.003],
       [0.004, 0.   ]]) <Unit('millimeter')>

Any other stakeholders, or opinions about the implementation in the draft PR?

Does anyone else agree or disagree that these items should be considered separately:

  • Adding support for physical units
  • Disallowing concatenation of matrices in .obsm

I think a global unit registry instantiated by anndata makes it much easier to use units and potentially compare across AnnData objects, but the documentation will need to be clear about the existence and intended usage of anndata.units.

Disallowing concatenation of matrices in .obsm

After reading your thoughts above, and thinking about it a bit, I don't think this is such a big deal. Allowing concatenation is fine, because people have to opt in to using units (and, as you mention, the operation is already occasionally meaningless).

I am more curious what perhaps should happen if one array has units and the other doesn't. What's the result then? I think this should be resolved in the PR.

I am more curious what perhaps should happen if one array has units and the other doesn't. What's the result then? I think this should be resolved in the PR.

Independent of whether we even allow the operation, this problem would need to be resolved.

Thanks for the feedback @ilan-gold!

Combining arrays with and without units is not allowed -- Pint throws an exception.

This:

>>> import numpy as np
>>> from pint import UnitRegistry
>>> reg = UnitRegistry()
>>> a1 = np.array([[1, 2], [3, 4]])
>>> a2 = np.array([[0, 1], [2, 3]]) * reg.mm
>>> np.vstack([a1, a2])

results in:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<__array_function__ internals>", line 180, in vstack

... uninformative stack frames omitted ...

  File "/Users/mruffalo/.opt/anaconda3/lib/python3.11/site-packages/pint/facets/numpy/numpy_func.py", line 86, in <listcomp>
    return [convert_arg(item, pre_calc_units) for item in arg]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mruffalo/.opt/anaconda3/lib/python3.11/site-packages/pint/facets/numpy/numpy_func.py", line 93, in convert_arg
    raise DimensionalityError("dimensionless", pre_calc_units)
pint.errors.DimensionalityError: Cannot convert from 'dimensionless' to 'millimeter'

Great @mruffalo - I think you have answered most of my questions then. I have a few comments open on the PR. Another thing I'd be curious about is how this works with dask and scipy.sparse. We probably want to add an I/O test for that, and/or locally try out something like to_zarr from dask. These aren't really concerns, more just things we should be clear about to end users. I would be fine moving back over to the PR now.