TGSAI/mdio-python

Add printable representation for MDIOReader and MDIOWriter

srib opened this issue · 10 comments

srib commented

Currently, printing an MDIOReader or MDIOWriter just shows the default Python object representation. It would be useful to have a nice printable representation.

# The report helpers referenced below come from zarr's utilities
# (zarr.util), where this InfoReporter pattern originates.
from zarr.util import info_html_report, info_text_report


class InfoReporter:
    """Turn an object's `info_items()` into text and HTML reports."""

    def __init__(self, obj):
        self.obj = obj

    def __repr__(self):
        items = self.obj.info_items()
        return info_text_report(items)

    def _repr_html_(self):
        items = self.obj.info_items()
        return info_html_report(items)

The `InfoReporter` pattern from here is a good model to follow.
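To show how this could plug into the reader, here is a self-contained sketch. `MDIOReaderSketch`, its `info_items()` contents, and the simple report helpers are all invented stand-ins (zarr's real helpers are more elaborate):

```python
class InfoReporter:
    """Wraps an object exposing `info_items()` so that printing it
    yields a formatted report instead of the default object repr."""

    def __init__(self, obj):
        self.obj = obj

    def __repr__(self):
        return info_text_report(self.obj.info_items())

    def _repr_html_(self):
        return info_html_report(self.obj.info_items())


def info_text_report(items):
    # Plain-text two-column report; a stand-in for zarr's helper.
    width = max(len(key) for key, _ in items)
    return "\n".join(f"{key:<{width}} : {value}" for key, value in items)


def info_html_report(items):
    # HTML table used by Jupyter via `_repr_html_`.
    rows = "".join(f"<tr><th>{k}</th><td>{v}</td></tr>" for k, v in items)
    return f"<table>{rows}</table>"


class MDIOReaderSketch:
    # Hypothetical stand-in for MDIOReader; only shows the wiring.
    def info_items(self):
        return [("mdio_version", "0.2"), ("shape", "(100, 200, 300)")]

    @property
    def info(self):
        return InfoReporter(self)
```

With this, `print(MDIOReaderSketch().info)` produces the text report in a terminal, and Jupyter automatically picks up the HTML version.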

Good idea.

pydata/xarray#1627

Another excellent example from Xarray. We could also consider using Xarray as a backend to reduce duplicate effort.

Xarray dev here! I just discovered this very cool project via twitter.

Would love to help you integrate Xarray into the library. I had a quick tour through the docs, and indeed it seems like Xarray could help reduce some boilerplate while bringing lots of features that would help your package. (It layers great on top of Zarr and Dask.) For an example of a domain-specific package built on top of these packages, check out sgkit.

Also tagging @TomNicholas, another Xarray dev with a big interest in energy. Let us know how we can help!

srib commented

@rabernat 👋🏽!

Thank you for your interest and your generous offer to help us out. Will browse through sgkit as you suggested.

@tasansal

Hi @rabernat 👋

Great to see you here! Big fan of your work.

We should collaborate! We are planning to form a steering committee pretty soon and would love to have you and more Xarray developers on board.

Our main intention is to have an energy domain-specific library with some features similar to Xarray and some extra domain-specific features.

We have done a lot of heavy lifting incorporating exploration seismology data, and have some more implementations for wind resource data (to be integrated later).

That sounds like a great vision! Our goal in Xarray is to be a generic container for multi-dimensional labeled arrays with metadata. We'd love it if you could rely on Xarray as a base data container. We have a section in our docs on extending Xarray which explains how a third-party package can add custom functionality to Xarray objects. We also have entry points that let you implement your own backend for custom file formats. (Note that Xarray already has very strong support for Zarr I/O.)
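For concreteness, a minimal backend sketch along the lines of the xarray docs' example might look like this; the `MDIOBackendSketch` name and the fabricated in-memory dataset are purely illustrative, not MDIO's actual format:

```python
import numpy as np
import xarray as xr
from xarray.backends import BackendEntrypoint


class MDIOBackendSketch(BackendEntrypoint):
    """Hypothetical xarray backend for MDIO files.

    A real implementation would open the underlying Zarr store; here we
    fabricate a tiny dataset so the wiring is visible end to end.
    """

    def open_dataset(self, filename_or_obj, *, drop_variables=None):
        ds = xr.Dataset(
            {"amplitude": (("inline", "crossline"), np.zeros((2, 3)))},
            coords={"inline": [10, 11], "crossline": [100, 101, 102]},
        )
        if drop_variables:
            ds = ds.drop_vars(drop_variables)
        return ds

    def guess_can_open(self, filename_or_obj):
        return str(filename_or_obj).endswith(".mdio")


# The class itself can be passed as `engine`; a packaged backend would
# instead be registered under the "xarray.backends" entry-point group.
ds = xr.open_dataset("survey.mdio", engine=MDIOBackendSketch)
```

Once registered via the entry-point group, users would simply write `xr.open_dataset("survey.mdio")` and xarray would pick the backend via `guess_can_open`.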

If you feel like there are features missing from Xarray that are holding you back from adopting it internally, we would love to hear about it on our issue tracker.

@rabernat I am starting to look into the Xarray backend integration.

In our exploration seismology data case, we have groups of rich information (arrays, metadata, etc.) that are related to one another. Still, we want to keep them separate for various reasons:

  • We want to have GIS / CRS and coordinate data separately.
  • We want to have auxiliary variables separate from actual array data.
  • We want them to be able to share dimensions and coordinates.
  • Seismic data can have additional user interpretation created later, which we would want to keep in another group.

One of the first challenges for using Xarray as a backend is that Xarray can only work with multiple groups at the cost of data duplication. I know we can write datasets into different groups of a Zarr / NetCDF store; if I remember correctly, each group will have a copy of the coordinates, dimensions, etc., associated with the group data.

Is there a workaround to this? Would you be able to suggest a better alternative to our thought process?
Given that Zarr v3 will separate performance-critical metadata from dataset groups, maybe it won't be a big problem for us anymore, but that is quite far from being mainstream.

You may want to look into Xarray Datatree - https://xarray-datatree.readthedocs.io/ - a new package created by @TomNicholas. Soon this will become part of Xarray proper (see pydata/xarray#7418).

@tasansal I would love to help you get going with xarray! It sounds like datatree could fit some of your needs too.

One of the first challenges for using Xarray as a backend is that Xarray can only work with multiple groups at the cost of data duplication. I know we can write datasets into different groups of a Zarr / NetCDF store; if I remember correctly, each group will have a copy of the coordinates, dimensions, etc., associated with the group data.

Datatree empowers you to work with many groups at once. However at the moment you might still need to duplicate things across groups. One long-term solution to this might be to implement symbolic nodes in datatree, but I expect that using xarray and datatree would streamline your code a lot even without that feature.

Xarray and datatree work well with zarr already, so that should work nicely for you.

Hey @rabernat and @TomNicholas

I am taking a stab at making our backend Xarray. I took a look at extending Xarray and sgkit libraries.

My understanding is:

  1. Do not inherit from DataArray and Dataset unless you want to re-implement the whole API.
  2. Use the custom dataset accessors to add more properties etc.
  3. Be like sgkit.

Here are some of the things we want to have on top of regular Xarray functionality:

  • a. Implement domain-specific repr and html_repr (adding more information to default reprs)
  • b. Add more required but hidden metadata (like _ARRAY_DIMENSIONS) that won't show in repr, but used for internal representation of the data (maybe doing "a" first will handle this if it hides anything with the _ prefix).
  • c. Have metadata conventions similar to ZEP0004
  • d. Utilize lower-level Zarr machinery like Zarr locks, fsspec caching, etc.
  • e. Support Zarr v3.
  • f. Have custom methods to access specific parts of the dataset.
  • g. Have hidden variables that will have a suffix on disk, but it will be used to mask/unmask data. Similar to numpy's masked arrays, where you keep a bool mask with the array data but it is transparent to the user.

Given the above assumptions and requirements, what approach would work best for us? I am leaning towards option 1, unfortunately, since it gives the ultimate flexibility. But if these can all be done with option 2, that would be better!

I also noticed that documenting an API built with option 2 is a bit hacky, requiring a special Sphinx extension. We were planning to move to MkDocs, which may cause a problem. Any thoughts on what to do here?

To give you an idea of our roadmap: we are adding these features to MDIO.

  • Strict data models with version control.
  • Schematized dataset creation from JSON using Pydantic.
  • Separate the energy domain (oil & gas, wind, solar) functionality as plugins to MDIO and make the core more lightweight.
  • Domain-specific out of the box schemas for Seismic, Wind, and more in the future.
  • Somehow try to keep everything backwards compatible :-)
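The "schematized dataset creation from JSON" bullet could look roughly like this with Pydantic; these field names are invented for illustration and are not MDIO's eventual schema:

```python
from typing import Dict, List

from pydantic import BaseModel


class DimensionModel(BaseModel):
    name: str
    size: int


class VariableModel(BaseModel):
    name: str
    dimensions: List[str]
    dtype: str


class DatasetModel(BaseModel):
    """Hypothetical versioned dataset schema, validated from JSON."""

    schema_version: str
    dimensions: List[DimensionModel]
    variables: List[VariableModel]
    attributes: Dict[str, str] = {}


# Nested dicts (e.g. parsed from a JSON document) are coerced and
# validated into the typed models on construction.
spec = DatasetModel(
    schema_version="1.0",
    dimensions=[{"name": "inline", "size": 100}],
    variables=[{"name": "amplitude", "dimensions": ["inline"], "dtype": "float32"}],
)
```

Pinning `schema_version` inside the model is one way to keep older on-disk datasets loadable as the schema evolves.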

Thanks!

Hi @tasansal! This all sounds very ambitious and exciting!

My understanding is:

Yep pretty much! Also check out this new page on interoperability in xarray I wrote. (It's from a PR but should be released as part of the main docs soon).

Going through your list of features one-by-one:

a. Implement domain-specific repr and html_repr (adding more information to default reprs)

I'm actually not sure what the best way to fully overwrite the repr without subclassing would be. Monkey-patching xarray classes on import seems a bit hacky but might be enough...
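For the record, the monkey-patching route could look like the sketch below; it also covers the `_`-prefix hiding from point (b). How robust this is across xarray versions is an open question:

```python
import numpy as np
import xarray as xr

_original_repr = xr.Dataset.__repr__


def _mdio_repr(self):
    # Prepend a domain-specific banner, then fall back to xarray's own
    # repr with "_"-prefixed attributes hidden from view. The hidden
    # attributes remain in `self.attrs`; only the display changes.
    visible = {k: v for k, v in self.attrs.items() if not str(k).startswith("_")}
    trimmed = self.copy()
    trimmed.attrs = visible
    return "<mdio.Dataset>\n" + _original_repr(trimmed)


xr.Dataset.__repr__ = _mdio_repr

ds = xr.Dataset(attrs={"survey": "demo", "_ARRAY_DIMENSIONS": "hidden"})
```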

b. Add more required but hidden metadata (like _ARRAY_DIMENSIONS) that won't show in repr, but used for internal representation of the data (maybe doing "a" first will handle this if it hides anything with the _ prefix).

Adding additional but hidden information would be a major change to the xarray data model - just storing the information normally but hiding it via the repr would be a lot easier if (a) is solved.

c. Have metadata conventions similar to ZEP0004

This seems fairly decoupleable from the other ideas, but for inspiration you should look at cf-xarray (which interprets CF conventions) and xarray-dataclasses.

d. Utilize lower-level Zarr machinery like Zarr locks, fsspec caching, etc.

Anything like this would likely be of interest to the wider xarray / Zarr community, so could be implemented as improvements to xarray's Zarr backend, for example.

e. Support Zarr v3.

This is in-progress for xarray, with most of the effort currently focused on making zarr-python support v3.

f. Have custom methods to access specific parts of the dataset.

This is easy using a custom accessor.
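For example (the `mdio` accessor name and the `mdio_auxiliary` attribute convention are made up here):

```python
import numpy as np
import xarray as xr


@xr.register_dataset_accessor("mdio")
class MDIOAccessor:
    """Hypothetical accessor exposing MDIO-specific views of a dataset."""

    def __init__(self, ds):
        self._ds = ds

    @property
    def data_variables(self):
        # e.g. everything not flagged as auxiliary via an attribute
        return [
            name
            for name, var in self._ds.data_vars.items()
            if not var.attrs.get("mdio_auxiliary", False)
        ]


ds = xr.Dataset(
    {
        "amplitude": ("inline", np.zeros(3)),
        "fold": ("inline", np.ones(3), {"mdio_auxiliary": True}),
    }
)
```

After registration, every `Dataset` gains a `ds.mdio` namespace without subclassing.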

g. Have hidden variables that will have a suffix on disk, but it will be used to mask/unmask data. Similar to numpy's masked arrays, where you keep a bool mask with the array data but it is transparent to the user.

I'm not sure I totally understand this one, but (1) if the mask is data-dependent, I feel it should still be explicitly listed instead of hidden, and (2) operations which require the mask could be re-implemented on an accessor. Still, this might be a decent reason to subclass.
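A sketch of how the masking could live behind an accessor rather than a subclass (the `_mask` suffix convention and the accessor name are invented; the dataset is passed in explicitly because a DataArray does not know its parent dataset):

```python
import numpy as np
import xarray as xr


@xr.register_dataarray_accessor("masked")
class MaskedAccessor:
    """Hypothetical accessor pairing a variable with an on-disk mask.

    Assumes the parent dataset stores a boolean companion variable named
    "<var>_mask" (True = valid), mirroring numpy's masked arrays.
    """

    def __init__(self, da):
        self._da = da

    def apply(self, ds):
        mask = ds[self._da.name + "_mask"]
        return self._da.where(mask)  # invalid samples become NaN


ds = xr.Dataset(
    {
        "amplitude": ("inline", np.array([1.0, 2.0, 3.0])),
        "amplitude_mask": ("inline", np.array([True, False, True])),
    }
)
masked = ds["amplitude"].masked.apply(ds)
```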

Strict data models

Again take a look at xarray-dataclasses.

Given the above assumptions and requirements, what approach would work best for us?

I am leaning towards option 1, unfortunately, since it gives the ultimate flexibility.

I think it sounds plausible to solve all of this without subclassing (but I don't fully understand all the requirements, so please don't take that as gospel!). If you do decide you want to subclass then it would be amazing if you could help us out with a few upstream contributions to make the subclassing easier. See pydata/xarray#3980

I also noticed that documenting an API built with option 2 is a bit hacky, requiring a special Sphinx extension. We were planning to move to MkDocs, which may cause a problem. Any thoughts on what to do here?

We also have plans to move our documentation to Markdown (using MyST), so we would also be interested in any solution to this.

Does that help?