MITgcm/xmitgcm

Reading multi-tile output.


Hello,
I was trying to read multi-tile output using xmitgcm, i.e. files with names such as Varname.iternum.tilenum1.tilenum2.data/meta. xmitgcm does not seem to have the capability to read this output right now; instead users need to write output in single-file format (using the flag singlecpuio=.TRUE. in the data file). It would be nice if someone could add the capability to read tiled data.
Thanks,
Dhruv

This is a great suggestion @dhruvbalwada. Also partially related to #27 raised by @braaannigan.

Any update on this? I am doing a large run and need to write tiled outputs. It would be great to have this capability!
Thanks for all the development effort!

Thanks for the ping Matt! What we are doing with llcreader right now is considerably more complicated than reading multi-tile output would be.

However, Matt, it sounds like you are instead interested in writing multi-tiled output. Can you confirm that? What is the source of the data you want to write? Are you calculating it in python? Reading from another file?

In both cases, we might be able to hack zarr to do this for us. A zarr store with no compression is equivalent to MDS files on a binary level.
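
For intuition, here is a minimal sketch of that equivalence (zarr-python v2 API assumed, file names hypothetical): an uncompressed zarr chunk is just the raw, C-ordered binary buffer, i.e. the same byte layout as an MDS .data file written as big-endian float32.

import numpy as np
import zarr  # zarr-python v2 API assumed

# Hypothetical example: write one uncompressed, big-endian chunk and compare
# it byte-for-byte with the raw numpy buffer (an MDS-style flat binary).
data = np.arange(12, dtype='>f4').reshape(3, 4)
store = zarr.DirectoryStore('example.zarr')
z = zarr.create(shape=data.shape, chunks=data.shape, dtype='>f4',
                compressor=None, store=store, overwrite=True)
z[:] = data

with open('example.zarr/0.0', 'rb') as f:   # the single chunk file
    assert f.read() == data.tobytes()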

Sorry for the vague and ambiguous language! I meant that I need to have the mitgcm write to tiled outputs. I would like xmitgcm to be able to read tiled outputs (e.g. of the format T.0000000000.???.???.data).

Ok so this is something I think I can work on in the next few weeks. What would be very helpful would be to have a test dataset with some example output I could use to debug. Ideally this dataset would be as small as possible (< 5 MB). Just the bare minimum of files needed to try it out. I would need both .data and .meta files, plus available_diagnostics.log if you're using diagnostic output.

I can easily provide these files from a tutorial.
(e.g. MITgcm/verification/global_oce_biogeo_bling)
The files will be less than 1M. How to get them to you? Email?
Thanks!

Sure, email is fine. We keep our test data here: https://figshare.com/collections/xmitgcm_test_data/4362224
After you email me, I will add them to that.

Thanks a lot for your help!

Turns out it's 3.8 MB. Find it here:
http://sose.ucsd.edu/DATA/OTHER/bin4ryan.tar

Sorry - I made it smaller; there was no need for it to be 3.8 MB.
You can redownload if you want. It has fewer diagnostics and is now 1.5 MB.

Hi all, I'd just like to +1 on this as a very useful feature if it can be implemented. Thanks!

Hi all, another ping for this feature to be added. At MITgcm we would like to push users toward xmitgcm as the de facto approach to reading output... happy to help with testing (the tutorial I'm currently working on writes tiled output). Thanks!

Hello everyone,

This feature is something I'd really like implemented, and over the next few weeks I think I should be able to set some time aside to do this. Before I jump in though, I wanted to check a few things:

  1. Is anyone currently working on this? If they are and there's anything that could do with some extra manpower, I'd be more than happy to help out.

  2. Does anyone who has worked on this in the past have any tips? I can see the issue has been open for several years -- I can't tell whether this is because it's a really hard problem or people just haven't had the time to work on it. If it's the former, I'd appreciate any advice about things that may or may not work.

  3. Do people have any strong ideas about how reading multi-tile output should work? I'm thinking that the feature is best implemented with a function that can be called in a similar way to xr.open_mfdataset and xmitgcm.open_mdsdataset.

All the best,
Fraser

@fraserwg I think this would be a fantastic thing to work on. The interface you mention in (3.) sounds reasonable. The raw tiled files together with their .meta files are quite similar in spirit to zarr/dask chunking ideas, which xarray understands. @jahn may have some python that could help with ingesting tiled output in ways that can then be mapped to zarr/dask chunking specification.

Thanks, I've made a start and want to write some tests for the code I have so far. Would it be possible to upload the following file to the test data repository as something like tiled_gyre.tar.gz?

xmitgcm_test.tar.gz

When unzipped, the directory contains output from a variation of the barotropic gyre tutorial with two levels on a tiled grid, including both grid and data files.

@fraserwg - yes I can do that later.

I was thinking a bit more too. An alternate approach might be to have a script/program that took MITgcm tiled output and turned it into a Zarr-style store. I am not sure whether that is practical, but it might be worth a quick chat/zoom?

It's great to hear that folks are interested in working on this. I would be thrilled to support the effort.

  1. Is anyone currently working on this? If they are and there's anything that could do with some extra man power, I'd be more than happy to help out.

I think I am the only one who has really volunteered in the past, and obviously I have not actually done it yet. It is unlikely I would have sufficient time any time soon. 🤦

2. Does anyone who has worked on this in the past have any tips? I can see the issue has been open for several years -- I can't tell whether this is because it's a really hard problem or people just haven't had the time to work on it. If it's the former, I'd appreciate any advice about things that may or may not work.

It's not particularly hard to simply read a tiled binary array lazily into a dask array. The hard part is dealing with all of the legacy xmitgcm code around that. Over time, we have had to introduce a lot of complexity into this package, mostly to deal with special cases related to LLC, cubed-sphere, and (the worst!) ASTE grids.

One relatively straightforward way to implement this would be to refactor the utils.read_mds function to handle the tiled output:

xmitgcm/xmitgcm/utils.py

Lines 87 to 89 in c793ecc

def read_mds(fname, iternum=None, use_mmap=None, endian='>', shape=None,
             dtype=None, use_dask=True, extra_metadata=None, chunks="3D",
             llc=False, llc_method="smallchunks", legacy=True):
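
For intuition only (this is not xmitgcm's API), a minimal sketch of what a tiled read boils down to, assuming a regular ntx x nty processor grid, known tile sizes, and a row-major tile ordering; a real implementation would take all of this from the .meta files instead:

import numpy as np
from glob import glob

def read_tiled_2d(prefix, iternum, ny_tile, nx_tile, ntx, nty, dtype='>f4'):
    """Illustrative only: stitch ntx*nty tiles of shape (ny_tile, nx_tile)
    into one global 2D array. A real reader should parse the .meta files
    rather than assume the tile layout."""
    out = np.empty((nty * ny_tile, ntx * nx_tile), dtype=dtype)
    fnames = sorted(glob('%s.%010d.*.data' % (prefix, iternum)))
    for n, fname in enumerate(fnames):
        tile = np.fromfile(fname, dtype=dtype).reshape(ny_tile, nx_tile)
        j, i = divmod(n, ntx)   # assumed ordering; .meta gives the true position
        out[j*ny_tile:(j+1)*ny_tile, i*nx_tile:(i+1)*nx_tile] = tile
    return out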

An alternate approach might be to have a script/program that took MITgcm tiled and made it into a Zarr style store.

The "virtual zarr" file approach is very appealing in terms of its simplicitly. However, having worked heavily on both Zarr and xmitgcm, I see a major roadblock here: multi-variable outputs. The existence of multi-variable outputs (e.g. a diagnostic file that has 10 different variables packed into a single tile), breaks the simple 1:1 correspondence between MITgcm mds tiles and zarr chunk files. You would need an additional layer of abstraction that deals with this. @christophernhill - is that something you and Oliver have already dealt with?

@rabernat - thanks. I'll track down @jahn. I think I can see a way to do this, but I haven't thought it through. @jahn has something working with lots of biogeochemical fields, but I'm not sure if they were originally lumped together in a single diagnostics blob.

jahn commented

I haven't done multiple variables per file with zarr. I would just treat the records/variables as an extra dimension and make them visible as variables at a higher level (as I think @rabernat suggests). That's work, of course.

I have a hacky prototype for reading lots of time records (not tiled!) via zarr and dask. It subclasses the zarr DirectoryStore to recognize non-consecutively numbered files as chunks. The routine mdsfiles creates a .zarray file and then opens the dataset using the new store. Something similar might be useful for tiled files. This version doesn't read any metadata, though. Ranges of iteration numbers and per-file shapes have to be provided. One could extract these from meta files. This was mostly for being able to read millions of files into dask in a reasonable time without having to reorganize them.
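
For readers who haven't seen the prototype, a stripped-down sketch of the general idea (zarr-python v2 API assumed; the prefix, file names, and one-time-chunk-per-file layout are all assumptions, not the prototype itself):

import os
import zarr  # zarr-python v2 API assumed

class NonConsecutiveStore(zarr.DirectoryStore):
    """Illustrative sketch: map the leading (time) index of a zarr chunk key
    like '3.0.0' onto files numbered by MITgcm iteration, e.g.
    'T.0000036000.data'. `iters` lists the iteration numbers on disk, in order."""

    def __init__(self, path, prefix, iters):
        super().__init__(path)
        self.prefix = prefix
        self.iters = iters

    def __getitem__(self, key):
        if key.startswith('.z'):            # pass zarr metadata through unchanged
            return super().__getitem__(key)
        t = int(key.split('.')[0])          # leading chunk index = time
        fname = os.path.join(self.path, '%s.%010d.data' % (self.prefix, self.iters[t]))
        try:
            with open(fname, 'rb') as f:
                return f.read()
        except FileNotFoundError:
            raise KeyError(key)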

For this to be maintainable in the long term, one would have to get any custom stores integrated into zarr.

This is the script in case it could be useful: mdszarr.zip

This is great, thanks for sharing!

For this to be maintainable in the long term, one would have to get any custom stores integrated into zarr.

There is zero chance of an MITgcm-specific store being added to zarr-python proper. But we could provide a third-party store implementation that can interoperate with Zarr and use it internally in xmitgcm.

FWIW, I feel like this is a big enough departure from the current xmitgcm implementation to warrant a brand new package, e.g. mds-zarr or similar.

jahn commented

There is zero chance of an MITgcm-specific store being added to zarr-python proper. But we could provide a third-party store implementation that can interoperate with Zarr and use it internally in xmitgcm.

It wouldn't have to be that MITgcm-specific. Flexible number formatting and >1 step size for chunk names would be sufficient. I understand that this may still be a stretch. If there's a procedure for third-party stores that would be good.

Thanks @christophernhill, @jahn and @rabernat for these suggestions. I've spent a bit of time today exploring/playing around with the source code and have started to go down the route of modifying the read_mds function to read tiled data. To me this seems to be a slightly easier route than delving into Zarr as the code that does the formatting of the Datasets is already there.

I've managed to get the read_mds function happy with reading output from an individual tile and have modified the code in mds_store to allow the _MDSDataStore class to represent a single tile at a single time step and, in turn, have managed to load this as an Xarray Dataset. From here I think it should be as easy as looping over the tiles and gluing them into one big Xarray Dataset. That said, I'm not sure how optimal this solution is in terms of speed/efficiency. Multiple time steps seem to be handled in this way; however, I'm not too sure how concatenation along a new axis compares to joining along two existing axes from a performance point of view.

From here I think it should be as easy as looping over the tiles and gluing them into one big Xarray Dataset. That said, I'm not sure how optimal this solution is in terms of speed/efficiency.

There are two options here. As you suggested, you could loop over the tiles and build up a dask dataset using dask.array.concatenate / dask.array.stack etc. A slightly faster option would be to figure out how to build the entire dask graph in one go. This is how llcreader works.

# manually build dask graph
dsk = {}
token = tokenize(varname, self.store, nfacet)
name = '-'.join([varname, token])
dtype = self._dtype(varname)
# iters == None for grid variables
def _key_and_task(n_k, these_klevels, n_iter=None, iternum=None):
    if n_iter is None:
        key = name, n_k, 0, 0, 0
    else:
        key = name, n_iter, n_k, 0, 0, 0
    task = (_get_facet_chunk, self.store, varname, iternum,
            nfacet, these_klevels, self.nx, self.nz, self.nface,
            dtype, self.mask_override, self.domain,
            self.pad_before, self.pad_after)
    return key, task
if iters is not None:
    for n_iter, iternum in enumerate(iters):
        for n_k, these_klevels in enumerate(_chunks(klevels, k_chunksize)):
            key, task = _key_and_task(n_k, these_klevels, n_iter, iternum)
            dsk[key] = task
else:
    for n_k, these_klevels in enumerate(_chunks(klevels, k_chunksize)):
        key, task = _key_and_task(n_k, these_klevels)
        dsk[key] = task
return dsa.Array(dsk, name, chunks, dtype)
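
For comparison, the first (simpler) option might look roughly like the following sketch, assuming each tile file holds a single 2D field and using hypothetical file names and sizes; dask.array.block handles the nested concatenation along both existing axes:

import numpy as np
import dask
import dask.array as dsa

def _read_tile(fname, ny, nx, dtype='>f4'):
    # hypothetical per-tile reader: one flat big-endian record per file
    return np.fromfile(fname, dtype=dtype).reshape(ny, nx)

ny, nx, ntx, nty = 31, 31, 2, 2   # assumed tile size and processor grid
tiles = [[dsa.from_delayed(
              dask.delayed(_read_tile)('T.0000000000.%03d.%03d.data' % (i + 1, j + 1), ny, nx),
              shape=(ny, nx), dtype='>f4')
          for i in range(ntx)]
         for j in range(nty)]

full = dsa.block(tiles)   # one lazy (nty*ny, ntx*nx) dask array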

I've managed to get the read_mds function happy with reading output from an individual tile and have modified the code in mds_store to allow the _MDSDataStore class to represent a single tile at a single time step and, in turn, have managed to load this as an Xarray Dataset. From here I think it should be as easy as looping over the tiles and gluing them into one big Xarray Dataset. That said, I'm not sure how optimal this solution is in terms of speed/efficiency. Multiple time steps seem to be handled in this way; however, I'm not too sure how concatenation along a new axis compares to joining along two existing axes from a performance point of view.

@fraserwg great - I think looping over tiles is fine to start. The dask-and-friends graph approach is nice, but it is not clear it is always worth the effort at this stage - in practice it can be a little unpredictable and doesn't always perform as well as one would hope; it sometimes gets in its own way buffering and juggling lots of pieces. I think if it were me I would do a deterministic concat/stack thing as a first capability. I think that alone would be a big help for many people and make the world a better place 🙏 🥰 . I would be happy to zoom, try some code etc... if useful?

@christophernhill – a zoom call at some point could be useful. I've changed my approach slightly, so now I'm concatenating the tiles earlier on, in order to reduce code duplication and increase performance.

I've just got a prototype working that doesn't seem to break the behaviour of the original code. Loading a tiled dataset should now be as simple as calling open_mdsdataset(<<your normal arguments>>, tiled=True).

The code could definitely do with some tidying and is currently untested on anything other than the example folder I sent over the other day. It currently doesn't have any tests and the documentation of various functions needs updating - this should be relatively simple to do, but I'm away a fair bit over the next few weeks so this might take a bit of time.

After I've set some tests up (or possibly whilst I'm doing that) it would be useful to see how it performs on real life examples - if anyone in this thread is interested in trying the code I'm more than happy to share it in its unpolished form (it's available as a branch on my fork of xmitgcm at the moment).

For folks following this thread, I have started experimenting with exposing mds data via a zarr adaptor. Proof of concept is here:

https://gist.github.com/rabernat/0149106f271ad384af0c9ad102196621

I definitely think we should rewrite xmitgcm using this new framework. It will be a very different package at the end of that. But none of these tools (zarr, fsspec, reference maker, etc.) existed when I started making xmitgcm.

I had a similar idea, and posted it in #337 before @cspencerjones pointed me here. The difference is that since the discussion in this thread, (1) kerchunk was made (starting in early 2021), (2) virtualizarr was made, and (3) there is now active discussion about chunk manifests becoming part of zarr upstream.

Essentially it's still the same basic idea (create a mapping from zarr chunks to chunks of data in mds files), but the more shared infrastructure for creating and managing that mapping (the "chunk manifest") the better. The suggestions here can be thought of as writing a kerchunk / virtualizarr "reader", which inspects mds data and outputs chunk manifests.
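
As a concrete illustration of what such a "reader" would emit, here is a hypothetical kerchunk-style reference set (file names, shapes, and chunk keys are all assumptions): each zarr chunk key points at a (file, offset, length) byte range inside an existing .data file, so tiled output can be presented as a single zarr array without copying any bytes.

import json

ny, nx, itemsize = 31, 31, 4          # assumed tile size and dtype width
nbytes = ny * nx * itemsize
refs = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "T/.zarray": json.dumps({
            "shape": [ny, 2 * nx], "chunks": [ny, nx], "dtype": ">f4",
            "compressor": None, "filters": None, "fill_value": None,
            "order": "C", "zarr_format": 2}),
        # two tiles side by side, each mapping to one whole .data file
        "T/0.0": ["T.0000000000.001.001.data", 0, nbytes],
        "T/0.1": ["T.0000000000.002.001.data", 0, nbytes],
    },
}
# Such a reference set can then be opened lazily through fsspec's
# "reference://" filesystem with the zarr/xarray machinery.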

The "virtual zarr" file approach is very appealing in terms of its simplicitly. However, having worked heavily on both Zarr and xmitgcm, I see a major roadblock here: multi-variable outputs. The existence of multi-variable outputs (e.g. a diagnostic file that has 10 different variables packed into a single tile), breaks the simple 1:1 correspondence between MITgcm mds tiles and zarr chunk files. You would need an additional layer of abstraction that deals with this.

I don't know that much about the mds format, but I would think the chunk manifests can deal with this, so long as individual variables are not compressed together (i.e. so long as individual variables live in different chunks, it's okay if those chunks are in the same file). See zarr-developers/VirtualiZarr#218.
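
In manifest terms, the multi-variable case sketched above would just mean different byte offsets into the same file, something like this (all names and sizes hypothetical):

import json

ny, nx, itemsize = 31, 31, 4
nbytes = ny * nx * itemsize
zarray = json.dumps({"shape": [ny, nx], "chunks": [ny, nx], "dtype": ">f4",
                     "compressor": None, "filters": None, "fill_value": None,
                     "order": "C", "zarr_format": 2})
refs = {
    ".zgroup": json.dumps({"zarr_format": 2}),
    "THETA/.zarray": zarray,
    "THETA/0.0": ["diags.0000000010.001.001.data", 0, nbytes],       # first packed record
    "SALT/.zarray": zarray,
    "SALT/0.0": ["diags.0000000010.001.001.data", nbytes, nbytes],   # second packed record
}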