pangeo-forge/staged-recipes

Proposed Recipes for ERA5

jhamman opened this issue · 16 comments

There are currently a few subsets of the ERA5 dataset on cloud storage (example), but none are complete or updated regularly. It won't be a trivial recipe to implement with Pangeo-Forge, but it would be a good stretch goal to support such a dataset.

Source Dataset

Transformation / Alignment / Merging

Most likely, the best way to access and arrange the data is in 1-day chunks, concatenated along the time dimension. Given the large user pool for this dataset, I would suggest this recipe do as little data processing as possible.

Output Dataset

One (or more?) Zarr stores. Hourly data for all available variables, all pressure levels, etc.
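
For concreteness, here is a minimal sketch of how that arrangement might be expressed with pangeo-forge-recipes, assuming (purely hypothetically) that one NetCDF file per day were retrievable from a predictable URL. As discussed below, the real sources (CDS API / MARS) do not expose files that way, so the URL scheme and the `XarrayZarrRecipe` usage here are illustrative only:

```python
# Sketch only: daily files concatenated along "time", written to one Zarr
# store with 1-day (24-hour) chunks. The URL scheme is hypothetical.
import pandas as pd
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

dates = pd.date_range("1950-01-01", "2021-12-31", freq="D")

def make_url(time):
    # Hypothetical per-day file naming; a real ERA5 recipe would need a
    # custom opener that talks to the CDS API or MARS instead.
    return f"https://example.com/era5/{time:%Y/%m/%d}.nc"

pattern = FilePattern(make_url, ConcatDim("time", dates, nitems_per_file=24))
recipe = XarrayZarrRecipe(pattern, target_chunks={"time": 24})
```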

I was just talking about this with @spencerahill. We also had a meeting with ECMWF about this last spring.

It is a really big job.

Yes. Context: a 3-yr NSF CLD grant hopefully starting within the next month or two, 6 mo/yr of my time, and hopefully a graduate student next year. We're doing wavenumber-frequency spectral analysis of energy transports in low latitudes using ERA5. So that requires 6-hourly or higher resolution, up to a dozen or so vertically defined variables. Many TBs.

My default plan was to just use the CDS API to download it to our local cluster, but talking w/ Ryan, it sounds like this could plug into Pangeo efforts nicely! Which would be fun for me, having been mostly watching from the sidelines for a few years now :)

It is a really big job.

It is! But this is the crew to do it!

We're in a similar position to you, @spencerahill. We are currently using the AWS public dataset, but we're about to outgrow the offerings there. We could pull data from the CDS API to our own cloud bucket, but that runs counter to the mission here.

We also had a meeting with ECMWF about this last spring.

@rabernat - Any takeaways you can share?

Here are some notes from Baudouin Raoult after our meeting last April

I would still like another virtual meeting with you, during which we present to you our data handling system, because I don't think we have yet a common understanding of the work to be done, and discussing it in writing would be very inefficient. Below is a short list of things that come to mind immediately:

  • The MARS system does not handle files, but provides direct access to 2D fields using a hypercube-based index (https://confluence.ecmwf.int/download/attachments/45754389/server_architecture.pdf?api=v2, a paper from 1996, before “data cubes” were popular 😊). The whole archive can be seen as a single 12-dimensional dataset of 300 PB / 3e+12 fields. A 2D field is the smallest accessible object (think chunk), and users can access the archive along any dimension (dates, parameters, levels, …). A lot of the data is on tape, so one cannot just scan a directory full of files. Any recipe will have to access MARS.
  • Most of the upper-air fields are in spherical harmonics, and the few upper-air grid-point fields are on a reduced Gaussian grid. All surface fields are also on the reduced Gaussian grid. These are not supported by CF/Xarray. The data needs to be interpolated to a regular lat/lon grid. This requires a non-trivial amount of time and resources.
  • We still have to address the issue of the two time dimensions of forecast fields (although this is not so bad with ERA5). The same issue applies to all accumulations, fluxes and other maxima (e.g. the ‘cell-methods’). That discussion has been going on for a couple of decades (see https://www.ecmwf.int/sites/default/files/elibrary/2014/13706-grib-netcdf-setting-scene.pdf, some slides I did in 2011 that illustrate most of the points I am making here).
  • We need to decide how to organise the dataset, since not all variables have the same time frequency (although I think this may be OK with ERA5), and how much metadata we are ready to provide.
  • The full ERA5 is around 5 PB using 16 bits per value. The size may explode if we are not careful with the packing/compression of the target. There will be a lot of trial and error at the beginning, and this is something we cannot afford much of, considering the volumes involved.

Having done similar exercises in the past (such as delivering previous reanalyses to NCAR and other institutes), I think the “recipe” will have to run for many months to extract, interpolate, reformat, transfer and store the data. This needs to be done in parallel, with some checkpoint/restart facility, as one cannot run a single process for months (planned system sessions, network issues, disk issues, etc.). Furthermore, the data transfers will certainly be interrupted when our data centre moves from the UK to Italy later this year.

So before we start writing any code, I would like us to have a clear definition of the scope of the project. We also need to decide who is going to be part of that activity and where the heavy lifting will happen.

That was sufficiently intimidating that we tabled the discussion for the time being. Now that Pangeo Forge is farther along, and we have people who are interested in working on it, I think we can pick it up again.

Exciting! A few questions/comments:

  • Do I understand correctly that the proposed temporal extent for this recipe is 1950 - present, plus regular appending? If so, note that appending remains an open (but presumably resolvable) issue: pangeo-forge/pangeo-forge-recipes#81
  • I assume there is no way around moving ~26,240 (Jan 1, 1950 to present) daily NetCDFs to a cloud cache? That is, the CDS API probably won't work with pangeo-forge/pangeo-forge-recipes#218?
  • Assuming we do need to cache the files, note that the CDS API appears to diverge from the Pangeo Forge FilePattern assumption that datasets will be accessible via a human-readable URL. Instead, most of the identifying information about the requested dataset is encoded in a dictionary (e.g., here) which becomes a JSON POST request (👈 here, request is the user-defined dictionary); see the cdsapi sketch after this list.
  • I wonder if this API design pattern (dataset identifiers in JSON, rather than URL) is common enough to consider supporting it in pangeo-forge-recipes, or if it's preferable just to handle caching with some one-off scripts outside of Pangeo Forge, and then write the recipe to point to the cloud cache directly?
  • If we do attempt file transfer via Pangeo Forge, we'll also have to consider if/how fsspec can be used as a wrapper for CDS API requests.
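
To make the "dictionary rather than URL" point concrete, here is roughly what a single day's request looks like through the official cdsapi client (the dataset name and keys follow the public CDS examples; the particular variable, level, and date are just placeholders):

```python
# Illustrative cdsapi request for one day of hourly ERA5 pressure-level data.
# The request is a plain dictionary sent as a JSON POST, not a file URL,
# which is what makes it awkward to express as a standard FilePattern.
import cdsapi

c = cdsapi.Client()
c.retrieve(
    "reanalysis-era5-pressure-levels",
    {
        "product_type": "reanalysis",
        "format": "netcdf",
        "variable": "temperature",
        "pressure_level": "850",
        "year": "1979",
        "month": "01",
        "day": "01",
        "time": [f"{h:02d}:00" for h in range(24)],
    },
    "era5_t850_19790101.nc",
)
```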

To clarify, what Baudouin was proposing was that we go around the CDS API and talk directly to MARS, their internal archival system.

As a heads up, there is a full copy of the Copernicus 0.25 degree version of the ERA-5 Reanalysis dataset maintained at NCAR under the following dataset collections, which are preserved as time-series NetCDF-4 files:
https://doi.org/10.5065/YBW7-YG52
https://doi.org/10.5065/BH6N-5N20

These may be easier to access and stage to public cloud resources unless you need the raw spherical harmonics and reduced Gaussian grids at model-level resolution, which are only available through ECMWF MARS. You can also access the NCAR-maintained datasets by direct read from NCAR HPC systems as an NSF-funded researcher. https://www2.cisl.ucar.edu/user-support/allocations/university-allocations

Cross-reference: #22

Question (which may reveal just how little I've worked with and understand the cloud): would it be useful to this effort to have some nontrivial chunk of the ERA5 data downloaded to the cluster at Columbia (berimbau) we're using for our project, to subsequently be uploaded to the cloud? My big concern with tying my project's science tightly to this Pangeo-ERA5 effort is the science potentially getting held up, maybe in a big way, if there end up being unforeseen delays in getting the data onto the cloud. Whereas I already have a functional pipeline for downloading the ERA5 data I'll need directly to that cluster via the CDS API, as well as the computational power I'll need, at least for the preliminary analyses.

So, in this scenario, I'd start downloading the data I need basically right away to our cluster, and then once on the pangeo side things are ready I could upload from our cluster to the cloud. The upside for pangeo of this direct transfer from us would be no waiting on the CDS system queue etc.

Thoughts?

would it be useful to this effort to have some nontrivial chunk of the ERA5 data downloaded to the cluster at Columbia (berimbau) we're using for our project, to subsequently be uploaded to the cloud?

Short answer: no, it would not be particularly useful for you to manually download data and store it on your cluster. That is essentially the status quo that we are trying to escape with Pangeo Forge. The problems with that workflow are:

  • it is probably hard to reproduce because it involves manual steps
  • it may be hard to update, meaning that the data will eventually go stale

The goal with Pangeo Forge is to develop fully automated, reproducible pipelines for building datasets.

However, I recognize the tension here: you want to move forward with your science, and you need ERA5 data to do so. You can't wait a year for us to sort out all of these issues.

Here is an intermediate plan that I think might balance the two objectives. You should write a Pangeo Forge recipe to build the dataset you need on berimbau. The process of writing this recipe will be very useful for the broader effort. Note that this won't be possible until pangeo-forge/pangeo-forge-recipes#245 is done, since that will be required to get data from the CDS API.

@alxmrs has also been working on an "ERA5 downloader" tool which could be very useful here. Alex, is that released yet?

So a possible order of operations would be:

  • Spencer gets some basic familiarity with Pangeo Forge recipes by going through the docs / tutorials (feedback welcome)
  • In the meantime, Ryan finishes pangeo-forge/pangeo-forge-recipes#245
  • Alex releases his ERA5 downloader tool
  • Spencer starts creating the recipe with a custom opener for downloading ERA5 (a rough sketch follows this list)
  • Spencer runs this recipe on berimbau to generate the data he needs
  • We use this experience to inform the broader, more generic effort to bring ERA5 into the cloud
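
For reference, a custom opener along those lines might look something like the hypothetical sketch below. The function name and the way it would hook into a recipe are placeholders, since the interface being added in pangeo-forge/pangeo-forge-recipes#245 was still being worked out:

```python
# Hypothetical custom opener: fetch one day of ERA5 from the CDS API into a
# temporary NetCDF file and return it as an xarray Dataset. How (or whether)
# such a function plugs into a recipe depends on how pangeo-forge-recipes#245
# lands; the dataset and variable chosen here are illustrative.
import tempfile

import cdsapi
import xarray as xr

def open_era5_day(date) -> xr.Dataset:
    request = {
        "product_type": "reanalysis",
        "format": "netcdf",
        "variable": "2m_temperature",
        "year": f"{date:%Y}",
        "month": f"{date:%m}",
        "day": f"{date:%d}",
        "time": [f"{h:02d}:00" for h in range(24)],
    }
    tmp = tempfile.NamedTemporaryFile(suffix=".nc", delete=False)
    tmp.close()
    cdsapi.Client().retrieve("reanalysis-era5-single-levels", request, tmp.name)
    return xr.open_dataset(tmp.name)
```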

However, I recognize the tension here: you want to move forward with your science, and you need ERA5 data to do so. You can't wait a year for us to sort out all of these issues.

Exactly.

Thanks @rabernat. That all makes sense. I'm subscribed to the relevant pangeo repos now and in particular will keep an eye on pangeo-forge/pangeo-forge-recipes#245.

And going through the docs/tutorials sounds like a great task when I'm procrastinating on a review / revision / etc. in the near-ish term future.

@spencerahill, once you've started on your recipe, please feel free to @ me in a comment here with any questions. The documentation is far from exhaustive, so don't be discouraged if there's something that doesn't make sense. I'll make sure any questions you have get answered, and we can use any insights we gain to improve the official docs.

Excellent, thanks! Also, IIRC you are at least sometimes in-person at Lamont(?) If so, it would be fun to meet + chat in person too.

@alxmrs has also been working on an "ERA5 downloader" tool which could be very useful here. Alex, is that released yet?

Hey Ryan! The release is in progress – I have just submitted the codebase for internal approval. Not sure about the ETA, since we are in late December. Usually, this last part of the process takes about a week.

As soon as it's public, I will post about it here.

at least sometimes in-person at Lamont(?)

Sadly I'm rarely there as I work out of a home office in California. Even if you sail through the recipe development process without any issues, I'd love to set aside some time to catch up over video either way. 😊

I'm happy to announce that the aforementioned tools to help download ERA5 data are public, as of today! Please check out weather-dl at https://github.com/google/weather-tools.

@spencerahill @jhamman I'm happy to answer any questions you have along the way.

CC: @rabernat