pangeo-forge/staged-recipes

Proposed Recipes for NOAA GEFS Reanalysis


Source Dataset

The dataset consists of 20 years of historical NOAA GEFS data. The original GRIB2 files were converted to netCDF files subset to the CONUS extent. Each netCDF file corresponds to one day and contains all of the variables and ensemble members. There are currently 7305 files. We are only interested in data on the issue date and in comparing those data across ensemble members. We are working toward converting all of the netCDFs into Zarr for easy access to the whole dataset. The files are currently stored on our high-performance computer and will potentially be used for future projects.

  • Link to the website / online documentation for the data
  • netCDF
  • one file per day, all variables in each file, all ensembles in each file
  • The source files are located on our in-house HPC. We access the files via Secure Shell Protocol (SSH); a rough sketch of programmatic access is included after this list.
  • Files can be accessed with our individual RCA tokens.
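
As a minimal sketch (the hostname, username, and key path below are placeholders), the files could be listed over SFTP with fsspec:

import fsspec

# Placeholder host and credentials; the real HPC requires our individual RCA tokens.
fs = fsspec.filesystem(
    "sftp",
    host="hpc.example.gov",
    username="myuser",
    key_filename="/home/myuser/.ssh/id_rsa",
)
datadir = '/caldera/projects/usgs/water/iidd/datasci/data-pulls/reanalysis-gefs-canonical/4_combine_vars/out/'
flist = sorted(fs.glob(datadir + '*_*_*_*.nc'))
print(f"Found {len(flist)} files")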

Transformation / Alignment / Merging

The files should be concatenated along the time dimension. At the moment, we only need data on the issue date; however, in the future we will probably look at all of the forecast data. A sketch of the corresponding file pattern follows.
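
A minimal sketch of that concatenation with pangeo-forge-recipes, assuming a daily file naming convention based on the example filename used later in this issue (the path template and date range are assumptions):

import pandas as pd
from pangeo_forge_recipes.patterns import FilePattern, ConcatDim

datadir = '/caldera/projects/usgs/water/iidd/datasci/data-pulls/reanalysis-gefs-canonical/4_combine_vars/out/'

def make_path(time):
    # assumed naming convention, following noaa_gefs_reanalysis_CONUS_2000010100.nc
    return f"{datadir}noaa_gefs_reanalysis_CONUS_{time:%Y%m%d}00.nc"

dates = pd.date_range("2000-01-01", periods=7305, freq="D")  # 20 years of daily files
pattern = FilePattern(make_path, ConcatDim("time", keys=list(dates)))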

Output Dataset

Interested in converting the netCDFs to Zarr.
If possible, provide details on how you would like the output to be structured:

  • chunk size ≈ 567 MB (about the size of one input file)
  • chunking along time: one chunk per input file

However, open to suggestions. A quick check of the per-file size is sketched below.
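
As a quick check of the ~567 MB per-file figure (using the example file from the script further down; only the in-memory size is inspected):

import xarray as xr

# example daily file; the path is relative here for illustration
ds = xr.open_dataset('noaa_gefs_reanalysis_CONUS_2000010100.nc')
print(f"~{ds.nbytes / 1e6:.0f} MB in memory across {ds.sizes['time']} time steps")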

Below is some of the code I am working with:

import os
import glob
import xarray as xr
import zarr
from pangeo_forge_recipes.patterns import pattern_from_file_sequence
from pangeo_forge_recipes.recipes import XarrayZarrRecipe
from pangeo_forge_recipes.patterns import FilePattern, ConcatDim, MergeDim

datadir = '/caldera/projects/usgs/water/iidd/datasci/data-pulls/reanalysis-gefs-canonical/4_combine_vars/out/'

flist = glob.glob(os.path.join(datadir,'*_*_*_*.nc'))
flist.sort()
print(f"Found {len(flist)} files")

def preprocess(ds):
    # ensure time/latitude/longitude are set as coordinates
    ds = ds.assign_coords({'time': ds.time, 'latitude': ds.latitude, 'longitude': ds.longitude})
    # keep only the first (issue-date) time step
    ds = ds.sel(time=slice(ds.time[0]))
    return ds

fname = 'noaa_gefs_reanalysis_CONUS_2000010100.nc'
ds = xr.open_dataset(os.path.join(datadir,fname))

ntime = len(ds.time)       # the number of time slices
chunksize_optimal = 567e6  # desired chunk size in bytes
ncfile_size = ds.nbytes    # the netCDF file size
chunksize = max(int(ntime * chunksize_optimal / ncfile_size), 1)
print("chunksize is ", chunksize)

# copy the dimension sizes into a plain dict so the time chunking can be edited
target_chunks = dict(ds.dims)
# target_chunks['time'] = chunksize
target_chunks['time'] = 1

pattern = pattern_from_file_sequence(flist, "time")
recipe = XarrayZarrRecipe(
    file_pattern=pattern,
    target_chunks=target_chunks,
    process_chunk=preprocess
)
print(recipe)

recipe_pruned = recipe.copy_pruned()
print(recipe.file_pattern)
print(recipe_pruned.file_pattern)

# run_function = recipe_pruned.to_function()
run_function = recipe.to_function()  # takes a very long time to process; it has not been able to complete
run_function()

ds = xr.open_zarr(recipe.target_mapper)

👋 Ted, and thanks for the suggestion!

Can you explain how your in-house processed GEFS data are different from the official GEFSv12 data hosted on AWS (#17)? What would be the advantage of building a recipe from your copy vs. going straight to the source?

In general, we would prefer to pull data from the official archive, in order to keep a clear provenance chain. If there is additional processing needed in flight, that can be accomplished in the recipe.
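
For example, a subsetting step could be attached via process_input (a sketch only; the URLs and bounding box below are placeholders, not the actual GEFSv12 archive layout):

import xarray as xr
from pangeo_forge_recipes.patterns import pattern_from_file_sequence
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

def subset_conus(ds: xr.Dataset, fname: str) -> xr.Dataset:
    # illustrative CONUS bounding box; adjust bounds/ordering to the source grid
    return ds.sel(latitude=slice(24, 50), longitude=slice(-125, -66))

# placeholder URLs standing in for the official archive
urls = ["s3://example-bucket/gefs_000.nc", "s3://example-bucket/gefs_001.nc"]
pattern = pattern_from_file_sequence(urls, "time")

recipe = XarrayZarrRecipe(
    file_pattern=pattern,
    process_input=subset_conus,  # applied to each input "in flight"
)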

👋 That makes sense. As far as the data itself, our netCDFs are a smaller set of files that only include the first 10 days of the reforecast at the CONUS extent. Our thinking was that storing the netCDF files on the same file system as our ML model would reduce computation time. In addition, we are aiming to store the files in a format familiar to others for reproducibility.

If it's possible to pull processed data directly from the source in a computationally inexpensive way with the recipe, that would be great. For the long term, we were hoping to create a pipeline straight to AWS, where we would also run our models in the cloud. However, we are still testing this particular model on the HPC. I started looking into pangeo-forge-recipes because converting the 7000 netCDFs to Zarr with open_mfdataset took a significantly long time.
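
As a rough illustration of that open_mfdataset route (the output path and arguments here are placeholders, not exactly what I ran):

import glob
import xarray as xr

datadir = '/caldera/projects/usgs/water/iidd/datasci/data-pulls/reanalysis-gefs-canonical/4_combine_vars/out/'
files = sorted(glob.glob(datadir + '*_*_*_*.nc'))

# open all ~7000 daily files lazily, then write a single Zarr store
ds = xr.open_mfdataset(files, combine='by_coords', parallel=True)
ds.to_zarr('/path/to/output.zarr', mode='w')  # placeholder output path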

@rabernat Following up on preprocessing, I am looking to subset the time steps to the first day and to keep the attributes from each concatenated netCDF file. Would that be achieved with this?

def preprocess_input(ds, fname):
    ds = ds.sel(time=ds.time[0:7])                                   # keep the first day's time steps
    ds = ds.sel(latitude=slice(39, 40), longitude=slice(-76, -75))   # small spatial subset for testing
    xr.set_options(keep_attrs=True)  # note: this sets the option globally, not just for this function
    return ds

pattern = pattern_from_file_sequence(flist, "time")
recipe = XarrayZarrRecipe(
    file_pattern=pattern,
    cache_inputs=True,
    consolidate_zarr=True,
    target_chunks=target_chunks,
    xarray_concat_kwargs={'combine_attrs': 'drop_conflicts'},  # pass the option as a string, not a set
    process_input=preprocess_input,
)
print(recipe)

I am finding that only the attributes from the first netCDF are being kept.
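
For reference, a minimal standalone check of how xarray's combine_attrs='drop_conflicts' option behaves, independent of pangeo-forge-recipes (the datasets here are made up):

import xarray as xr

a = xr.Dataset({'t2m': ('time', [1.0])}, attrs={'source': 'file_a', 'units_note': 'K'})
b = xr.Dataset({'t2m': ('time', [2.0])}, attrs={'source': 'file_b', 'units_note': 'K'})

merged = xr.concat([a, b], dim='time', combine_attrs='drop_conflicts')
print(merged.attrs)  # {'units_note': 'K'} -- the conflicting 'source' attribute is dropped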