perrette/webglacier1d

Make greenland_data a separate, cleaner repository

Closed this issue · 10 comments

Currently the code to load Greenland data lives under outletglacierapp/models/. It should be moved into a separate repository. Moreover, it currently contains too many ad-hoc checks (e.g. if dataset == "morlighem" do ...), and should be made more generic to ease the addition of new datasets. An object-oriented design could be a neat solution.

For example, each Greenland dataset could be assigned its own Python module, with the following attributes (module-wide variables):

  • name : name of the dataset, identifying it uniquely, for example "rignot_mouginot_2012"
  • variables : list of 2-D variables available in the dataset, following some convention (e.g. "surface_elevation", "bedrock_elevation", "surface_velocity", "ice_thickness", "surface_elevation_rate"), including errors ("surface_elevation_error") and so on
  • grid_mapping : dictionary of grid mapping parameters following CF-1.6

And the following functions:

  • load_bbox(bbox=None, variables=None, maxshape=None) : load 2-D data given a bounding box bbox (left, right, bottom, top) in meters or degrees, in the dataset's own coordinate system, resampling the output (every 2nd, 3rd... point) so that the returned shape remains below maxshape, if provided. Returns a 2-D dimarray.Dataset with the corresponding variables, dimensions 'y', 'x', and 'grid_mapping' as an attribute. If variables is None, all variables present in the dataset are returned.
  • load_path(path, variables=None) : load data along the path [(x0,y0), (x1,y1), ...], in the dataset's own reference system. Returns a 1-D dimarray.Dataset with the corresponding variables and a single dimension 's'. This offers an optimized alternative to load_bbox when only data along a path is needed.

Then make a separate module common.py that provides helper functions to implement these methods in common cases (e.g. data in a netCDF file following CF-1.6 conventions, with standard variable names) or in unoptimized cases (a decorator so that the returned dataset is resampled after being loaded, a generic load_path that simply extracts the path from the result of load_bbox). Finally, there could be yet another module, or simply __init__.py, that imports all datasets and defines load_bbox and load_path automatically from these simple assumptions whenever they are not already defined in a dataset's own module. The common.py module could also provide functions for the user to transform a set of coordinates, or a bounding box, from one grid_mapping to another (see the sketch below).
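For instance, transforming a bounding box only requires projecting its corners. Here is a minimal sketch, assuming pyproj is available and handling only the polar_stereographic case; _to_proj is a hypothetical helper, and transform_bbox matches the icedata.tools function used in the example at the end of this issue:

import numpy as np
import pyproj

def _to_proj(grid_mapping):
    # very partial CF -> proj.4 translation, polar_stereographic only (illustrative)
    assert grid_mapping['grid_mapping_name'] == 'polar_stereographic'
    return pyproj.Proj(proj='stere',
                       lat_0=grid_mapping['latitude_of_projection_origin'],
                       lat_ts=grid_mapping['standard_parallel'],
                       lon_0=grid_mapping['straight_vertical_longitude_from_pole'],
                       x_0=grid_mapping['false_easting'],
                       y_0=grid_mapping['false_northing'],
                       ellps=grid_mapping.get('ellipsoid', 'WGS84'))

def transform_bbox(bbox, from_grid_mapping, to_grid_mapping):
    # project the four corners and take the enclosing box, since a rectangle
    # is generally no longer a rectangle in the target projection
    l, r, b, t = bbox
    xs, ys = pyproj.transform(_to_proj(from_grid_mapping), _to_proj(to_grid_mapping),
                              np.array([l, l, r, r]), np.array([b, t, b, t]))
    return [min(xs), max(xs), min(ys), max(ys)]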

As a second step, there could also be additional, handy functions to download and install the datasets, check data availability, etc. But this is not strictly necessary. At the very least, some instructions on how to fetch the data would certainly be relevant (a minimal sketch follows the list below).

  • download(datadir) : download the dataset under some default data root directory?
  • check(datadir) : check that everything is in place
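A minimal sketch of what these two helpers could look like, using only the standard library (Python 3); DATASET_URL is a hypothetical, dataset-specific download location:

import os
import urllib.request

DATASET_URL = "http://example.org/Greenland_5km_v1.1.nc"  # hypothetical URL

def download(datadir):
    # download the dataset file under datadir, creating the directory if needed
    if not os.path.exists(datadir):
        os.makedirs(datadir)
    target = os.path.join(datadir, os.path.basename(DATASET_URL))
    if not os.path.exists(target):
        urllib.request.urlretrieve(DATASET_URL, target)
    return target

def check(datadir):
    # minimal check: the expected file exists and is not empty
    target = os.path.join(datadir, os.path.basename(DATASET_URL))
    return os.path.exists(target) and os.path.getsize(target) > 0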

The functions above could also take an optional time argument, set to a specific year by default (e.g. present-day), to accommodate time-varying 2-D fields.

Additionally, a function load_timeseries(xy=None, times=None, variables=None) would be needed to load a timeseries at a specific location (again, in the dataset's own reference system).
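A minimal sketch of such a function, assuming a CF-like netCDF file with 1-D 'time', 'y1' and 'x1' coordinates, the _NCFILE / _namelookup / VARIABLES conventions of the example module below, and simple nearest-neighbour selection:

import numpy as np
import netCDF4 as nc

def load_timeseries(xy=None, times=None, variables=None):
    if variables is None:
        variables = VARIABLES
    x0, y0 = xy  # in the dataset's own reference system
    with nc.Dataset(_NCFILE) as ds:
        i = np.argmin(np.abs(ds.variables['x1'][:] - x0))  # nearest grid point
        j = np.argmin(np.abs(ds.variables['y1'][:] - y0))
        time = ds.variables['time'][:]
        # assumes (time, y, x) ordering of the time-varying 2-D variables
        data = {v: ds.variables[_namelookup[v]][:, j, i] for v in variables}
    # if times is provided, the series could be subset or interpolated here
    return time, data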

It would look like this:

presentday_greenland.py

""" Present-day Greenland Dataset 

Some more info, including source, authors, how to get the data.
"""
from os.path import join
import numpy as np
import netCDF4 as nc
import dimarray as da
from .settings import DATAROOT
from .common import get_slices_xy

NAME = "presentday_greenland"
_NCFILE = join(DATAROOT, "Present_Day_Greenland","Greenland_5km_v1.1.nc")
VARIABLES = ["surface_elevation", "bedrock_elevation", "surface_velocity", "ice_thickness", "surface_elevation_rate"]
GRID_MAPPING = {'ellipsoid': u'WGS84',
     'false_easting': 0.0,
     'false_northing': 0.0,
     'grid_mapping_name': u'polar_stereographic',
     'latitude_of_projection_origin': 90.0,
     'standard_parallel': 71.0,
     'straight_vertical_longitude_from_pole': -39.0}  # in that case, one could just read it from netCDF

_namelookup = {"surface_elevation":"usrf", "bedrock_elevation":"topg", "ice_thickness":"thk", "surface_velocity":"surfvelmag","surface_elevation_rate":"dhdt"}

def load_bbox(bbox=None, variables=None, maxshape=None):
    # determine the variables to load
    if variables is None:
        variables = VARIABLES
    ncvariables = [_namelookup[nm] for nm in variables]

    # open the netCDF dataset
    nc_ds = nc.Dataset(_NCFILE)

    # determine the indices to extract
    x = nc_ds.variables['x1']
    y = nc_ds.variables['y1']
    slice_x, slice_y = get_slices_xy(xy=(x, y), bbox=bbox, maxshape=maxshape)

    # load the data using dimarray (which also copies attributes etc.)
    data = da.read_nc(nc_ds, ncvariables, indices={'x1': slice_x, 'y1': slice_y, 'time': 0}, indexing='position')

    # close dataset
    nc_ds.close()

    # rename dimensions appropriately and set metadata
    data.rename_keys({ncvar: var for var, ncvar in zip(variables, ncvariables)}, inplace=True)
    data.dims = ('y', 'x')
    data.grid_mapping = GRID_MAPPING  # if not already present in the netCDF
    data.dataset = NAME

    return data

settings.py

from os import environ
from os.path import join
DATAROOT = join(environ['HOME'], 'data', 'greenland')

common.py

import numpy as np

def get_slices_xy(xy, bbox, maxshape):
    x, y = xy
    # determine start and stop indices from the bounding box
    if bbox is not None:
        l, r, b, t = bbox  # in meters
        xstart, xstop = np.searchsorted(x[:], [l, r])
        ystart, ystop = np.searchsorted(y[:], [b, t])
    else:
        xstart, xstop = 0, x.size - 1
        ystart, ystop = 0, y.size - 1
    nx = xstop - xstart + 1
    ny = ystop - ystart + 1
    # sub-sample the dataset if it exceeds the maximum desired shape
    if maxshape is not None:
        shapey, shapex = maxshape
        stepy = max(1, ny // shapey)
        stepx = max(1, nx // shapex)
    else:
        stepy = stepx = None
    slice_x = slice(xstart, xstop + 1, stepx)
    slice_y = slice(ystart, ystop + 1, stepy)
    return slice_x, slice_y

# function factory to create a load_path function from load_bbox
def create_load_path(load_bbox):
    def load_path_generic(path, variables=None):
        xs, ys = zip(*path)  # [(x0, y0), (x1, y1), ...] into [x0, x1, ...], [y0, y1, ...]
        l, r = np.min(xs), np.max(xs)
        b, t = np.min(ys), np.max(ys)
        data2d = load_bbox(variables=variables, bbox=[l, r, b, t])
        datapath = data2d.take(indices={'x': xs, 'y': ys}, broadcast=True)
        return datapath
    return load_path_generic

__init__.py

from importlib import import_module  # generic module import (builtin)
from . import common
# register the modules
__all__ = ["presentday_greenland"]
# import all modules and redefine missing functions
for mod in __all__:
    m = import_module("."+mod, package="greenland_data") # import module by name
    if not hasattr(m, 'load_path'):
        m.load_path = common.create_load_path(m.load_bbox)

I like it! I have ideas along these lines too, I'll comment soon!

By the way, check out my ice_data repository in the meantime...

Ah cool! I see we are going in the same direction! Very nice, I have always wanted to ask you for more data, I will have a look. I suppose you can only do that with certain datasets, right? The authors often want to keep some control over what becomes of their data.

Happy to hear your suggestions to make the python package more general!

An alternative would be to provide scripts (Python, bash, or whatever is easily installable) that transform the authors' datasets into standard CF-1.6 netCDF files...
But well, the two approaches are not incompatible.
One could just provide the typical CF-1.6 loading function (not too far from the above) with a couple of tweaks (map_dimensions and map_variables to rename dimensions and variables on the fly, inverted_y_axis), and any new dataset could either be processed on the fly or be transformed on disk before processing (e.g. if not exists('transformed...'): transform_data(...); ... load_transformed(...)). Up to each new dataset's author.
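A hedged sketch of that second, transform-on-disk option (transform_data and load_cf16 stand for hypothetical helpers in common.py, and the file names are purely illustrative):

import os

_RAWFILE = os.path.join(DATAROOT, "authors_original_format.nc")  # illustrative
_CFFILE = os.path.join(DATAROOT, "same_dataset_cf16.nc")         # illustrative

def load_bbox(bbox=None, variables=None, maxshape=None):
    # transform the authors' file into a standard CF-1.6 netCDF once, on disk,
    # then defer to the generic CF-1.6 loader for every subsequent call
    if not os.path.exists(_CFFILE):
        transform_data(_RAWFILE, _CFFILE)  # hypothetical one-off conversion
    return load_cf16(_CFFILE, bbox=bbox, variables=variables, maxshape=maxshape)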

And yes, it makes sense to add a "domain" attribute (greenland or antarctica) to make this more general. Then I would call this repository pycedata or icedatapy or something pythonizing like that, to make clear it is about python code, not the data themselves... Hmm, I need to think of something better. What about icedataio? No output, though... commonice, haha. Or simply icepy... OK, if you don't mind the lack of imagination, I would just go for icedata.

An example of use (with existing dimarray functions, and soon-to-exist icedata.tools functions):

 from dimarray.geo import transform, transform_vectors
 from icedata.greenland import presentday_greenland  as pdg, rignot_mouginot2012 as rm
 from icedata.tools import transform_bbox

 bbox = [-500e3, 0, -1500e3, -800e3] # (in meters)
 bbox_rm = transform_bbox(bbox, pdg.grid_mapping, rm.grid_mapping)

 data_pdg = pdg.load_bbox(bbox)
 data_rm = rm.load_bbox(bbox_rm)

 data_rm_samegrid = transform(data_rm, from_crs=rm.grid_mapping, to_crs=pdg.grid_mapping, xt=data_pdg.x, yt=data_pdg.y)