Support for dask arrays
I sort of stumbled upon this accidentally -- for some reason (that I'm currently trying to figure out), some variables passed from aospy to the upwind advection schemes in infinite-diff used dask arrays (instead of standard numpy arrays). This ended up raising a TypeError because the dask arrays do not support item assignment.
Do you have designs on supporting dask arrays in infinite-diff in the future? Particularly for computationally intensive things like MSE budgets, dask seems like it could be a valuable tool. This isn't a huge priority, since at the moment we don't rely on dask at all in aospy, but I just wanted to bring this up as I recently encountered it.
Beyond that (which I think ultimately stems from something within aospy or aospy_user), this package seems to be working very well for me so far!
Here's the full traceback for reference:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/home/Spencer.Clark/aospy_user/main.py in <module>()
166
167 if __name__ == '__main__':
--> 168 calcs = aospy_user.main(mp, prompt_verify=False)
169
170
/home/Spencer.Clark/aospy_user/aospy_user/main.pyc in main(main_params, exec_calcs, print_table, prompt_verify)
179 param_combos = cs.create_params_all_calcs()
180 calcs = cs.create_calcs(param_combos, exec_calcs=exec_calcs,
--> 181 print_table=print_table)
182 return calcs
/home/Spencer.Clark/aospy_user/aospy_user/main.pyc in create_calcs(self, param_combos, exec_calcs, print_table)
143 if exec_calcs:
144 try:
--> 145 calc.compute()
146 except:
147 raise
/home/Spencer.Clark/aospy/aospy/calc.pyc in compute(self)
669 if not all(bool_monthly):
670 full_ts, full_dt = self._compute_full_ts(data_in,
--> 671 monthly_mean=False)
672 else:
673 full_ts = False
/home/Spencer.Clark/aospy/aospy/calc.pyc in _compute_full_ts(self, data_in, monthly_mean, zonal_asym)
603 # Get results at each desired timestep and spatial point.
604 # Here we need to provide file read-in dates (NOT xarray dates)
--> 605 full_ts, dt = self._compute(data_in, monthly_mean=monthly_mean)
606 if zonal_asym:
607 full_ts = full_ts - full_ts.mean(LON_STR)
/home/Spencer.Clark/aospy/aospy/calc.pyc in _compute(self, data_in, monthly_mean)
585 data_monthly.append(d)
586 data_in = data_monthly
--> 587 local_ts = self._local_ts(*data_in)
588 if self.dtype_in_time == 'inst':
589 dt = xr.DataArray(np.ones(np.shape(local_ts[TIME_STR])),
/home/Spencer.Clark/aospy/aospy/calc.pyc in _local_ts(self, *data_in)
569 def _local_ts(self, *data_in):
570 """Perform the computation at each gridpoint and time index."""
--> 571 arr = self.function(*data_in)
572 if self.var.func_input_dtype == 'numpy':
573 arr = xr.DataArray(arr, coords=self.coords)
/home/Spencer.Clark/aospy_user/aospy_user/calcs/sah_mse_budget/energy.pyc in energy_column_vert_advec_as_resid_eta_time_mean(temp, z, q, u, v, vert_int_tdt_rad_imr, flux_t, flux_lhe, condensation_rain, convection_rain, ps, dp, radius, bk, pk)
591 temp, z, q, u, v, vert_int_tdt_rad_imr,
592 flux_t, flux_lhe, condensation_rain, convection_rain,
--> 593 ps, dp, radius, bk, pk
594 ), dp)
595
/home/Spencer.Clark/aospy_user/aospy_user/calcs/sah_mse_budget/energy.pyc in energy_horiz_advec_eta_upwind_adj_time_mean(temp, z, q, u, v, vert_int_tdt_rad_imr, flux_t, flux_lhe, condensation_rain, convection_rain, ps, dp, radius, bk, pk, order)
445 u_adj, v_adj, ps])
446 return SphereEtaUpwind(energy(tm, zm, qm, um, vm), pk, bk, psm,
--> 447 order=order).advec_horiz_const_p(um, vm)
448
449
/home/Spencer.Clark/infinite-diff/infinite_diff/advec/phys.py in advec_horiz_const_p(self, u, v)
284
285 def advec_horiz_const_p(self, u, v):
--> 286 return self.advec_x_const_p(u) + self.advec_y_const_p(v)
287
288 def advec_z(self, omega):
/home/Spencer.Clark/infinite-diff/infinite_diff/advec/phys.py in advec_y_const_p(self, v)
281 return self._Y_ADVEC_CLS(v, self.arr,
282 *self._advec_args,
--> 283 **self._advec_y_kwargs).advec(oper='grad')
284
285 def advec_horiz_const_p(self, u, v):
/home/Spencer.Clark/infinite-diff/infinite_diff/advec/phys.py in advec(self, *args, **kwargs)
62 :param flow: Flow that is advecting the field.
63 """
---> 64 bwd, fwd = self._derivs_bwd_fwd(*args, **kwargs)
65 neg, pos = self._flow_neg_pos()
66 advec_arr = pos*bwd + neg*fwd
/home/Spencer.Clark/infinite-diff/infinite_diff/advec/phys.py in _derivs_bwd_fwd(self, *args, **kwargs)
50 if getattr(self, 'cyclic', False):
51 return bwd, fwd
---> 52 return self._swap_bwd_fwd_edges(bwd, fwd)
53
54 def advec(self, *args, **kwargs):
/home/Spencer.Clark/infinite-diff/infinite_diff/advec/upwind.pyc in _swap_bwd_fwd_edges(self, bwd, fwd)
52 edge_left = {self.dim: 0}
53 edge_right = {self.dim: -1}
---> 54 bwd[edge_left] = fwd[edge_left]
55 fwd[edge_right] = bwd[edge_right]
56 return bwd, fwd
/home/skc/anaconda3/envs/py27/lib/python2.7/site-packages/xarray/core/dataarray.pyc in __setitem__(self, key, value)
405 else:
406 # orthogonal array indexing
--> 407 self.variable[key] = value
408
409 def __delitem__(self, key):
/home/skc/anaconda3/envs/py27/lib/python2.7/site-packages/xarray/core/variable.pyc in __setitem__(self, key, value)
370 key = self._item_key_to_tuple(key)
371 if isinstance(self._data, dask_array_type):
--> 372 raise TypeError("this variable's data is stored in a dask array, "
373 'which does not support item assignment. To '
374 'assign to this variable, you must first load it '
TypeError: this variable's data is stored in a dask array, which does not support item assignment. To assign to this variable, you must first load it into memory explicitly using the .load_data() method or accessing its .values attribute.
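For what it's worth, here is a minimal sketch of the same failure mode and the load-first workaround (my own illustration, not the actual aospy code path); it assumes an xarray version that, like the one in the traceback, disallows item assignment into dask-backed variables:

import numpy as np
import xarray as xr

# A dask-backed DataArray, e.g. as produced by open_dataset(..., chunks=...)
arr = xr.DataArray(np.arange(10.), dims=['x']).chunk({'x': 5})

# arr[{'x': 0}] = 99.  # raises the TypeError above: dask arrays don't support item assignment

# Workaround: load the data into memory (as numpy) first, then assign
arr = arr.load()
arr[{'x': 0}] = 99.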
Huh, I have no idea why it's using dask. That has never occurred before for me.
which I think ultimately stems from something within aospy or aospy_user
I agree. Still very odd though. Have you found a workaround? Or is this holding you up?
Yes, dask sounds really promising in the long run, but not on my radar to implement it anytime soon.
@spencerahill I'm not sure if you have a pressing use for it currently, but I just wanted to throw some examples up (see this gist) of computing derivative approximations (to arbitrary order of accuracy) using a combination of:
- sympy.calculus.finite_diff_weights
- scipy.ndimage.filters.correlate1d
- xarray.core.computation.apply_ufunc
- dask.array.ghost.map_overlap
The functions in this notebook enable derivatives to be computed on DataArrays backed by either dask or numpy arrays; importantly, they assume that the spacing between points along the dimension the derivative is taken along is constant. I think one could extend these functions to compute upwind estimates of the derivatives as well (as is enabled in infinite-diff), but I haven't done it yet.
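As a rough sketch of how the pieces fit together (my own example using the same libraries, not the gist's actual code), a centered difference of arbitrary even accuracy on a uniform, periodic grid might look like:

import numpy as np
import sympy
from scipy.ndimage import correlate1d

def centered_diff(arr, axis=-1, spacing=1., accuracy=2):
    # Symmetric stencil of integer offsets, e.g. [-1, 0, 1] for accuracy=2
    offsets = list(range(-accuracy // 2, accuracy // 2 + 1))
    # Weights for the first derivative at 0 using all stencil points
    weights = sympy.calculus.finite_diff_weights(1, offsets, 0)[1][-1]
    weights = np.array([float(w) for w in weights]) / spacing
    # Slide the stencil along `axis`; mode='wrap' assumes periodicity
    return correlate1d(arr, weights, axis=axis, mode='wrap')

The functions in the notebook then layer apply_ufunc and map_overlap on top of a core like this so it also works lazily on dask-backed DataArrays.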
A couple (minor) caveats to this approach:
- xarray.core.computation.apply_ufunc is currently not public API, so this might be a little ahead of its time, but as I understand it, it is expected to be at some point in the future (it's hard to resist using it for use-cases like this!).
- Using sympy.calculus.finite_diff_weights might also be a little too magical, but the code in the notebook could easily be simplified to use hard-coded weights for scipy.ndimage.filters.correlate1d to make things a little more explicit.
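For illustration (my own sketch, not from the notebook), the hard-coded alternative for centered first-derivative weights might look like:

# Centered first-derivative weights (unit spacing); divide by the grid spacing
# before passing them to scipy.ndimage.filters.correlate1d
CENTERED_FIRST_DERIV_WEIGHTS = {
    2: [-1./2, 0., 1./2],
    4: [1./12, -2./3, 0., 2./3, -1./12],
}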
I was playing around with these recently for another use-case, but I figured you might be interested (at least in the implementation details in xarray and dask).
@spencerkclark wow, thanks, this looks really cool! I should have time within a week or so to look into it more carefully. Will get back to you then.
@spencerkclark, I am floored. Although coming from you, I'm not all that surprised! This is so cool. I've been vaguely thinking for the last year or so about something along these lines, i.e. using sympy to generate arbitrary order of accuracy derivatives, and it is amazing to see it done so elegantly. And dask compatible! And with an amazing example Notebook!
My big-picture initial reaction is that I want to replace everything possible in infinite-diff with this approach. Or maybe it makes more sense to abandon infinite-diff altogether and start from scratch with this at the core?
More minor comments and questions follow. Note that I'm only superficially familiar with actually using sympy.
as I understand it, it is expected to be at some point in the future
That's my impression too; I wouldn't worry about it.
Using sympy.calculus.finite_diff_weights might also be a little too magical
I don't think so, in fact that's what makes this whole approach so elegant. Although some additional commenting describing what those calls are doing would help.
importantly it assumes the spacing between points along the dimension one takes a derivative along is constant.
How fundamental is this assumption to your approach? The sympy docs link to a paper for arbitrarily-spaced grids just below the function docstring.
def {centered,forward,backward}_diff_weights(accuracy, spacing=1.):
An OOP approach could cut down on the code repetition here. E.g. class DiffWeights with subclasses CenDiffWeights, etc. For better and worse, that's similar to what infinite-diff does.
# TODO: Don't hard-code periodic boundary conditions
raise NotImplementedError(
'xdiff currently only supports periodic boundary '
'conditions on dask arrays')
You have this NotImplementedError within xcorrelate1d, but it refers to xdiff. A little confusing (even though I understand that xdiff is currently the only thing that calls xcorrelate1d).
weights = forward_diff_weights(accuracy, spacing)
origin = -int(np.ceil(accuracy / 2.))
A brief comment explaining these origin definitions would be helpful.
domain = np.arange(0., accuracy + 1.) * spacing
You use this in 3 places, so worth turning it into a helper function, e.g.:
def _make_domain(accuracy, spacing):
    return np.arange(0., accuracy + 1.) * spacing
What about non-periodic datasets without dask?
Do you know why the 8th order methods using dask are noisy, whereas no others are?
Note that the sympy function used to generate the stencils can even do upwinding. This is a second order upwind difference to the first derivative:
In [2]: sympy.calculus.finite_diff_weights(1, [1, 2, 3], 3)[1][-1]
Out[2]: [1/2, -2, 3/2]
Could you explain what's going on here? From reading the docstring (admittedly briefly), it's not clear to me how this would get us upwind.
Glad to see you're enthusiastic about this approach! Thanks a lot for your comments.
My big-picture initial reaction is that I want to replace everything possible in infinite-diff with this approach. Or maybe it makes more sense to abandon infinite-diff altogether and start from scratch with this at the core?
I'm totally open to using something along these lines within infinite-diff. My initial hunch here was to keep things as functional as possible (i.e. not OOP), to mimic how many of these functions are implemented in numpy and scipy (at least for me it makes them more intuitive to use). My sense was to try and start on developing some foundational building blocks that more complex libraries (like infinite-diff, which includes logic to handle intricacies of model coordinate systems / geometries) could be built upon.
How fundamental is this assumption to your approach? The sympy docs link to a paper for arbitrarily-spaced grids just below the function docstring.
I think this is pretty fundamental to my approach. The issue lies in using scipy.ndimage.filters.correlate1d for doing the finite differencing. It's a really cool function, but it relies on a compact set of weights which you can use as a stencil to compute a finite difference at all points. If those weights were to vary from point to point (e.g. because of inconsistent grid spacing), one would have to think of a different implementation.

That being said, it would be very cool if there were a way to leverage sympy.calculus.finite_diff_weights for arbitrary non-uniform grids. Maybe once we lock down uniform grids, we can give that a try (in that case not using scipy.ndimage.filters.correlate1d).
You have this NotImplementedError within xcorrelate1d, but it refers to xdiff. A little confusing (even though I understand that xdiff is currently the ontly thing that calls xcorrelate1d).
I agree, thanks for pointing that out. I'll fix that.
A brief comment explaining these origin definitions would be helpful.
Yes, they are a little opaque. Will do.
Do you know why the 8th order methods using dask are noisy, whereas no others are?
It's not just the ones with dask; it's all the non-periodic ones. I think that this may actually have to do with floating point error (around order 10^-16). It doesn't show up in the periodic case, because I'm actually using double the grid spacing there. If I use 50 points in the non-periodic case (to make the grid spacing the same), things look smooth for the 8th order case (see the revised notebook).
Could you explain what's going on here? From reading the docstring (admittedly briefly), it's not clear to me how this would get us upwind.
Yes, I agree this is vague. I'll try and put together a concrete example (as well as add some discussion of the origin definitions) in the next week or so.
My initial hunch here was to keep things as functional as possible (i.e. not OOP), to mimic how many of these functions are implemented in numpy and scipy (at least for me it makes them more intuitive to use).
This makes sense. I'm starting to feel like infinite-diff, even if its internals remain highly OOP, would be better if the public API was purely functions that internally rely on the various classes, rather than being through instantiating the objects themselves. I think the same would hold for your new approach.
Maybe once we lock down uniform grids, we can give that a try (in that case not using scipy.ndimage.filters.correlate1d).
I agree. I think this means though that we should be careful to check the grid and warn prominently or maybe even raise whenever it's not uniform.
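Something along these lines could guard the entry points (a minimal sketch of the kind of check I mean, assuming an xarray DataArray input; the function name and tolerance are just placeholders):

import numpy as np

def _assert_uniform_spacing(arr, dim, rtol=1e-5):
    """Raise if the coordinate along `dim` is not (approximately) evenly spaced."""
    spacing = np.diff(arr[dim].values)
    if not np.allclose(spacing, spacing[0], rtol=rtol):
        raise ValueError("coordinate '{0}' is not uniformly spaced; these "
                         "finite-difference routines assume a constant grid "
                         "spacing".format(dim))
    return spacing[0]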
If I use 50 points in the non-periodic case (to make the grid spacing the same), things look smooth for the 8th order case (see the revised notebook).
That's awesome! Good indicator that your numerics are solid when floating point precision starts to limit your accuracy 😄
So all that being said, what do you think is the best way forward?
@spencerahill I've gone ahead and implemented an upwind differencing function (again for arbitrary accuracy, see the new notebook). It's dask-compatible because I basically just built it as an extension to the xdiff function (adding upwind_right and upwind_left options to xdiff and then making a separate function, xdiff_upwind, to compute both sides and select which estimate to use at each point based on the sign of the velocity field). Scroll down to the bottom of the notebook to see xdiff_upwind in action.

I also cleaned up the logic to compute the origin for correlate1d a tiny bit. Hopefully it's a little more transparent now; for more on what this parameter controls I recommend taking a look at this page in the scipy docs describing filter functions, in particular how they relate to finite differences. The origin parameter basically controls how much to shift the differencing stencil relative to the index you're computing the derivative at.
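To make that concrete, here's a small illustration of my own (not the notebook's exact code): a second-order forward difference uses the stencil [x, x+h, x+2h], so the kernel has to be shifted off-center via origin.

import numpy as np
from scipy.ndimage import correlate1d

x = np.linspace(0., 2. * np.pi, 64, endpoint=False)
h = x[1] - x[0]
f = np.sin(x)

accuracy = 2
# Forward-difference weights for the points [x, x+h, x+2h]
weights = np.array([-1.5, 2., -0.5]) / h
# Shift the stencil so its leftmost point sits on the index being differenced
origin = -int(np.ceil(accuracy / 2.))
dfdx = correlate1d(f, weights, origin=origin, mode='wrap')  # approximates cos(x)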
So all that being said, what do you think is the best way forward?
My sense is to try and work these into infinite-diff and go from there. Does that sound reasonable? Are there any more features infinite-diff needs that could be tricky to adapt to using these functions?
FYI not sure if it makes any difference for this, but in the new v1.13 release np.gradient now supports uneven spacing.
I made the assumption that upwind stencils always are shifted one place from their center
Sorry, can you clarify what you mean here? I think related, I'm not sure I understand the point = ... lines in the upwind_weights{left,right} functions, and in fact why these functions are needed. Why not use forward_diff_weights when v < 0 and backward_diff_weights when v > 0?
My sense is to try and work these into infinite-diff and go from there. Does that sound reasonable?
Definitely!
Are there any more features infinite-diff needs that could be tricky to adapt to using these functions?
Not that I can think of; derivatives and upwind advection are my only use cases. It's possible that adapting this generic functionality to different geometries (e.g. the sphere) will require some departures from the current approach. But honestly this is so much better than what I have now I will gladly re-do anything that this causes problems for!
FYI not sure if it makes any difference for this, but in the new v1.13 release np.gradient now supports uneven spacing.
Indeed it could be interesting to look at the implementation!
Sorry, can you clarify what you mean here? I think related, I'm not sure I understand the point = ... lines in the upwind_weights{left,right} functions, and in fact why these functions are needed. Why not use forward_diff_weights when v < 0 and backward_diff_weights when v > 0?
The issue is that higher order upwind schemes (e.g. 3rd order) do not strictly use forward and backward differences. (But again for cases higher than 3rd order I'm not totally sure if my assumption is valid).
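For concreteness (my own example, not from the notebook): the standard third-order upwind-biased stencil for positive velocity spans [x-2h, x-h, x, x+h], i.e. it is shifted one point off of center rather than being a purely one-sided backward difference, and sympy reproduces it directly:

import sympy

# First-derivative weights at x=0 on the unit-spaced points -2, -1, 0, 1
weights = sympy.calculus.finite_diff_weights(1, [-2, -1, 0, 1], 0)[1][-1]
print(weights)  # [1/6, -1, 1/2, 1/3]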
Not that I can think of; derivatives and upwind advection are my only use cases. It's possible that adapting this generic functionality to different geometries (e.g. the sphere) will require some departures from the current approach. But honestly this is so much better than what I have now I will gladly re-do anything that this causes problems for!
Sounds good! At some point when I have a chunk of free time I'll take a dive into to this and see where it leads.
The issue is that higher order upwind schemes (e.g. 3rd order) do not strictly use forward and backward differences. (But again for cases higher than 3rd order I'm not totally sure if my assumption is valid).
Sorry, that's my mistake...I had assumed that it was just forward or backward differencing. I tried doing some googling for stencils for higher order upwind schemes but didn't find anything easily.
At the end of the day, clearly your accuracy is improving steadily as you increase the order, and in a similar fashion to the standard differencing cases, so you're clearly either 100% correct or nearly so. Interesting that the minimum errors occur where the derivative is largest for odd orders but at the local extrema for even orders.
@kuchaale found a bug in xgradient, which I think I fixed in this updated gist. Copying my comment in response so it doesn't get lost (see the thread beginning here):
Hi @kuchaale,
Sorry I just saw this now (for some reason I wasn't notified by your tagging of me in your post). Many thanks for trying things out!
For right now I would say the boundary behavior of xdiff you experienced is expected, because I've only allowed for the mode='wrap' option, which is enabled by default, and the interior values look good. But it's clear you've uncovered a bug in xgradient. I think I've fixed things in the updated gist:

In [1]: test = xr.DataArray([30., 33., 39., 36., 41., 37.], dims=['x'])

Out [1]:
<xarray.DataArray (x: 6)>
array([-1.  ,  2.25,  0.75,  0.5 ,  0.25, -2.75])
Dimensions without coordinates: x

In [2]: xgradient(test, 'x', spacing=2., accuracy=2)

Out [2]:
<xarray.DataArray (x: 6)>
array([ 0.75,  2.25,  0.75,  0.5 ,  0.25, -4.25])
Dimensions without coordinates: x
Note that my result using xgradient is different than NCL's on the boundaries, because I'm doing second-order accurate differencing everywhere. NCL uses second-order centered differencing on the interior and first-order forward or backward differencing on the boundaries.

In the updated gist I also added some logic in xgradient that checks to see if the underlying data is stored in a dask array; only then will it chunk the left and right components. (This way you don't need to call .compute() at the end anymore).
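The check itself is simple; a minimal sketch of the idea (my own illustration, not the gist's exact code):

import dask.array as da

def _is_dask_backed(xarr):
    """True if the DataArray's underlying data is a dask array."""
    return isinstance(xarr.data, da.Array)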