pydata/xarray

xr.Dataset.expand_dims axis option doesn't work

Closed this issue · 15 comments

What happened?

When I try to change the position of a new dimension added with expand_dims by setting the axis option, nothing happens.

What did you expect to happen?

I would expect this option to add new dimensions in the position I selected, as the documentation describes. I would expect setting axis=0 to give a result like this:

Frozen({'yomama': 1, 'a': 3, 'b': 3})

Minimal Complete Verifiable Example

import xarray as xr

da = xr.DataArray([[1,2,3],[4,5,6],[7,8,9]], coords={'a':[1,2,3], 'b':[1,2,3]})
ds = xr.Dataset({'da':da})
ds1 = ds.expand_dims('yomama', axis=0)
print(ds1.dims)
ds2 = ds.expand_dims('yomama', axis=2)
print(ds2.dims)

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

Frozen({'a': 3, 'b': 3, 'yomama': 1})
Frozen({'a': 3, 'b': 3, 'yomama': 1})

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:56:21) [GCC 10.3.0]
python-bits: 64
OS: Linux
OS-release: 5.10.133+
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.8.1

xarray: 2022.10.0
pandas: 1.5.0
numpy: 1.23.3
scipy: 1.9.1
netCDF4: 1.6.1
pydap: installed
h5netcdf: 1.0.2
h5py: 3.7.0
Nio: None
zarr: 2.13.3
cftime: 1.6.2
nc_time_axis: 1.4.1
PseudoNetCDF: None
rasterio: 1.3.2
cfgrib: 0.9.10.2
iris: None
bottleneck: 1.3.5
dask: 2022.10.0
distributed: 2022.10.0
matplotlib: 3.6.1
cartopy: 0.21.0
seaborn: 0.12.0
numbagg: None
fsspec: 2022.8.2
cupy: None
pint: 0.19.2
sparse: 0.13.0
flox: 0.6.0
numpy_groupies: 0.9.19
setuptools: 65.5.0
pip: 22.3
conda: None
pytest: 7.1.3
IPython: 8.5.0
sphinx: None

@cmdupuis3 dimensions are not stored in any particular order in a Dataset; two of its DataArrays could have reversed dimension orders, for instance.

You would need to inspect each DataArray:

import xarray as xr

da = xr.DataArray([[1,2,3],[4,5,6],[7,8,9]], coords={'a':[1,2,3], 'b':[1,2,3]})
ds = xr.Dataset({'da':da})
ds0 = ds.expand_dims('yomama', axis=0)
print(ds0.dims)
print(ds0.da.dims)
ds1 = ds.expand_dims('yomama', axis=1)
print(ds1.dims)
print(ds1.da.dims)
ds2 = ds.expand_dims('yomama', axis=2)
print(ds2.dims)
print(ds2.da.dims)

Frozen({'a': 3, 'b': 3, 'yomama': 1})
('yomama', 'a', 'b')
Frozen({'a': 3, 'b': 3, 'yomama': 1})
('a', 'yomama', 'b')
Frozen({'a': 3, 'b': 3, 'yomama': 1})
('a', 'b', 'yomama')

I wonder if we shouldn't recommend using expand_dims without axis plus a transpose afterwards if we care about dimension order? Most of xarray's functions work without making assumptions about the dimension order, and I don't think expand_dims should, either (though I might be missing something, of course)
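
That pattern, sketched with the example names from this thread (a minimal illustration, not an official recommendation), would look like:

```python
import xarray as xr

# Example data from this thread
da = xr.DataArray([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  coords={'a': [1, 2, 3], 'b': [1, 2, 3]})
ds = xr.Dataset({'da': da})

# Add the new dimension without specifying axis, then make the
# per-variable axis order explicit with transpose
ds_t = ds.expand_dims('yomama').transpose('yomama', 'a', 'b')

print(ds_t.da.dims)  # ('yomama', 'a', 'b')
```

The transpose call is what actually pins down the axis order of each variable; the ordering shown by ds.dims remains incidental.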

I think it might be enough to describe this thoroughly, with examples, in the docstring, though I do like the solution of recommending transpose.

I mean that's fine, but in that case, the documentation is very misleading

> the documentation is very misleading

Updating the docstring would be a fairly easy and impactful PR if you're up for it!

Yeah, I could put something together. It'll probably have to wait until next week though.

EDIT: Lots of confusion below about nothing, plz disregard

Okay, regardless of expected behavior here, my particular use-case requires that I transpose these dimensions. Can someone show me a way to do this? I tried to explain the xarray point of view to Keras, but Keras is really not interested ;)

I tried something like ds.expand_dims("sample").transpose('sample','nlat','nlon') to complete futility, probably something to do with the Frozen stuff if I had to guess.

> Okay, regardless of expected behavior here, my particular use-case requires that I transpose these dimensions. Can someone show me a way to do this? I tried to explain the xarray point of view to Keras, but Keras is really not interested ;)
>
> I tried something like ds.expand_dims("sample").transpose('sample','nlat','nlon') to complete futility, probably something to do with the Frozen stuff if I had to guess.

The transpose method should change the dimension order on each array in the dataset. One particularly important point from Kai's comment above is that ds.dims does not tell you anything about the axis order of the DataArrays in the Dataset. Can you please describe how the DataArray dimension order reported by the code below differs from your expectations?

for var in ds.data_vars:
    print(ds[var].sizes)

Nvm, my use case isn't what I thought it was, but I'll push the issue a bit.

So I'm not disputing anything about what these functions actually do now. The issue I have is that the functions here treat the dimension order of a Dataset as if it's arbitrary, but calling [] on a Dataset slices it in a decidedly non-arbitrary way. It turns out that [] actually does care about which axis you select if you call expand_dims first and you index with an integer like [0]. I think this inconsistency is what's confusing me at the moment.

I'm not an xarray developer, but my guess is that your argument is why positional indexing/slicing is not available for datasets.

As for the specific case of the axis parameter of expand_dims, I think it is useful when the user is either confident about the axis order in each DataArray or will use label-based operations such that axis order doesn't matter. I was curious, so I did a quick comparison of the speed of using this parameter versus a subsequent transpose operation:

import numpy as np
import xarray as xr

shape = (10, 50, 100, 200)
ds = xr.Dataset(
    {
        "foo": (["time", "x", "y", "z"], np.random.rand(*shape)),
        "bar": (["time", "x", "y", "z"], np.random.randint(0, 10, shape)),
    },
    {
        "time": (["time"], np.arange(shape[0])),
        "x": (["x"], np.arange(shape[1])),
        "y": (["y"], np.arange(shape[2])),
        "z": (["z"], np.arange(shape[3])),
    },
)
%%timeit -r 4
ds1 = ds.expand_dims("sample", axis=1)

38.1 µs ± 76 ns per loop (mean ± std. dev. of 4 runs, 10,000 loops each)

%%timeit -r 4
ds2 = ds.expand_dims("sample").transpose("time", "sample", "x", "y", "z")

172 µs ± 612 ns per loop (mean ± std. dev. of 4 runs, 10,000 loops each)

Okay I think I get the philosophy now. However, indexing a Dataset with an integer actually does work. If performance is the goal, shouldn't something like ds[0] throw a warning or an error?

> Okay I think I get the philosophy now. However, indexing a Dataset with an integer actually does work. If performance is the goal, shouldn't something like ds[0] throw a warning or an error?

Can you share your code for this? I would interpret that as meaning you have a variable in your dataset mapped to an integer key, which is allowed as a hashable type but can cause problems with downstream packages.
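
For illustration, a minimal (hypothetical) sketch of a dataset where ds[0] does return something, because a variable is literally named 0:

```python
import xarray as xr

# A variable whose name is the integer 0 -- allowed, since Dataset
# keys only need to be hashable, but easy to misread as positional
# indexing.
ds = xr.Dataset({0: xr.DataArray([1, 2, 3], dims='x')})

print(ds[0].dims)  # ('x',) -- looks up the variable named 0, not axis 0
```

This is name-based lookup, the same as ds['foo'], not positional slicing.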

I was thinking something like this:

    da = xr.DataArray([[1,2,3],[4,5,6],[7,8,9]], coords={'a':[1,2,3], 'b':[1,2,3]})
    ds = xr.Dataset({'da':da})
    ds1 = ds.expand_dims('yomama', axis=0)
    print(ds1[0].dims)
    ds2 = ds.expand_dims('yomama', axis=2)
    print(ds2[0].dims)

...but this throws an error (like it should). I think I must be reading my code wrong lol
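
For reference, a minimal sketch of what that snippet actually does (assuming, as in the thread's example, that no variable is named 0): the integer lookup fails with a KeyError because Dataset indexing is by variable name, not by position.

```python
import xarray as xr

da = xr.DataArray([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  coords={'a': [1, 2, 3], 'b': [1, 2, 3]})
ds = xr.Dataset({'da': da})
ds1 = ds.expand_dims('yomama', axis=0)

try:
    ds1[0]  # name-based lookup: there is no variable named 0
except KeyError as err:
    caught = type(err).__name__
    print(caught)  # KeyError
```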

The xr.Dataset.expand_dims() method can be used to add new dimensions to a dataset. The axis parameter specifies the integer position at which each new dimension is inserted into every variable's axis order; it accepts an int (or a sequence of ints when adding several dimensions at once), not a dimension name.

Here's an example illustrating the axis parameter:

import xarray as xr

# create a sample dataset
data = xr.DataArray([[1, 2], [3, 4]], dims=('x', 'y'))
ds = xr.Dataset({'foo': data})

# insert the new dimension at position 0 of each variable
ds_expanded = ds.expand_dims({'z': [1]}, axis=0)

In this example, we create a 2D array with dimensions x and y, then insert a new dimension z at position 0, so foo ends up with dimensions ('z', 'x', 'y').

When adding several dimensions at once, pass one axis position per new dimension:

import xarray as xr

# create a sample dataset with a 2D array
data = xr.DataArray([[1, 2], [3, 4]], dims=('x', 'y'))
ds = xr.Dataset({'foo': data})

# insert 'z' at position 0 and 'w' at position 1
ds_expanded = ds.expand_dims({'z': [1], 'w': [1]}, axis=(0, 1))

Passing the same position more than once (for example axis=(0, 0)) raises a ValueError, because each new dimension needs a distinct axis in the result.

An alternative way to add a new size-1 dimension is xr.concat(), which concatenates datasets along a (possibly new) dimension:

import xarray as xr

# create a sample dataset with a 2D array
data = xr.DataArray([[1, 2], [3, 4]], dims=('x', 'y'))
ds = xr.Dataset({'foo': data})

# concatenating a single dataset along 'z' adds a new size-1 dimension
ds_expanded = xr.concat([ds], dim='z')

Here xr.concat() inserts the new dimension z at the front of each variable; the dim='z' parameter names the new dimension.

Closing this now. Feel free to reopen.