pmlmodelling/nctoolkit

[JOSS] Object representation does not reflect lazy operations

Closed this issue · 6 comments

Describe the bug
It looks like I can't select a single year from an NCEP dataset (or at least the representation of the DataSet object does not show the subsetting).

To Reproduce

import nctoolkit as nc
nc_ds = nc.open_url("https://github.com/pydata/xarray-data/raw/master/air_temperature.nc")
print(nc_ds)
nc_ds.subset(year = 2013)
print(nc_ds)

Output:

nctoolkit is using Climate Data Operators version 2.2.0
Downloading https://github.com/pydata/xarray-data/raw/master/air_temperature.nc

The variable air has integer data type. Consider setting data type to float 'F64' or 'F32' using set_precision.


<nctoolkit.DataSet>:
Number of files: 1
File contents:
  variable  ntimes  npoints  nlevels                                   long_name  unit data_type
0      air    2920     1325        1  4xDaily Air temperature at sigma level 995  degK       I16

<nctoolkit.DataSet>:
Number of files: 1
File contents:
  variable  ntimes  npoints  nlevels                                   long_name  unit data_type
0      air    2920     1325        1  4xDaily Air temperature at sigma level 995  degK       I16

Expected behavior
ntimes should change from 2920 to 1460 after subsetting, as it does with the equivalent xarray selection:

import xarray as xr
xr_ds = xr.tutorial.open_dataset("air_temperature").chunk()
print(xr_ds.dims)
xr_ds = xr_ds.sel(time="2013")
print(xr_ds.dims)

Output:

Frozen({'lat': 25, 'time': 2920, 'lon': 53})
Frozen({'lat': 25, 'time': 1460, 'lon': 53})

Desktop:

  • OS: macOS
  • nctoolkit version: '0.9.3'

openjournals/joss-reviews#5494

I see, I need to run nc_ds.run() to actually see the changes.
I find this quite confusing, especially because dataset objects are modified in place and therefore it's very hard to keep track of the modifications that will be applied.

I think that the representation of the object should show the modified coordinates/dimensions/sizes (same as xarray+dask, which is also lazy), or it should show at least all the operations that will be applied when nc_ds.run() is called.
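To make the second option concrete, here is a minimal sketch of a representation that lists queued operations. LazyDataset and everything in it are hypothetical names for illustration only; this is not how nctoolkit's DataSet is implemented, and the timesteps are toy (year, step) tuples rather than real data.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not nctoolkit)."""

    def __init__(self, times):
        self._times = list(times)  # data as currently materialised
        self._pending = []         # queued (description, function) pairs

    def subset(self, year):
        # Queue the operation instead of applying it immediately.
        self._pending.append(
            (f"subset(year={year})", lambda ts: [t for t in ts if t[0] == year])
        )

    def run(self):
        # Apply the queued operations in order, then clear the queue.
        for _, func in self._pending:
            self._times = func(self._times)
        self._pending = []

    def __repr__(self):
        lines = ["<LazyDataset>", f"ntimes: {len(self._times)}"]
        if self._pending:
            names = ", ".join(desc for desc, _ in self._pending)
            lines.append(f"pending operations (not yet run): {names}")
        return "\n".join(lines)


# Two years of 4x-daily timesteps, stored as (year, step) tuples.
ds = LazyDataset([(y, i) for y in (2013, 2014) for i in range(1460)])
ds.subset(year=2013)
print(ds)  # ntimes is still 2920, but the pending subset is listed
ds.run()
print(ds)  # ntimes is now 1460 and nothing is pending
```

With something like this, the stale ntimes would at least come with a visible note that a subset has been queued but not yet applied.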

Yeah, there is some ambiguity here. I think the solution is to call ds.run() automatically whenever you access attributes, print the dataset, etc. That is closer to what a new user would expect. It also shouldn't have any real computational cost, as you'll mostly be accessing attributes interactively, not when scripting.
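Roughly, the idea could look like the sketch below. AutoRunDataset is a hypothetical toy class, not the actual nctoolkit code: the only point is that reading an attribute (or printing the object) flushes the queued operations first.

```python
class AutoRunDataset:
    """Hypothetical sketch: queued operations are flushed when an attribute is read."""

    def __init__(self, times):
        self._times = list(times)  # toy (year, step) tuples
        self._pending = []         # queued functions, applied lazily

    def subset(self, year):
        # Still lazy: nothing is computed here.
        self._pending.append(lambda ts: [t for t in ts if t[0] == year])

    def run(self):
        # Apply queued operations in order, then clear the queue.
        for func in self._pending:
            self._times = func(self._times)
        self._pending = []

    @property
    def ntimes(self):
        # Attribute access triggers evaluation, so the value is never stale.
        self.run()
        return len(self._times)

    def __repr__(self):
        # repr goes through the property, so printing also triggers run().
        return f"<AutoRunDataset ntimes={self.ntimes}>"


ds = AutoRunDataset([(y, i) for y in (2013, 2014) for i in range(1460)])
ds.subset(year=2013)
print(ds)  # the subset is applied before the representation is built
```

The trade-off is that printing is no longer free of side effects, which seems acceptable for interactive use.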

In theory, changes could be tracked without running commands, but that would just become very awkward book-keeping.

OK, I think it's clear now.

Very minor thing I've noticed. When I run

nc_ds = nc.open_url("https://github.com/pydata/xarray-data/raw/master/air_temperature.nc")

there's a weird string printed while downloading. (The string is the second half of the URL.)

That's strange. This prints OK for me on Linux.

What Python version/OS are you using?

The code is just print(f"Downloading {x}"), where x is the URL string, so it's hard to see what could cause this.

I'm using macOS. Here is the env:
nctoolkit_env.txt

I tried both python and ipython, same issue.

OK. This seems to be a shell issue. The function also contains the line below, which, as far as I remember, was only there to tidy up the printed output.

print("\033[A \033[A")

This must behave differently on Macs.
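For reference, \033[A is the ANSI (ECMA-48) escape sequence that moves the cursor up one line, so that print call repositions the cursor without erasing anything; how the result looks depends on the terminal. A more defensive variant, sketched below, only emits escape codes when the output is an interactive terminal and erases the line explicitly. clear_previous_line is a hypothetical helper, not something in nctoolkit.

```python
import sys

CURSOR_UP = "\033[A"    # ECMA-48 CUU: move the cursor up one line
ERASE_LINE = "\033[2K"  # ECMA-48 EL: erase the whole current line


def clear_previous_line(stream=None):
    """Hypothetical helper: erase the line printed just before this call.

    Returns True if escape codes were written, False if the stream is not
    an interactive terminal (for example a file, a pipe, or a notebook).
    """
    stream = sys.stdout if stream is None else stream
    if not stream.isatty():
        return False
    # Move up to the previous line, wipe it, and return to column 0.
    stream.write(CURSOR_UP + ERASE_LINE + "\r")
    stream.flush()
    return True


print("Downloading https://github.com/pydata/xarray-data/raw/master/air_temperature.nc")
clear_previous_line()
```

When the output is redirected, the helper does nothing, so no stray escape characters end up in log files.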

Printing the URL you are downloading is overkill anyway, so I've removed it from the function in the dev version.