ec-jrc/pyPoseidon

Split meteo into daily chunks

Closed this issue · 4 comments

brey commented

Currently we have a way to split the 3 day meteo dataset in daily chunks with

                if split_by:
                    times, datasets = zip(*self.meteo.Dataset.groupby("time.{}".format(split_by)))
                    mpaths = ["sflux_air_{}.{:04d}.nc".format(m_index, t + 1) for t in np.arange(len(times))]
                    for das, mpath in list(zip(datasets, mpaths)):
                        self.to_force(
                            das, vars=["msl", "u10", "v10"], rpath=path, filename=mpath, date=self.rdate, **kwargs
                        )
                else:
                    self.to_force(self.meteo.Dataset, vars=["msl", "u10", "v10"], rpath=path, **kwargs)

which outputs 4 daily chunks.

However, there is a problem with the groupby functionality.

One can see it below:

import pandas as pd

import xarray as xr

tt = pd.date_range('2023-5-30','2023-6-2')

tt.values

    array(['2023-05-30T00:00:00.000000000', '2023-05-31T00:00:00.000000000',
           '2023-06-01T00:00:00.000000000', '2023-06-02T00:00:00.000000000'],
          dtype='datetime64[ns]')

d = xr.Dataset(
    {
        "time": (("time"), tt.values),
    }
)

d.groupby("time.day")

    DatasetGroupBy, grouped over 'day'
    4 groups with labels 1, 2, 30, 31.

d.groupby("time.dayofyear")

    DatasetGroupBy, grouped over 'dayofyear'
    4 groups with labels 150, 151, 152, 153.

tt2 = pd.date_range('2023-12-30','2024-1-2')

d2 = xr.Dataset(
    {
        "time": (("time"), tt2.values),
    }
)

d2.groupby("time.dayofyear")

    DatasetGroupBy, grouped over 'dayofyear'
    4 groups with labels 1, 2, 364, 365.

Using dayofyear solves it for the monthly occurrence but we get the same for cross years.

@pmav99 @tomsail Any ideas?

pmav99 commented
times = pd.date_range("2023-12-30", "2024-01-02", freq="h")
ds = xr.Dataset(
    {
        "time": (("time"), times),
    }
)
ds.resample(time="1D")

source: https://stackoverflow.com/a/54433261/592289

brey commented

@pmav99 It works. Since we have no gaps there is no problem with resample. Thanks!

pmav99 commented

By the way, since from our tests we saw that split by day gives some performance boost, should we perhaps enable it by default?

brey commented

I don't think so. If there is a gap in time, resample will interpolate, thus increasing the wall time. Also, if the file is small (regional window) there might not be a benefit. We should document this well and let the user decide.