IPF associated reader not able to parse datetimes beyond the year 2261
Closed this issue · 4 comments
Wells with associated timeseries in iMOD5 regional models occasionally use timestamps like 29991231 to force wells to keep extracting until the end of the simulation period (assuming no more than ~1000 years are simulated). However, we parse these datetimes with:
...
def read_associated(path, kwargs={}):
...
len_date = len(df[time_column].iloc[0])
if len_date == 14:
df[time_column] = pd.to_datetime(df[time_column], format="%Y%m%d%H%M%S")
elif len_date == 8:
df[time_column] = pd.to_datetime(df[time_column], format="%Y%m%d")
...
When given a format argument, pd.to_datetime always uses nanoseconds as its unit, and thus it doesn't support datetimes beyond 2261. There is a unit argument, but it doesn't work together with the format argument.
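To illustrate the limit (a minimal check, not part of the reader code):

```python
import pandas as pd

# The nanosecond-resolution timestamps pandas uses by default
# top out early in the year 2262:
print(pd.Timestamp.max)  # 2262-04-11 23:47:16.854775807
```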
I think it is better to use the imod.util.to_datetime function here, which also checks len_date, so we can remove the if statement in read_associated. This returns a datetime.datetime object, which we can hopefully convert with pandas to a timestamp with a larger unit.
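A minimal sketch of that conversion path, using plain datetime.strptime as a stand-in for imod.util.to_datetime (whose internals aren't shown here):

```python
from datetime import datetime
import numpy as np

# Stand-in for imod.util.to_datetime: datetime.strptime
# happily parses years up to 9999.
stamp = datetime.strptime("29991231", "%Y%m%d")

# Convert to a second-resolution numpy datetime, which does not overflow:
dt64 = np.datetime64(stamp, "s")
print(dt64)  # 2999-12-31T00:00:00
```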
Somewhat related to: #41
I've done some simple profiling in IPython; this might run about twice as fast.
In [1]: %timeit pd.to_datetime(np.datetime64(imod.util.to_datetime("29991231"), "s"))
20.9 μs ± 254 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [2]: %timeit pd.to_datetime("20021231", format="%Y%m%d")
40.1 μs ± 445 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
On second thought: forget my previous comment. It is faster for individual values, but for arrays it is 10 times slower, since imod.util.to_datetime has to be called for each individual element in a list comprehension, whereas pd.to_datetime is vectorized.
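Roughly, the two approaches look like this (synthetic data, not the iMOD code):

```python
from datetime import datetime
import numpy as np
import pandas as pd

dates = ["20020101", "20020102", "20021231"]

# Vectorized: one pd.to_datetime call over the whole column.
vectorized = pd.to_datetime(dates, format="%Y%m%d")

# Element-wise: a Python-level loop, roughly 10x slower on large arrays.
elementwise = np.array(
    [np.datetime64(datetime.strptime(d, "%Y%m%d"), "s") for d in dates]
)

print(vectorized[0], elementwise[0])
```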
Furthermore, the pandas DatetimeIndex used in dataframes and for resampling only supports nanoseconds as its unit, according to the documentation. I've tried quite a few things to see if I could generate one some other way, but couldn't figure out an easy method. Alternatively, xarray has a CFTimeIndex, but this doesn't work with resampling, as xarray uses pandas for that, so that would require implementing some custom logic ourselves.
Also interesting to note: we've never had a complaint about this with regard to IPFs. So apparently such dates do not occur (much) in practice. In that case, you might be able to replace a date like 29991231 with 22611231 instead or something (and log/warn).
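A possible clamp-and-warn sketch; clamp_date is a hypothetical helper, not part of the imod API:

```python
from datetime import datetime
import warnings

# Last whole date safely within pandas' nanosecond-resolution range.
NS_SAFE_LIMIT = datetime(2261, 12, 31)

def clamp_date(date_str):
    """Hypothetical helper: clamp yyyymmdd dates beyond the ns range."""
    stamp = datetime.strptime(date_str, "%Y%m%d")
    if stamp > NS_SAFE_LIMIT:
        warnings.warn(
            f"{date_str} is beyond pandas' nanosecond range, "
            f"clamped to {NS_SAFE_LIMIT:%Y%m%d}"
        )
        return NS_SAFE_LIMIT
    return stamp

print(clamp_date("29991231"))  # 2261-12-31 00:00:00
print(clamp_date("20021231"))  # 2002-12-31 00:00:00
```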
In general, this feels like a very serious issue. But the only workaround that I see is using xarray Datasets instead of pandas DataFrames. That's rather a lot of work.
In the meantime, I don't think it's really crippling. If you want to set up a paleo-study with wells, you can do so by using xarray for the time management.
You just can't set up a paleo-study nicely going from the project file...
Other thoughts: use numpy datetimes instead when going from a project file (add a bool argument to the IPF reading)? Maybe convert temporarily to xarray to resample.
I have some fixes in #1169 which make it work for the LHM (one final value in the timeseries beyond the year 2261). If this issue pops up again, feel free to re-open.