BUG: Faulty DatetimeIndex union
Closed this issue · 3 comments
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
from pandas import DatetimeIndex, RangeIndex

def test_pandas_datetime_index_union(self):
    """
    Demonstrates a suspected bug in pandas 2.2.3, where unions of DatetimeIndexes (and therefore pd.concats of dataframes with DatetimeIndexes)
    are returning unexpected values.
    My actual use case (concatenating two dataframes with these DatetimeIndexes, from which I extracted these date ranges) works in
    pandas 1.5.3, but not 2.2.3.
    Interestingly, this passes in both versions if you change the dtype to datetime64[ns].
    """
    dti1 = DatetimeIndex(
        ['2021-10-05 17:30:00',
         '2021-10-05 18:00:00',
         '2021-10-05 18:30:00',
         '2021-10-05 19:00:00',
         '2021-10-05 19:30:00'],
        dtype='datetime64[us]', name='DATETIME', freq='30min'
    )
    dti2 = DatetimeIndex(
        ['2021-10-05 17:30:00',
         '2021-10-05 18:00:00',
         '2021-10-05 18:30:00',
         '2021-10-05 19:00:00',
         '2021-10-05 19:30:00',
         '2021-10-05 20:00:00'],  # <-- Extra datetime
        dtype='datetime64[us]', name='DATETIME', freq='30min'
    )
    union = set(dti1.union(dti2))
    expected = set(dti1) | set(dti2)
    print(f"{union=}")
    print(f"{expected=}")
    assert len(union) == len(expected), "Should have all the rows from the concatenated dataframes"

def test_range_index_equality(self):
    """ This (presumably) faulty equality check appears to be the root cause of the DatetimeIndex union bug above.
    Note that the two stop values are different, so the RangeIndexes should not be equal.
    Interestingly, this fails in both pandas 1.5.3 and 2.2.3.
    """
    a = RangeIndex(start=1633455000000000, stop=1635262200000000, step=1800000000000)
    b = RangeIndex(start=1633455000000000, stop=1635264000000000, step=1800000000000)
    assert not a.equals(b)
Issue Description
The tests above (details in the docstrings) demonstrate the issue and what I think is the root cause.
Basically, we get back what appears to be an incorrect result when taking the union of two DatetimeIndexes with different ranges.
I traced this as far as the RangeIndex equality check in the second test, which appears to be faulty: it returns True for two different stop values.
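As a side note on the second test, here is a minimal sketch using plain Python range objects as a stand-in for RangeIndex. It assumes RangeIndex.equals compares the elements the ranges produce rather than the raw start/stop/step arguments, which is how Python's own range equality behaves. Under that assumption, the two ranges from the test really are equal, because the step is larger than the difference between the two stop values:

```python
# Plain-Python stand-in for the two RangeIndex objects in the second test.
# Assumption: RangeIndex.equals follows the same element-wise semantics as range ==.
a = range(1633455000000000, 1635262200000000, 1800000000000)
b = range(1633455000000000, 1635264000000000, 1800000000000)

# Both ranges hold only two values: the start and one step beyond it.
print(list(a))  # [1633455000000000, 1635255000000000]
print(list(b))  # [1633455000000000, 1635255000000000]

# range equality compares the sequences, not the stop arguments.
print(a == b)   # True
```

So the equality result is consistent with the inputs; the surprising part is that the step is three orders of magnitude larger than the spacing of the timestamps.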
Expected Behavior
Output from the first test should be (as in pandas 1.5.3):
union={Timestamp('2021-10-05 18:30:00'), Timestamp('2021-10-05 19:00:00'), Timestamp('2021-10-05 17:30:00'), Timestamp('2021-10-05 20:00:00'), Timestamp('2021-10-05 19:30:00'), Timestamp('2021-10-05 18:00:00')}
expected={Timestamp('2021-10-05 18:30:00'), Timestamp('2021-10-05 19:00:00'), Timestamp('2021-10-05 17:30:00'), Timestamp('2021-10-05 20:00:00'), Timestamp('2021-10-05 19:30:00'), Timestamp('2021-10-05 18:00:00')}
But the actual output in pandas 2.2.3 is (incorrectly):
union={Timestamp('2021-10-05 17:30:00'), Timestamp('2021-10-26 13:30:00')}
expected={Timestamp('2021-10-05 19:30:00'), Timestamp('2021-10-05 20:00:00'), Timestamp('2021-10-05 17:30:00'), Timestamp('2021-10-05 19:00:00'), Timestamp('2021-10-05 18:00:00'), Timestamp('2021-10-05 18:30:00')}
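The odd Timestamp('2021-10-26 13:30:00') in the actual output can be reproduced with integer arithmetic alone. The sketch below uses only the standard library and assumes, as the update further down suggests, that the index values are microseconds since the epoch while the 30-minute step is expressed in nanoseconds. The helper name us_to_datetime is made up for this illustration:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def us_to_datetime(value_us):
    """Hypothetical helper: read an integer as microseconds since the Unix epoch (UTC)."""
    return EPOCH + timedelta(microseconds=value_us)

first_us = 1633455000000000       # 2021-10-05 17:30:00 in microseconds
last_us = 1633462200000000        # 2021-10-05 19:30:00 in microseconds
step_ns = 30 * 60 * 10**9         # 30 minutes expressed in nanoseconds

# Microsecond endpoints combined with a nanosecond step (the suspected mismatch).
mixed = range(first_us, last_us + step_ns, step_ns)

for value in mixed:
    print(us_to_datetime(value))
# 2021-10-05 17:30:00+00:00
# 2021-10-26 13:30:00+00:00
```

Read as microseconds, the nanosecond step is 1,800,000 seconds (20 days 20 hours) instead of 30 minutes, which matches the two timestamps in the faulty union.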
Installed Versions
Update: I think pandas.core.indexes.datetimelike._as_range_index() is the real problem here: it always converts the freq to nanoseconds, but passes the first and last timestamps' _value through verbatim.
So if the timestamps are in us, we end up with a range whose start and stop are in us and whose step is in ns. In my case, that makes the step so large that the two ranges are actually equivalent (each contains only the start value plus one oversized step, which lands below both stop values).
So I think _as_range_index() probably needs to be made aware of the time unit somehow, and not always assume ns precision. The 2-line change below worked for me, but there's probably a more robust way to do this check:
In pandas.core.indexes.datetimelike._as_range_index():
    def _as_range_index(self) -> RangeIndex:
        # Convert our i8 representations to RangeIndex
        # Caller is responsible for checking isinstance(self.freq, Tick)
        freq = cast(Tick, self.freq)
        time_unit = 'ns' if self._values._ndarray.dtype.name=='datetime64[ns]' else 'us'  # check if we have ns or us values
        tick = Timedelta(freq).as_unit(time_unit)._value
        rng = range(self[0]._value, self[-1]._value + tick, tick)
        return RangeIndex(rng)
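The change above only distinguishes ns from us, so s and ms indexes would presumably still get the wrong tick. As a rough, pandas-free sketch of the more general idea (scale the step to whatever unit the endpoints use), the helper as_range and the TICKS_PER_SECOND table below are hypothetical names for illustration, not pandas internals:

```python
# Ticks per second for each resolution that datetime64 indexes commonly use.
TICKS_PER_SECOND = {"s": 1, "ms": 10**3, "us": 10**6, "ns": 10**9}

def as_range(first, last, freq_seconds, unit):
    """Hypothetical helper: integer range from first to last (inclusive),
    with the step expressed in the same unit as the endpoints."""
    tick = freq_seconds * TICKS_PER_SECOND[unit]
    return range(first, last + tick, tick)

first_s = 1633455000   # 2021-10-05 17:30:00 in seconds since the epoch
last_s = 1633462200    # 2021-10-05 19:30:00 in seconds since the epoch

for unit, scale in TICKS_PER_SECOND.items():
    rng = as_range(first_s * scale, last_s * scale, 30 * 60, unit)
    # Every resolution yields the same five half-hour slots.
    print(unit, len(rng))
```

With the step and endpoints in the same unit, the range has five elements at every resolution, so two indexes with different end times no longer collapse to the same range.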
Thanks for the report. The proposed fix is essentially on main, and I get the expected output there. Closing.
Ah so it is, sorry I missed that. Thanks for checking!