pandas-dev/pandas

BUG: Faulty DatetimeIndex union

Closed this issue · 3 comments

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

from pandas import DatetimeIndex, RangeIndex

def test_pandas_datetime_index_union(self):
    """
    Demonstrates a suspected bug in pandas 2.2.3, where unions of DatetimeIndexes
    (and therefore pd.concats of DataFrames with DatetimeIndexes) return
    unexpected values.

    My actual use case (concatenating two DataFrames with these DatetimeIndexes,
    from which I extracted these date ranges) works in pandas 1.5.3, but not 2.2.3.

    Interestingly, this passes in both versions if you change the dtype to
    datetime64[ns] (see the workaround sketch after these tests).
    """

    dti1 = DatetimeIndex(
        ['2021-10-05 17:30:00', 
         '2021-10-05 18:00:00', 
         '2021-10-05 18:30:00', 
         '2021-10-05 19:00:00', 
         '2021-10-05 19:30:00'],
        dtype='datetime64[us]', name='DATETIME', freq='30min'
    )
    dti2 = DatetimeIndex(
        ['2021-10-05 17:30:00', 
         '2021-10-05 18:00:00',  
         '2021-10-05 18:30:00', 
         '2021-10-05 19:00:00', 
         '2021-10-05 19:30:00', 
         '2021-10-05 20:00:00'], # <-- Extra datetime
        dtype='datetime64[us]', name='DATETIME', freq='30min'
    )

    union = set(dti1.union(dti2))
    expected = set(dti1) | set(dti2)
    print(f"{union=}")
    print(f"{expected=}")

    assert len(union) == len(expected), "Should have all the rows from the concatenated dataframes"


def test_range_index_equality(self):
    """
    This (presumably) faulty equality check appears to be the root cause of the
    DatetimeIndex union bug above. Note that the two stop values are different,
    so the RangeIndexes should not be equal.
    Interestingly, this fails in both pandas 1.5.3 and 2.2.3.
    """
    a = RangeIndex(start=1633455000000000, stop=1635262200000000, step=1800000000000)
    b = RangeIndex(start=1633455000000000, stop=1635264000000000, step=1800000000000)
    assert not a.equals(b)
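
As a side note, here is a minimal workaround sketch based on the docstring observation above: forcing both indexes to nanosecond precision before taking the union avoids the bad result on 2.2.3. This assumes pandas 2.x, where DatetimeIndex.as_unit is available; the data is the same as in the tests.

from pandas import DatetimeIndex

dti1 = DatetimeIndex(
    ['2021-10-05 17:30:00', '2021-10-05 18:00:00', '2021-10-05 18:30:00',
     '2021-10-05 19:00:00', '2021-10-05 19:30:00'],
    dtype='datetime64[us]', name='DATETIME', freq='30min'
)
dti2 = DatetimeIndex(
    ['2021-10-05 17:30:00', '2021-10-05 18:00:00', '2021-10-05 18:30:00',
     '2021-10-05 19:00:00', '2021-10-05 19:30:00', '2021-10-05 20:00:00'],
    dtype='datetime64[us]', name='DATETIME', freq='30min'
)

# Workaround: upcast to ns before the union, then convert back to us if needed.
union = dti1.as_unit('ns').union(dti2.as_unit('ns')).as_unit('us')
assert len(union) == 6  # all six distinct timestamps are present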

Issue Description

The tests above (details in the docstrings) demonstrate the issue and what I think is the root cause.
Basically, we get back what appears to be an incorrect result when taking the union of two DatetimeIndexes with different ranges.

I traced this as far as the RangeIndex equality check in the second test, which appears to be faulty, returning True for two different stop values.
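
Worth noting for context: plain Python ranges with different stops still compare equal when they generate the same elements, and with a step this large both ranges collapse to the same two elements, so equals returning True is at least consistent with range semantics (RangeIndex.equals presumably follows the same sequence comparison). A quick sketch with the values from the second test:

# One step of 1_800_000_000_000 already overshoots both stop values, so each
# range contains exactly the same two elements and they compare equal.
a = range(1633455000000000, 1635262200000000, 1800000000000)
b = range(1633455000000000, 1635264000000000, 1800000000000)
assert list(a) == list(b) == [1633455000000000, 1635255000000000]
assert a == b  # Python compares ranges element-wise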

Expected Behavior

Output from the first test should be (as it is in pandas 1.5.3):

union={Timestamp('2021-10-05 18:30:00'), Timestamp('2021-10-05 19:00:00'), Timestamp('2021-10-05 17:30:00'), Timestamp('2021-10-05 20:00:00'), Timestamp('2021-10-05 19:30:00'), Timestamp('2021-10-05 18:00:00')}

expected={Timestamp('2021-10-05 18:30:00'), Timestamp('2021-10-05 19:00:00'), Timestamp('2021-10-05 17:30:00'), Timestamp('2021-10-05 20:00:00'), Timestamp('2021-10-05 19:30:00'), Timestamp('2021-10-05 18:00:00')}

But the actual output in pandas 2.2.3 is (incorrectly):

union={Timestamp('2021-10-05 17:30:00'), Timestamp('2021-10-26 13:30:00')}

expected={Timestamp('2021-10-05 19:30:00'), Timestamp('2021-10-05 20:00:00'), Timestamp('2021-10-05 17:30:00'), Timestamp('2021-10-05 19:00:00'), Timestamp('2021-10-05 18:00:00'), Timestamp('2021-10-05 18:30:00')}

Installed Versions

INSTALLED VERSIONS
------------------
commit : 0691c5c
python : 3.12.6
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 183 Stepping 1, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_Australia.1252
pandas : 2.2.3
numpy : 2.2.4
pytz : 2025.1
dateutil : 2.9.0.post0
pip : 25.0.1
Cython : None
sphinx : 4.5.0
IPython : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.13.3
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2025.3.0
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : 3.1.6
lxml.etree : None
matplotlib : 3.10.1
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.5
pandas_gbq : None
psycopg2 : 2.9.10
pymysql : None
pyarrow : 19.0.1
pyreadstat : None
pytest : None
python-calamine : None
pyxlsb : None
s3fs : 2025.3.0
scipy : 1.15.2
sqlalchemy : 2.0.39
tables : None
tabulate : None
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2025.1
qtpy : None
pyqt5 : None

Update: I think pandas.core.indexes.datetimelike._as_range_index() is the real problem here: it always converts the freq to nanoseconds, but passes the first and last timestamps' _values through verbatim.

So if the timestamps are in us, we end up with a range whose endpoints are in us but whose step is in ns, which in my case means the step is so large that the two ranges are actually equivalent (because the first step already goes past the stop value).
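
To make the mismatch concrete with the numbers from the repro (this is just illustrative arithmetic, not pandas internals):

from pandas import Timedelta, Timestamp

start = 1633455000000000  # 2021-10-05 17:30:00 as an i8 value in microseconds
# _as_range_index converts the freq to nanoseconds unconditionally:
tick_ns = Timedelta('30min').as_unit('ns')._value  # 1_800_000_000_000

# Interpreted against us-based values, that step is 1000x too large:
# 1_800_000_000_000 us is about 20.8 days, which is exactly where the bogus
# Timestamp('2021-10-26 13:30:00') in the output above comes from.
print(Timestamp(start + tick_ns, unit='us'))  # 2021-10-26 13:30:00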

So I think _as_range_index() probably needs to be made aware of the time unit somehow, and not always assume ns precision. The 2-line change below worked for me, but there's probably a more robust way to do this check:

In pandas.core.indexes.datetimelike._as_range_index()

    def _as_range_index(self) -> RangeIndex:
        # Convert our i8 representations to RangeIndex
        # Caller is responsible for checking isinstance(self.freq, Tick)
        freq = cast(Tick, self.freq)
        time_unit = 'ns' if self._values._ndarray.dtype.name == 'datetime64[ns]' else 'us'  # check whether the stored values are ns or us
        tick = Timedelta(freq).as_unit(time_unit)._value
        rng = range(self[0]._value, self[-1]._value + tick, tick)
        return RangeIndex(rng)
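
For what it's worth, here is a sketch of a slightly more general version of that check: instead of string-matching the dtype name (which only distinguishes ns from us), the unit can be read off the dtype itself with numpy's datetime_data, covering 's' and 'ms' as well. This is just an illustrative variant of the patch above, not a tested change (it assumes np is in scope, as it is in pandas.core.indexes.datetimelike):

    def _as_range_index(self) -> RangeIndex:
        # Convert our i8 representations to RangeIndex
        # Caller is responsible for checking isinstance(self.freq, Tick)
        freq = cast(Tick, self.freq)
        # np.datetime_data returns (unit, count) for any datetime64/timedelta64
        # dtype, so this handles 's', 'ms', 'us' and 'ns' uniformly.
        time_unit, _ = np.datetime_data(self._values._ndarray.dtype)
        tick = Timedelta(freq).as_unit(time_unit)._value
        rng = range(self[0]._value, self[-1]._value + tick, tick)
        return RangeIndex(rng)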

Thanks for the report. The proposed fix is essentially on main, and I get the expected output there. Closing.

Ah so it is, sorry I missed that. Thanks for checking!