pandas-dev/pandas

BUG(?): rolling sum with pyarrow types results in float64 instead of preserving integer type

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

In [23]: pd.Series([1,2,3], dtype='Int64[pyarrow]').rolling(2).sum()
Out[23]:
0    NaN
1    3.0
2    5.0
dtype: float64

Issue Description

Given that 'Int64[pyarrow]' supports missing values, should the above not result in

0    <NA>
1    3
2    5
dtype: Int64[pyarrow]

to avoid the usual issues around floating point numbers?
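
To make the floating-point concern concrete: float64 has a 53-bit significand, so not every integer above 2**53 is representable. A minimal sketch in plain Python (whose float is an IEEE 754 double, the same format as float64); the values are illustrative and not taken from the report above:

```python
# Two values in one rolling window; their exact sum is 2**53 + 1.
window = [2**53, 1]

# Integer arithmetic is exact.
exact = sum(window)

# float64 arithmetic cannot represent 2**53 + 1, so the sum is rounded.
approx = sum(float(v) for v in window)

print(exact)        # exact integer sum
print(int(approx))  # rounded float64 sum, converted back for comparison
print(exact == int(approx))
```

An integer-preserving rolling sum would avoid this rounding as long as the result fits in the integer type.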

Expected Behavior

The result should preserve the integer dtype.

Other tools for reference:

In [16]: pl.Series([1,2,3]).rolling_sum(2)
Out[16]:
shape: (3,)
Series: '' [i64]
[
        null
        3
        5
]

In [17]: duckdb.sql("""
    ...: from values (1),(2),(3) df(a)
    ...: select case when count(a) over w >= 2 then sum(a) over w else null end as a
    ...: window w as (rows between 1 preceding and current row)
    ...: """)
Out[17]:
┌────────┐
│   a    │
│ int128 │
├────────┤
│   NULL │
│      3 │
│      5 │
└────────┘

Installed Versions

INSTALLED VERSIONS

commit : 57fd502
python : 3.10.12
python-bits : 64
OS : Linux
OS-release : 5.15.167.4-microsoft-standard-WSL2
Version : #1 SMP Tue Nov 5 00:21:55 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 3.0.0.dev0+1979.g57fd50221e
numpy : 1.26.4
dateutil : 2.9.0.post0
pip : 25.0.1
Cython : 3.0.12
sphinx : 8.1.3
IPython : 8.33.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.13.3
blosc : None
bottleneck : 1.4.2
fastparquet : 2024.11.0
fsspec : 2025.2.0
html5lib : 1.1
hypothesis : 6.127.5
gcsfs : 2025.2.0
jinja2 : 3.1.5
lxml.etree : 5.3.1
matplotlib : 3.10.1
numba : 0.61.0
numexpr : 2.10.2
odfpy : None
openpyxl : 3.1.5
psycopg2 : 2.9.10
pymysql : 1.4.6
pyarrow : 19.0.1
pyreadstat : 1.2.8
pytest : 8.3.5
python-calamine : None
pytz : 2025.1
pyxlsb : 1.0.10
s3fs : 2025.2.0
scipy : 1.15.2
sqlalchemy : 2.0.38
tables : 3.10.1
tabulate : 0.9.0
xarray : 2024.9.0
xlrd : 2.0.1
xlsxwriter : 3.2.2
zstandard : 0.23.0
tzdata : 2025.1
qtpy : None
pyqt5 : None

Agreed, we should be able to preserve the input dtype for rolling sum aggregations, and likewise for min and max.

@mroeschke @MarcoGorelli Given that groupby operations are able to correctly handle extension (and other non-float64) dtypes, I'm guessing we will need to do something similar for window aggregations:

  1. Update all window Cython aggregations to be able to handle numpy types other than float64_t (i.e. don't hardcode float64_t)
  2. Update all window aggregations to handle masked arrays (i.e. allow extra mask and result_mask params)
  3. Create a separate _window_op() path for extension arrays similar to _groupby_op()
  4. Implement _window_op() for each extension array type

This seems like quite a bit of work, so please suggest a simpler way if you can think of one. We could also do this in stages: implement step 1 first, as it is independent of the other three and would resolve other issues like #23002 and partially #11446.
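
As a rough illustration of what steps 1 and 2 might amount to, here is a pure NumPy sketch of a rolling sum that keeps the input dtype and tracks missing values through `mask` and `result_mask` arrays. The function name `rolling_sum_masked` and its signature are hypothetical; this is not the existing Cython code, only a sketch of the idea under those assumptions:

```python
import numpy as np

def rolling_sum_masked(values, mask, window, min_periods=None):
    """Hypothetical kernel: rolling sum in the input dtype, NA tracked by masks."""
    if min_periods is None:
        min_periods = window
    n = len(values)
    out = np.zeros(n, dtype=values.dtype)    # same dtype as the input, no float64
    result_mask = np.ones(n, dtype=bool)     # True marks a missing result
    running = values.dtype.type(0)           # running sum of valid values in the window
    valid = 0                                # count of valid values in the window
    for i in range(n):
        if not mask[i]:                      # add the value entering the window
            running += values[i]
            valid += 1
        j = i - window
        if j >= 0 and not mask[j]:           # remove the value leaving the window
            running -= values[j]
            valid -= 1
        if valid >= min_periods:
            out[i] = running
            result_mask[i] = False
    return out, result_mask

values = np.array([1, 2, 3], dtype=np.int64)
mask = np.zeros(3, dtype=bool)               # no missing inputs
out, result_mask = rolling_sum_masked(values, mask, window=2)
print(out.dtype, out, result_mask)
```

Steps 3 and 4 would then presumably wrap `out` and `result_mask` back into the corresponding extension array, much as the groupby path does.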

The shorter way would be to astype the result of the Cython aggregations back to the input dtype, but your steps are definitely the more thorough approach, and the one we would want in the longer term.
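
From the user side, the astype shortcut could look roughly like this. It is a sketch rather than pandas internals: the helper name `rolling_sum_keep_dtype` is hypothetical, and the example uses the masked 'Int64' dtype so that it runs without pyarrow installed. A pyarrow-backed integer dtype would presumably be handled the same way.

```python
import pandas as pd

def rolling_sum_keep_dtype(s, window):
    """Hypothetical helper: rolling sum cast back to the dtype of the input Series."""
    # rolling().sum() currently returns float64, with NaN for incomplete windows.
    as_float = s.rolling(window).sum()
    # Casting back turns NaN into <NA> and restores the nullable integer dtype.
    return as_float.astype(s.dtype)

s = pd.Series([1, 2, 3], dtype="Int64")
result = rolling_sum_keep_dtype(s, 2)
print(result)
```

Note that the sum is still computed in float64 here, so the cast restores the dtype and the missing-value marker but cannot recover precision already lost for very large integers. That is one reason the staged approach above is the more thorough fix.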