BUG(?): rolling sum with pyarrow types results in float64 instead of preserving integer type

Question

Opened this issue 13 days ago · 3 comments

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

In [23]: pd.Series([1,2,3], dtype='Int64[pyarrow]').rolling(2).sum()
Out[23]:
0    NaN
1    3.0
2    5.0
dtype: float64

Issue Description

Given that 'Int64[pyarrow]' supports missing values, should the above not result in

0    <NA>
1    3
2    5
dtype: Int64[pyarrow]

to avoid the usual issues around floating point numbers?
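To make the floating-point concern concrete, here is a minimal pure-Python sketch (no pandas involved; the helper `rolling_sum` and its `as_float` flag are hypothetical names made up for this illustration). It computes the same rolling sum once with integer arithmetic and once by accumulating in floats, which is roughly what a float64-only aggregation path implies. Above 2**53 the float path can no longer represent every integer.

```python
def rolling_sum(values, window, as_float):
    """Naive rolling sum; positions with an incomplete window yield None."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
            continue
        chunk = values[i + 1 - window : i + 1]
        if as_float:
            # Accumulate in float64, mimicking a float-only aggregation.
            total = 0.0
            for v in chunk:
                total += float(v)
        else:
            # Python integers are exact at any magnitude.
            total = 0
            for v in chunk:
                total += v
        out.append(total)
    return out


values = [2**53, 1, 1]
int_result = rolling_sum(values, 2, as_float=False)
float_result = rolling_sum(values, 2, as_float=True)

print(int_result[1])         # 9007199254740993
print(int(float_result[1]))  # 9007199254740992
```

The second window is exact in both paths; only the window that crosses 2**53 differs, which is why the problem tends to go unnoticed with small test data.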

Expected Behavior

The rolling sum should preserve the integer type of the input, with <NA> for incomplete windows.

Other tools for reference:

In [16]: pl.Series([1,2,3]).rolling_sum(2)
Out[16]:
shape: (3,)
Series: '' [i64]
[
        null
        3
        5
]

In [17]: duckdb.sql("""
    ...: from values (1),(2),(3) df(a)
    ...: select case when count(a) over w >= 2 then sum(a) over w else null end as a
    ...: window w as (rows between 1 preceding and current row)
    ...: """)
Out[17]:
┌────────┐
│   a    │
│ int128 │
├────────┤
│   NULL │
│      3 │
│      5 │
└────────┘

Installed Versions

INSTALLED VERSIONS

commit : 57fd502
python : 3.10.12
python-bits : 64
OS : Linux
OS-release : 5.15.167.4-microsoft-standard-WSL2
Version : #1 SMP Tue Nov 5 00:21:55 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 3.0.0.dev0+1979.g57fd50221e
numpy : 1.26.4
dateutil : 2.9.0.post0
pip : 25.0.1
Cython : 3.0.12
sphinx : 8.1.3
IPython : 8.33.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.13.3
blosc : None
bottleneck : 1.4.2
fastparquet : 2024.11.0
fsspec : 2025.2.0
html5lib : 1.1
hypothesis : 6.127.5
gcsfs : 2025.2.0
jinja2 : 3.1.5
lxml.etree : 5.3.1
matplotlib : 3.10.1
numba : 0.61.0
numexpr : 2.10.2
odfpy : None
openpyxl : 3.1.5
psycopg2 : 2.9.10
pymysql : 1.4.6
pyarrow : 19.0.1
pyreadstat : 1.2.8
pytest : 8.3.5
python-calamine : None
pytz : 2025.1
pyxlsb : 1.0.10
s3fs : 2025.2.0
scipy : 1.15.2
sqlalchemy : 2.0.38
tables : 3.10.1
tabulate : 0.9.0
xarray : 2024.9.0
xlrd : 2.0.1
xlsxwriter : 3.2.2
zstandard : 0.23.0
tzdata : 2025.1
qtpy : None
pyqt5 : None

Answer 1 · 2025-03-18T22:18:03.000Z

Agreed, we should be able to preserve the input type for rolling sum aggregations. The same applies to min and max.

Answer 2 · 2025-03-19T04:41:54.000Z

@mroeschke @MarcoGorelli Given that groupby operations already handle extension (and other non-float64) dtypes correctly, I'm guessing we will need to do something similar for window aggregations:

1. Update all window Cython aggregations to handle numpy types other than float64_t (i.e. don't hardcode float64_t).
2. Update all window aggregations to handle masked arrays (i.e. allow extra mask and result_mask params).
3. Create a separate _window_op() path for extension arrays, similar to _groupby_op().
4. Implement _window_op() for each extension array type.

This seems like quite a bit of work, so please suggest a simpler way if you can think of one. We could also do this in stages: implement step 1 first, as it is independent of the other three and will resolve other issues like #23002 and, partially, #11446.
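As a rough illustration of the second item above (aggregations that accept `mask` and `result_mask`), here is a pure-Python sketch of a mask-aware rolling sum. It is not pandas code: the function name `rolling_sum_masked`, its signature, and the choice of 0 as a placeholder in masked output slots are all assumptions for illustration. The point is only that the values stay integers throughout and missingness travels in a separate boolean array.

```python
def rolling_sum_masked(values, mask, window, min_periods=None):
    """Rolling sum over integers with a boolean mask (True = missing).

    Returns (result, result_mask). Slots where result_mask is True hold
    a placeholder 0 and must not be read as data.
    """
    if min_periods is None:
        min_periods = window
    n = len(values)
    result = [0] * n
    result_mask = [True] * n
    running = 0  # integer running total of valid values in the window
    valid = 0    # number of valid (unmasked) values in the window
    for i in range(n):
        # Element entering the window.
        if not mask[i]:
            running += values[i]
            valid += 1
        # Element leaving the window.
        j = i - window
        if j >= 0 and not mask[j]:
            running -= values[j]
            valid -= 1
        if valid >= min_periods:
            result[i] = running
            result_mask[i] = False
    return result, result_mask


print(rolling_sum_masked([1, 2, 3], [False, False, False], 2))
# ([0, 3, 5], [True, False, False])
```

With the input from the report, the first position is masked rather than NaN, and the remaining sums stay integers, which is the shape of result a `_window_op()` path could wrap back into an extension array.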

Answer 3 · 2025-03-19T16:02:55.000Z

The shorter way would be to astype the result from the Cython aggregations back to the input dtype, but your steps are definitely the more thorough approach that we would want in the longer term.
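The shortcut can be sketched in pure Python as a post-processing step on the float results. This is only a model of the idea, not what pandas does internally: `cast_back_to_int` is a hypothetical helper, and `None` stands in for `<NA>`.

```python
import math


def cast_back_to_int(float_results):
    """Convert float aggregation results back to integers.

    NaN becomes None (standing in for <NA>); any non-integral value is
    rejected rather than silently truncated.
    """
    out = []
    for value in float_results:
        if math.isnan(value):
            out.append(None)
        elif value.is_integer():
            out.append(int(value))
        else:
            raise ValueError(f"cannot cast non-integral value {value!r}")
    return out


print(cast_back_to_int([float("nan"), 3.0, 5.0]))  # [None, 3, 5]
```

A cast like this restores the dtype but cannot restore precision already lost while the sum was held in float64, which is one reason the staged approach is the better long-term fix.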