pandas-dev/pandas

PERF: Arrow dtypes are much slower than Numpy for DataFrame.apply

Opened this issue · 8 comments

The same code with DataFrame.apply is >4x slower when the data is in Arrow dtypes versus Numpy.

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np
import pyarrow as pa
import time

NUM_ROWS = 500_000
df = pd.DataFrame({"A": np.arange(NUM_ROWS) % 30, "B": np.arange(NUM_ROWS)+1.0})
print(df.dtypes)
df2 = df.astype({"A": pd.ArrowDtype(pa.int64()), "B": pd.ArrowDtype(pa.float64())})
print(df2.dtypes)

t0 = time.time()
df.apply(lambda r: 0 if r.A == 0 else (r.B // r.A), axis=1)
print(f"Non-Arrow time: {time.time() - t0:.2f} seconds")

t0 = time.time()
df2.apply(lambda r: 0 if r.A == 0 else (r.B // r.A), axis=1)
print(f"Arrow time: {time.time() - t0:.2f} seconds")

Output with Pandas 2.3 on a local M1 Mac (tested on main branch too).

A      int64
B    float64
dtype: object
A     int64[pyarrow]
B    double[pyarrow]
dtype: object
Non-Arrow time: 3.21 seconds
Arrow time: 16.66 seconds

Installed Versions

INSTALLED VERSIONS

commit : 2cc3762
python : 3.13.5
python-bits : 64
OS : Darwin
OS-release : 24.3.0
Version : Darwin Kernel Version 24.3.0: Thu Jan 2 20:24:16 PST 2025; root:xnu-11215.81.4~3/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.3.0
numpy : 2.2.6
pytz : 2025.2
dateutil : 2.9.0.post0
pip : 25.1.1
Cython : 3.1.2
sphinx : None
IPython : 9.3.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.13.4
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2025.5.1
html5lib : None
hypothesis : None
gcsfs : 2025.5.1
jinja2 : None
lxml.etree : None
matplotlib : 3.10.3
numba : 0.61.2
numexpr : None
odfpy : None
openpyxl : 3.1.5
pandas_gbq : None
psycopg2 : 2.9.10
pymysql : 1.4.6
pyarrow : 19.0.0
pyreadstat : None
pytest : 8.4.1
python-calamine : None
pyxlsb : None
s3fs : 2025.5.1
scipy : 1.15.2
sqlalchemy : 2.0.41
tables : None
tabulate : None
xarray : None
xlrd : 2.0.1
xlsxwriter : 3.2.5
zstandard : 0.23.0
tzdata : 2025.2
qtpy : None
pyqt5 : None

Prior Performance

No response

Do any hotspots show up in profiling?

Profiler output is a bit hard to read as usual. Here are some snakeviz screenshots. The fast_xs function has different code paths for ExtensionDtype that look suspicious to me; find_common_type and _from_sequence also stand out.

[snakeviz screenshots]
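
To reproduce the profile locally, a minimal sketch (assuming df2 from the repro above is in scope; the output filename is just an example):

import cProfile
import pstats

# Profile only the Arrow-dtype apply from the repro (df2 as defined above).
with cProfile.Profile() as prof:
    df2.apply(lambda r: 0 if r.A == 0 else (r.B // r.A), axis=1)

stats = pstats.Stats(prof).sort_stats("cumulative")
stats.print_stats(20)

# Or dump the stats and open them in snakeviz:
# stats.dump_stats("apply_arrow.prof")  # then run: snakeviz apply_arrow.prof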

That's a tough one. In core.apply.FrameColumnApply.series_generator we have a fastpath that only works with numpy dtypes.

We might be able to get some mileage for EA dtypes by changing

            for i in range(len(obj)):
                yield obj._ixs(i, axis=0)

to something like

dtype = ser.dtype
for i in range(len(obj)):
    # hypothetical helper: the row-extraction part of fast_xs, with the
    # interleaved_dtype/find_common_type work hoisted out of the loop
    new_vals = the_part_of_fast_xs_after_interleaved_dtype_is_called()
    new_arr = type(ser.array)._from_sequence(new_vals, dtype=dtype)
    # ser and obj come from the surrounding series_generator scope;
    # name would be the row label, i.e. obj.index[i]
    yield obj._constructor_sliced(new_arr, name=name)

That would save roughly 20% of the runtime by avoiding the repeated interleaved_dtype calls.
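
A quick way to get a feel for the per-row cost on the EA path, as a rough sketch (timings will vary by machine; the frames are rebuilt from the repro above):

import timeit
import numpy as np
import pandas as pd
import pyarrow as pa

NUM_ROWS = 500_000
df = pd.DataFrame({"A": np.arange(NUM_ROWS) % 30, "B": np.arange(NUM_ROWS) + 1.0})
df2 = df.astype({"A": pd.ArrowDtype(pa.int64()), "B": pd.ArrowDtype(pa.float64())})

# The EA branch builds a fresh Series per row via obj._ixs(i, axis=0), which is
# essentially what .iloc[i] does; the numpy fastpath instead reuses a single
# Series and swaps its data. Comparing single-row cross-sections gives a rough
# sense of the per-row construction cost on the Arrow side.
print("numpy row:", timeit.timeit(lambda: df.iloc[0], number=10_000))
print("arrow row:", timeit.timeit(lambda: df2.iloc[0], number=10_000))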

This difference makes sense, but what's confusing is that the performance issue goes away if one of the columns is changed to string:

import pandas as pd
import numpy as np
import pyarrow as pa
import time

NUM_ROWS = 500_000
df = pd.DataFrame({"A": np.arange(NUM_ROWS) % 30, "B": np.arange(NUM_ROWS).astype(str)})
print(df.dtypes)
df2 = df.astype({"A": pd.ArrowDtype(pa.int64()), "B": pd.ArrowDtype(pa.large_string())})
print(df2.dtypes)

t0 = time.time()
df.apply(lambda r: 0 if r.A == 0 else (int(r.B) // r.A), axis=1)
print(f"Non-Arrow time: {time.time() - t0:.2f} seconds")

t0 = time.time()
df2.apply(lambda r: 0 if r.A == 0 else (int(r.B) // r.A), axis=1)
print(f"Arrow time: {time.time() - t0:.2f} seconds")

Output:

A     int64
B    object
dtype: object
A           int64[pyarrow]
B    large_string[pyarrow]
dtype: object
Non-Arrow time: 3.35 seconds
Arrow time: 3.21 seconds

I'd have to look at the profiling output, but my prior puts a lot of weight on "object dtype is just that bad".

But in this case object dtype is basically the same as NumPy numeric dtype in the non-Arrow cases (3.35 vs 3.21 seconds; compare the first numbers in the two outputs). The difference is that pa.large_string() is a lot faster than pa.float64() in the Arrow cases.

In this case df2.iloc[0] has object dtype even though the columns have pyarrow dtypes, so the iteration in series_generator goes through the numpy fastpath.
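
A small check makes this concrete (a sketch; the dtype reprs in the comments are what I'd expect on recent pandas rather than verified output):

import pandas as pd
import pyarrow as pa

# All-numeric Arrow columns: the row cross-section interleaves to a pyarrow
# dtype, so apply goes through the slower ExtensionDtype branch.
df_num = pd.DataFrame({"A": [1, 2], "B": [1.0, 2.0]}).astype(
    {"A": pd.ArrowDtype(pa.int64()), "B": pd.ArrowDtype(pa.float64())}
)
print(df_num.iloc[0].dtype)    # expected: double[pyarrow]

# Mixed int64[pyarrow] + large_string[pyarrow]: the common row dtype is object,
# so apply takes the numpy/object fastpath instead.
df_mixed = pd.DataFrame({"A": [1, 2], "B": ["1", "2"]}).astype(
    {"A": pd.ArrowDtype(pa.int64()), "B": pd.ArrowDtype(pa.large_string())}
)
print(df_mixed.iloc[0].dtype)  # expected: object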

(Looking at the profiling results, I think we can trim a bunch by changing the is_object_dtype check in _can_hold_identifiers_and_holds_name to just self.dtype == object.)
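
For context, a micro-benchmark sketch of the two checks being compared (the index mimics the row labels an axis=1 apply produces):

import timeit
import pandas as pd
from pandas.api.types import is_object_dtype

# Attribute access like r.A on each row Series goes through
# Index._can_hold_identifiers_and_holds_name (via NDFrame.__getattr__),
# so this dtype check runs roughly once per row per attribute accessed.
idx = pd.Index(["A", "B"])

print(timeit.timeit(lambda: is_object_dtype(idx.dtype), number=1_000_000))
print(timeit.timeit(lambda: idx.dtype == object, number=1_000_000))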

Ok, makes sense. Thanks for the explanation.