FEEDBACK: PyArrow as a required dependency and PyArrow backed strings
phofl opened this issue · 154 comments
This is an issue to collect feedback on the decision to make PyArrow a required dependency and to infer strings as PyArrow backed strings by default.
The background for this decision can be found here: https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html
If you would like to filter this warning without installing pyarrow at this time, please view this comment: #54466 (comment)
Something that hasn't received enough attention/discussion, at least in my mind, is this piece of the Drawbacks section of the PDEP (bolding added by me):
Including PyArrow would naturally increase the installation size of pandas. For example, installing pandas and PyArrow using pip from wheels, numpy and pandas requires about 70MB, and including PyArrow requires an additional 120MB. An increase of installation size would have negative implication using pandas in space-constrained development or deployment environments such as AWS Lambda.
I honestly don't understand how mandating a 170% increase in the effective size of a pandas installation (70MB to 190MB, from the numbers in the quoted text) can be considered okay.
For that kind of increase, I would expect/want the tradeoff to be major improvements across the board. Instead, this change comes with limited benefit but massive bloat for anyone who doesn't need the features PyArrow enables, e.g. for those who don't have issues with the current functionality of pandas.
Debian/Ubuntu have system packages for pandas but not pyarrow, which would no longer be possible. (System packages are not allowed to depend on non-system packages.)
I don't know whether creating a system package of pyarrow is possible with reasonable effort, or whether this would make the system pandas packages impossible to update (and eventually require their removal when old pandas was no longer compatible with current Python/numpy).
For that kind of increase, I would expect/want the tradeoff to be major improvements across the board.
Yeah unfortunately this is where the subjective tradeoff comes into effect. pytz and dateutil as required dependencies have a similar issue for users who do not need timezone or date parsing support respectively. The hope with pyarrow is that the tradeoff improves the current functionality for common "object" types in pandas such as text, binary, decimal, and nested data.
Debian/Ubuntu have system packages for pandas but not pyarrow, which would no longer be possible.
AFAIK most pydata projects don't actually publish/manage Linux system packages for their respective libraries. Do you know how these are packaged today?
pytz and dateutil as required dependencies have a similar issue for users who do not need timezone or date parsing support respectively.
The pytz and dateutil wheels are only ~500kb. Drawing a comparison between them and PyArrow seems like a stretch, to put it lightly.
Do you know how these are packaged today?
By whoever offers to do it, currently me for pandas. Of the pydata projects, Debian currently has pydata-sphinx-theme, sparse, patsy, xarray and numexpr.
An old discussion thread (anyone can post there, but be warned that doing so will expose your non-spam-protected email address) suggests that there is existing work on a pyarrow Debian package, but I don't yet know whether it ever got far enough to work.
Hi,
Thanks for welcoming feedback from the community.
While I respect your decision, I am afraid that making pyarrow a required dependency will come with costly consequences for users and downstream libraries' developers and maintainers, for two reasons:
- installing pyarrow after pandas in a fresh conda environment increases its size from approximately 100 MiB to approximately 500 MiB:
Package sizes
libgoogle-cloud-2.12.0-h840a212_1 : 46106632 bytes,
python-3.11.4-hab00c5b_0_cpython : 30679695 bytes,
libarrow-12.0.1-h10ac928_8_cpu : 27696900 bytes,
ucx-1.14.1-h4a2ce2d_3 : 15692979 bytes,
pandas-2.0.3-py311h320fe9a_1 : 14711359 bytes,
numpy-1.25.2-py311h64a7726_0 : 8139293 bytes,
libgrpc-1.56.2-h3905398_1 : 6331805 bytes,
libopenblas-0.3.23-pthreads_h80387f5_0 : 5406072 bytes,
aws-sdk-cpp-1.10.57-h85b1a90_19 : 4055495 bytes,
pyarrow-12.0.1-py311h39c9aba_8_cpu : 3989550 bytes,
libstdcxx-ng-13.1.0-hfd8a6a1_0 : 3847887 bytes,
rdma-core-28.9-h59595ed_1 : 3735644 bytes,
libthrift-0.18.1-h8fd135c_2 : 3584078 bytes,
tk-8.6.12-h27826a3_0 : 3456292 bytes,
openssl-3.1.2-hd590300_0 : 2646546 bytes,
libprotobuf-4.23.3-hd1fb520_0 : 2506133 bytes,
libgfortran5-13.1.0-h15d22d2_0 : 1437388 bytes,
pip-23.2.1-pyhd8ed1ab_0 : 1386212 bytes,
krb5-1.21.2-h659d440_0 : 1371181 bytes,
libabseil-20230125.3-cxx17_h59595ed_0 : 1240376 bytes,
orc-1.9.0-h385abfd_1 : 1020883 bytes,
ncurses-6.4-hcb278e6_0 : 880967 bytes,
pygments-2.16.1-pyhd8ed1ab_0 : 853439 bytes,
jedi-0.19.0-pyhd8ed1ab_0 : 844518 bytes,
libsqlite-3.42.0-h2797004_0 : 828910 bytes,
libgcc-ng-13.1.0-he5830b7_0 : 776294 bytes,
ld_impl_linux-64-2.40-h41732ed_0 : 704696 bytes,
libnghttp2-1.52.0-h61bc06f_0 : 622366 bytes,
ipython-8.14.0-pyh41d4057_0 : 583448 bytes,
bzip2-1.0.8-h7f98852_4 : 495686 bytes,
setuptools-68.1.2-pyhd8ed1ab_0 : 462324 bytes,
zstd-1.5.2-hfc55251_7 : 431126 bytes,
libevent-2.1.12-hf998b51_1 : 427426 bytes,
libgomp-13.1.0-he5830b7_0 : 419184 bytes,
xz-5.2.6-h166bdaf_0 : 418368 bytes,
libcurl-8.2.1-hca28451_0 : 372511 bytes,
s2n-1.3.48-h06160fa_0 : 369441 bytes,
aws-crt-cpp-0.21.0-hb942446_5 : 320415 bytes,
readline-8.2-h8228510_1 : 281456 bytes,
libssh2-1.11.0-h0841786_0 : 271133 bytes,
prompt-toolkit-3.0.39-pyha770c72_0 : 269068 bytes,
libbrotlienc-1.0.9-h166bdaf_9 : 265202 bytes,
python-dateutil-2.8.2-pyhd8ed1ab_0 : 245987 bytes,
re2-2023.03.02-h8c504da_0 : 201211 bytes,
aws-c-common-0.9.0-hd590300_0 : 197608 bytes,
aws-c-http-0.7.11-h00aa349_4 : 194366 bytes,
pytz-2023.3-pyhd8ed1ab_0 : 186506 bytes,
aws-c-mqtt-0.9.3-hb447be9_1 : 162493 bytes,
aws-c-io-0.13.32-h4a1a131_0 : 154523 bytes,
ca-certificates-2023.7.22-hbcca054_0 : 149515 bytes,
lz4-c-1.9.4-hcb278e6_0 : 143402 bytes,
python-tzdata-2023.3-pyhd8ed1ab_0 : 143131 bytes,
libedit-3.1.20191231-he28a2e2_2 : 123878 bytes,
keyutils-1.6.1-h166bdaf_0 : 117831 bytes,
tzdata-2023c-h71feb2d_0 : 117580 bytes,
gflags-2.2.2-he1b5a44_1004 : 116549 bytes,
glog-0.6.0-h6f12383_0 : 114321 bytes,
c-ares-1.19.1-hd590300_0 : 113362 bytes,
libev-4.33-h516909a_1 : 106190 bytes,
aws-c-auth-0.7.3-h28f7589_1 : 101677 bytes,
libutf8proc-2.8.0-h166bdaf_0 : 101070 bytes,
traitlets-5.9.0-pyhd8ed1ab_0 : 98443 bytes,
aws-c-s3-0.3.14-hf3aad02_1 : 86553 bytes,
libexpat-2.5.0-hcb278e6_1 : 77980 bytes,
libbrotlicommon-1.0.9-h166bdaf_9 : 71065 bytes,
parso-0.8.3-pyhd8ed1ab_0 : 71048 bytes,
libzlib-1.2.13-hd590300_5 : 61588 bytes,
libffi-3.4.2-h7f98852_5 : 58292 bytes,
wheel-0.41.1-pyhd8ed1ab_0 : 57374 bytes,
aws-c-event-stream-0.3.1-h2e3709c_4 : 54050 bytes,
aws-c-sdkutils-0.1.12-h4d4d85c_1 : 53123 bytes,
aws-c-cal-0.6.1-hc309b26_1 : 50923 bytes,
aws-checksums-0.1.17-h4d4d85c_1 : 50001 bytes,
pexpect-4.8.0-pyh1a96a4e_2 : 48780 bytes,
libnuma-2.0.16-h0b41bf4_1 : 41107 bytes,
snappy-1.1.10-h9fff704_0 : 38865 bytes,
typing_extensions-4.7.1-pyha770c72_0 : 36321 bytes,
libuuid-2.38.1-h0b41bf4_0 : 33601 bytes,
libbrotlidec-1.0.9-h166bdaf_9 : 32567 bytes,
libnsl-2.0.0-h7f98852_0 : 31236 bytes,
wcwidth-0.2.6-pyhd8ed1ab_0 : 29133 bytes,
asttokens-2.2.1-pyhd8ed1ab_0 : 27831 bytes,
stack_data-0.6.2-pyhd8ed1ab_0 : 26205 bytes,
executing-1.2.0-pyhd8ed1ab_0 : 25013 bytes,
_openmp_mutex-4.5-2_gnu : 23621 bytes,
libgfortran-ng-13.1.0-h69a702a_0 : 23182 bytes,
libcrc32c-1.1.2-h9c3ff4c_0 : 20440 bytes,
aws-c-compression-0.2.17-h4d4d85c_2 : 19105 bytes,
ptyprocess-0.7.0-pyhd3deb0d_0 : 16546 bytes,
pure_eval-0.2.2-pyhd8ed1ab_0 : 14551 bytes,
libblas-3.9.0-17_linux64_openblas : 14473 bytes,
liblapack-3.9.0-17_linux64_openblas : 14408 bytes,
libcblas-3.9.0-17_linux64_openblas : 14401 bytes,
six-1.16.0-pyh6c4a22f_0 : 14259 bytes,
backcall-0.2.0-pyh9f0ad1d_0 : 13705 bytes,
matplotlib-inline-0.1.6-pyhd8ed1ab_0 : 12273 bytes,
decorator-5.1.1-pyhd8ed1ab_0 : 12072 bytes,
backports.functools_lru_cache-1.6.5-pyhd8ed1ab_0 : 11519 bytes,
pickleshare-0.7.5-py_1003 : 9332 bytes,
prompt_toolkit-3.0.39-hd8ed1ab_0 : 6731 bytes,
backports-1.0-pyhd8ed1ab_3 : 5950 bytes,
python_abi-3.11-3_cp311 : 5682 bytes,
_libgcc_mutex-0.1-conda_forge : 2562 bytes,
- pyarrow also depends on libarrow, which itself depends on several notable C and C++ libraries. This constrains the installation of other packages whose dependencies might be incompatible with libarrow's, making pandas potentially unusable in some contexts.
Have you considered those two observations as drawbacks before taking the decision?
While I respect your decision, I am afraid that making pyarrow a required dependency will come with costly consequences for users and downstream libraries' developers and maintainers [...] Have you considered those two observations as drawbacks before taking the decision?
This is discussed a bit in https://github.com/pandas-dev/pandas/pull/52711/files#diff-3fc3ce7b7d119c90be473d5d03d08d221571c67b4f3a9473c2363342328535b2R179-R193
(for pip only I guess).
While currently the build size for pyarrow is pretty large, it doesn't "have" to be that big. I think by pandas 3.0 (when pyarrow will actually become required), at least some components will be spun out/made optional/something like that (I heard that the Arrow people were talking about this).
(cc @jorisvandenbossche for more info on this)
I'm not an Arrow dev myself, but if it is something that just needs someone to look at it, I'm happy to put some time in to help give Arrow a nudge in the right direction.
Finally, for clarity purposes, is the reason for concern also AWS lambda/pyodide/Alpine, or something else?
(IMO, outside of stuff like lambda funcs, pyarrow isn't too egregious in terms of package size compared to pytorch/tensorflow, but it's definitely something that can be improved)
If libarrow is slimmed down by having non-essential Arrow features extracted into other libraries which could be optional dependencies, I think most people's concerns would be addressed.
Edit: See conda-forge/arrow-cpp-feedstock#1035
Hi,
Thanks for welcoming feedback from the community.
For wasm builds of python / python-packages (ie pyodide / emscripten-forge) package size really matters since these packages have to be downloaded from within the browser. Once a package is too big, usability suffers drastically.
With pyarrow as a required dependency, pandas is less usable from Python in the browser.
Debian/Ubuntu have system packages for pandas but not pyarrow, which would no longer be possible. (System packages are not allowed to depend on non-system packages.)
I don't know whether creating a system package of pyarrow is possible with reasonable effort, or whether this would make the system pandas packages impossible to update (and eventually require their removal when old pandas was no longer compatible with current Python/numpy).
There is another way - use virtual environments in user space instead of system python. The Python Software Foundation recommends users create virtual environments; and Debian/Ubuntu want users to leave the system python untouched to avoid breaking system python.
Perhaps Pandas could add some warnings or error messages on install to steer people to virtualenv. This approach might avoid, or at least defer, the work of adding pyarrow to APT as well as the risks of users breaking system python. Also, when I'm building projects I might want a much later version of pandas/pyarrow than would ever ship on Debian, given the release strategy/timing delay.
On the other hand, arrow backend has significant advantages and with the rise of other important packages in the data space that also use pyarrow (polars, dask, modin), perhaps there is sufficient reason to add pyarrow to APT sources.
A good summary that might be worth checking out is Externally managed environments. The original PEP 668 is found here.
I think it's the right path for performance in WASM.
This is a good idea!
But I think there are two other important features that should also be implemented besides strings:
- Zero-copy for multi-index dataframes. Currently, a multi-index dataframe cannot be converted from an arrow table with zero copy (zero_copy_only=True), which is a BIGGER problem for big dataframes. You can reset_index() the dataframe, convert it to an arrow table, and convert the arrow table back to a dataframe with zero copy, but in the end you must call set_index() on the dataframe to get the multi-index back, and then a copy happens (see the sketch after this list).
- Zero-copy for pandas.concat. Arrow table concat can be zero-copy, but when concatenating two zero-copy dataframes (converted from arrow tables), a copy happens even when pandas COW is turned on. Also, currently, trying to concat two arrow tables and then convert the result to a dataframe with zero_copy_only=True is not allowed because the chunk number is > 1.
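As an illustration of the first point, here is a rough, hypothetical sketch of the round trip described above (not from the original comment; the column and index names are made up): the data columns can travel through Arrow without copies, but set_index() rebuilds the index and triggers a copy.
import pandas as pd
import pyarrow as pa

# Build a small frame with a MultiIndex.
df = pd.DataFrame(
    {"a": [1.0, 2.0], "b": [3.0, 4.0]},
    index=pd.MultiIndex.from_tuples([("x", 1), ("y", 2)], names=["k1", "k2"]),
)
table = pa.Table.from_pandas(df.reset_index())  # index levels become plain columns
back = table.to_pandas()                        # numeric columns can be zero-copy
back = back.set_index(["k1", "k2"])             # the copy happens here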
Regarding concat: This should already be zero copy:
df = pd.DataFrame({"a": [1, 2, 3]}, dtype="int64[pyarrow]")
df2 = pd.DataFrame({"a": [1, 2, 3]}, dtype="int64[pyarrow]")
x = pd.concat([df, df2])
This creates a new dataframe that has 2 pyarrow chunks.
Can you open a separate issue if this is not what you are looking for?
@phofl Thanks for your reply, but your example may be too simple. Please view the following code (pandas 2.0.3 and pyarrow 12.0 / pandas 2.1.0 and pyarrow 13.0):
import pandas as pd
import pyarrow as pa

with pa.memory_map("d:\\1.arrow", 'r') as source1, pa.memory_map("d:\\2.arrow", 'r') as source2, pa.memory_map("d:\\3.arrow", 'r') as source3, pa.memory_map("d:\\4.arrow", 'r') as source4:
    c1 = pa.ipc.RecordBatchFileReader(source1).read_all().column("p")
    c2 = pa.ipc.RecordBatchFileReader(source2).read_all().column("v")
    c3 = pa.ipc.RecordBatchFileReader(source3).read_all().column("p")
    c4 = pa.ipc.RecordBatchFileReader(source4).read_all().column("v")
    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
    s1 = c1.to_pandas(zero_copy_only=True)
    s2 = c2.to_pandas(zero_copy_only=True)
    s3 = c3.to_pandas(zero_copy_only=True)
    s4 = c4.to_pandas(zero_copy_only=True)
    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
    dfs = {"p": s1, "v": s2}
    df1 = pd.concat(dfs, axis=1, copy=False)  # zero-copy
    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
    dfs2 = {"p": s3, "v": s4}
    df2 = pd.concat(dfs2, axis=1, copy=False)  # zero-copy
    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
    # NOT zero-copy
    result_df = pd.concat([df1, df2], axis=0, copy=False)

with pa.memory_map("z1.arrow", 'r') as source1, pa.memory_map("z2.arrow", 'r') as source2:
    table1 = pa.ipc.RecordBatchFileReader(source1).read_all()
    table2 = pa.ipc.RecordBatchFileReader(source2).read_all()
    combined_table = pa.concat_tables([table1, table2])
    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))  # Zero-copy
    df1 = table1.to_pandas(zero_copy_only=True)
    df2 = table2.to_pandas(zero_copy_only=True)
    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))  # Zero-copy
    # Use pandas to concat two zero-copy dataframes,
    # but a copy happens
    result_df = pd.concat([df1, df2], axis=0, copy=False)
    # Try to convert the arrow table to pandas directly:
    # this raises an exception because the chunk number is 2
    df3 = combined_table.to_pandas(zero_copy_only=True)
    # Combining chunks into one will cause a copy
    combined_table = combined_table.combine_chunks()
If this happens, would dtype='string' and dtype='string[pyarrow]' be merged into one implementation?
We're currently thinking about coercing strings in our library, but hesitating because of the unclear future here.
The fact that they still don't have Python 3.12 wheels up is worrisome.
Arrow is a beast to build, and even harder to fit into a wheel properly (so you get less features, and things like using the slimmed-down libarrow will be harder to pull off).
Conda-forge builds for py312 have been available for a month already though, and are ready in principle to ship pyarrow with a minimal libarrow. That still needs some usability improvements, but it's getting there.
Without weighing in on whether this is a good idea or a bad one, Fedora Linux already has a libarrow package that provides python3-pyarrow, so I think this shouldn't be a real problem for us from a packaging perspective.
I'm not saying that Pandas is easy to keep packaged, up to date, and coordinated with its dependencies and reverse dependencies! Just that a hard dependency on PyArrow wouldn't necessarily make the situation worse for us.
@h-vetinari Almost there? :-)
@h-vetinari Almost there? :-)
There is still a lot of work to be done on the wheels side, but for conda, after the work we did to divide the CPP library, I created this PR which is currently under discussion in order to provide both a pyarrow-base that only depends on libarrow and libparquet, and a pyarrow which would pull in all the Arrow CPP dependencies. Both have been built with support for everything, so depending on pyarrow-base and libarrow-dataset would allow the use of pyarrow.dataset, etc.
Thanks for requesting feedback. I'm not well versed on the technicalities, but I strongly prefer to not require pyarrow as a dependency. It's better imo to let users choose to use PyArrow if they desire. I prefer to use the default NumPy object type or pandas' StringDType without the added complexity of PyArrow.
If this happens, would dtype='string' and dtype='string[pyarrow]' be merged into one implementation?
We're currently thinking about coercing strings in our library, but hesitating because of the unclear future here.
Sorry for the slow response. dtype="string" will be arrow-backed starting from 3.0, or when you activate the infer_string option.
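For anyone who wants to try the future behaviour today, a minimal sketch (assuming pandas >= 2.1 with pyarrow installed; the printed dtype name may differ between versions):
import pandas as pd

# Opt in to the future default string inference.
pd.set_option("future.infer_string", True)
s = pd.Series(["a", "b"])
print(s.dtype)  # expected: string[pyarrow_numpy] rather than object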
From the PDEP:
Starting in pandas 2.2, pandas raises a FutureWarning when PyArrow is not installed in the users environment when pandas is imported. This will ensure that only one warning is raised and users can easily silence it if necessary. This warning will point to the feedback issue.
Is this still planned? It doesn't seem to be occurring in 2.2.0rc0.
From the PDEP:
Starting in pandas 2.2, pandas raises a FutureWarning when PyArrow is not installed in the users environment when pandas is imported. This will ensure that only one warning is raised and users can easily silence it if necessary. This warning will point to the feedback issue.
Is this still planned? It doesn't seem to be occurring in 2.2.0rc0.
I think we are going to add a DeprecationWarning now.
(It's not currently in master now, but I'm planning on putting in a warning before the actual release of 2.2).
Hi, I don't know much about PyArrow overall but when it comes to saving large dataframes as CSV files, I detected that Pandas was being super slow and decided to give PyArrow a try instead, and the difference in performance was astounding, 8x times faster. For a 1GB, all np.float64 dataset:
- pandas_df.to_csv(): Time to save: 45.128990650177 seconds.
- pyarrow.csv.write_csv(): Time to save: 6.1338911056518555 seconds.
I tried stuff like different chunksizes and index=False but it did not help.
However, then I tested PyArrow for reading the exact same dataset and it was 2x slower than Pandas:
- Time to read CSV (pyarrow): 14.770382642745972 seconds.
- Time to read CSV (pandas): 8.440594673156738 seconds.
So, my suggestion I guess would be, to see which tasks are being done more efficiently by PyArrow and incorporate those, and the things that are faster/better in Pandas can stay the same (or maybe PyArrow can incorporate them).
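For context, here is a rough sketch of how such a write-speed comparison could be reproduced (hypothetical file names and data size, not the commenter's original script; timings will vary by machine):
import time
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.csv as pacsv

# Build a purely float64 frame, loosely mimicking the dataset described above.
df = pd.DataFrame(np.random.rand(1_000_000, 10), columns=[f"c{i}" for i in range(10)])

start = time.perf_counter()
df.to_csv("pandas_out.csv", index=False)
print("pandas to_csv:", time.perf_counter() - start, "s")

start = time.perf_counter()
pacsv.write_csv(pa.Table.from_pandas(df), "pyarrow_out.csv")
print("pyarrow write_csv:", time.perf_counter() - start, "s")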
That's exactly what we intend to do. The csv default engine will stay the same for the time being
That's exactly what we intend to do. The csv default engine will stay the same for the time being
Thanks for your answer Patrick. I missed that there is already an issue open to add the pyarrow engine to the to_csv method here, so clearly I'm half a year late to the party. Excuse me for rushing to post; should I delete my previous post?
My initial experience with pandas 2.2.0 + pyarrow is that the test suite crashes CPython on assertions. I will report a bug once I get a clear traceback. This will take some time, as I suppose I need to run them without xdist.
My initial experience with pandas 2.2.0 + pyarrow is that the test suite crashes CPython on assertions. I will report a bug once I get a clear traceback. This will take some time, as I suppose I need to run them without xdist.
I'm sorry but I can't reproduce anymore. I have had apache-arrow built without all the necessary features, and I've fixed that while testing in serial, so my only guess is that the crashes were due to bad error handling when running tests with xdist. I'm sorry for the noise.
pyarrow isn't compatible with the most recent versions of numpy (on 1.26):
pyarrow 0.15.0 would require
└── numpy >=1.16,<1.20.0a0, which conflicts with any installable versions previously reported;
Pyarrow 15 is the newest release, not 0.15
NumPy is planning to add support for UTF-8 variable-width string DTypes in NEP 55.
Also, if PyArrow is truly going to be a required dependency in Pandas 3.0, then I don't see the point of the current DeprecationWarning in pandas 2.2.0. All sane package managers install required dependencies automatically, so users don't need to take any action anyway.
And as for my opinion: I personally find working with Pandas already complicated enough. So I'm afraid that throwing PyArrow is going to make things worse in that aspect.
In other words:
But as has been said before, the potential benefits haven't been made very clear (yet?), so it's hard to give constructive feedback.
@phofl: I think it would be valuable that pandas' maintainers provide reasons for having pandas 3 require PyArrow as a dependency.
Motivation is briefly outlined in PDEP 10.
pyarrow is already integrated in parts of pandas, and it will most likely provide a way to ensure that pandas works well not only with small amounts of data but also with huge data, where it is not the best option at the moment.
Also, if PyArrow is truly going to be a required dependency in Pandas 3.0, then I don't see the point of the current DeprecationWarning in pandas 2.2.0. All sane package managers install required dependencies automatically, so users don't need to take any action anyway.
I have the same question - could someone point me to the justification for why the DeprecationWarning was added? Why do users need to manually install pyarrow now, or be told that a new dependency will be required in a release that isn't even out yet?
thanks
The deprecation warning is ok - but I would like to have a specific pyarrow "extra" of the pandas package, so that I know my version matches pandas' expectations.
Currently, three extras install pyarrow: "feather", "parquet", and "all".
It would be nice to add "pyarrow" extra until pandas 3.0 is out, which enables the following:
pip install "pandas[pyarrow]"
Thanks for taking feedback from the community.
PDEP 10 lists the following benefits for making pyarrow a required dependency:
Immediate User Benefit 1: pyarrow strings
Immediate User Benefit 2: Nested Datatypes
Immediate User Benefit 3: Interoperability
From my pov, none of these benefit the typical pandas user, unless they already use pyarrow. If they don't, they probably don't need the complexity that pyarrow brings with it (as any package of that magnitude does). In this sense I don't feel the rationale given in the PDEP would find a majority in the wider community.
In my opinion, pyarrow should be kept as an optional extra for those users who may need it. This way everyone benefits, from small to large use cases. If pyarrow is made a required dependency, primarily large use cases benefit, while the majority of use cases incur quite a substantial cost (not least due to requiring more disk space, but also because it becomes more difficult to install pandas in some environments).
Thanks all for comments!
I can't say anything for certain yet, but I'll start by noting that it looks like this may not be a done deal.
On the numpy side: https://github.com/numpy/numpy/pull/25625/files
we will add implementations for the comparison operators as well as an add loop that accepts two string arrays, multiply loops that accept string and integer arrays, an isnan loop, and implementations for the str_len, isalpha, isdecimal, isdigit, isnumeric, isspace, find, rfind, count, strip, lstrip, rstrip, and replace string ufuncs that will be newly available in NumPy 2.0.
and on today's pandas community call, it was mentioned that
if there's a viable alternative to pyarrow strings, then maybe pyarrow doesn't need to be made required
More updates coming in due course
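For reference, a minimal sketch of what those NumPy-native string operations look like (an assumption-laden illustration, not part of the original comment; requires NumPy >= 2.0 with the NEP 55 variable-width string dtype):
import numpy as np

# Variable-width UTF-8 string array, no object dtype involved.
arr = np.array(["pandas", "pyarrow"], dtype=np.dtypes.StringDType())
print(np.strings.str_len(arr))        # [6 7]
print(np.strings.find(arr, "arrow"))  # [-1  2]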
Warning (from warnings module):
File ", line 1
import pandas as pd
DeprecationWarning:
Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at #54466
But I get this output. I don't want to get the warning message; I want to ignore it.
You can install pyarrow to silence the warning. In some other places we're thinking of switching to polars since this warning has come up.
Alternatively, if you want to just silence the warning for now:
import warnings
with warnings.catch_warnings():
warnings.filterwarnings(
"ignore",
message=r'\nPyarrow will become',
category=DeprecationWarning,
)
import pandas as pd
I wouldn't normally suggest silencing DeprecationWarnings, but given the circumstances this one may be different.
Alternatively, just pin pandas < 2.2 for now.
@MarcoGorelli I don't see people writing this much code on top of so many of their files/modules/notebooks to silence the warning. It's very annoying, and making CIs fail, where the only solution for those CIs is to add pyarrow to the deps, which itself is huge.
You can install pyarrow to silence the warning. In some other places we're thinking of switching to polars since this warning has come up.
How to install?
Data and DataFrame/Untitled.py:4: DeprecationWarning:
Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at #54466
import pandas as pd
FYI, AWS dependencies of pyarrow are another huge issue:
More updates coming in due course
As promised: #57073
Alternatively, if you want to just silence the warning for now:
It is quite unfortunate that the warning message starts with a newline, which makes it hard to target specifically by message with python -W or PYTHONWARNINGS, unless I missed something. For example, there is still a warning with this command:
python -W 'ignore:\nPyarrow:DeprecationWarning' -c 'import pandas'
I opened #57082 about it.
Please remove the deprecation warning that appears every time pandas is imported! For example, make it appear only if some specific file does not exist, and have the deprecation message tell the user which file to create to suppress the warning.
Note that pyarrow currently does not build with pypy: apache/arrow#19046
I checked just now and indeed found compilation failure:
FAILED: CMakeFiles/lib.dir/lib.cpp.o
/usr/bin/x86_64-pc-linux-gnu-g++ -DARROW_HAVE_RUNTIME_AVX2 -DARROW_HAVE_RUNTIME_AVX512 -DARROW_HAVE_RUNTIME_BMI2 -DARROW_HAVE_RUNTIME_SSE4_2 -DARROW_HAVE_SSE4_2 -Dlib_EXPORTS -I/tmp/portage/dev-python/pyarrow-14.0.2/work/apache-arrow-14.0.2/python/pyarrow/src -I/tmp/portage/dev-python/pyarrow-14.0.2/work/apache-arrow-14.0.2/python-pypy3/build/temp.linux-x86_64-pypy310/pyarrow/src -isystem /usr/include/pypy3.10 -isystem /usr/lib/pypy3.10/site-packages/numpy/core/include -Wno-noexcept-type -Wno-self-move -Wall -fno-semantic-interposition -msse4.2 -march=native -mtune=native -O3 -pipe -frecord-gcc-switches -flto=16 -fdiagnostics-color=always -march=native -mtune=native -O3 -pipe -frecord-gcc-switches -flto=16 -fno-omit-frame-pointer -Wno-unused-variable -Wno-maybe-uninitialized -O3 -DNDEBUG -O2 -ftree-vectorize -std=c++17 -fPIC -Wno-unused-function -Winvalid-pch -include /tmp/portage/dev-python/pyarrow-14.0.2/work/apache-arrow-14.0.2/python-pypy3/build/temp.linux-x86_64-pypy310/CMakeFiles/lib.dir/cmake_pch.hxx -MD -MT CMakeFiles/lib.dir/lib.cpp.o -MF CMakeFiles/lib.dir/lib.cpp.o.d -o CMakeFiles/lib.dir/lib.cpp.o -c /tmp/portage/dev-python/pyarrow-14.0.2/work/apache-arrow-14.0.2/python-pypy3/build/temp.linux-x86_64-pypy310/lib.cpp
In file included from /usr/include/pypy3.10/Python.h:55,
from /tmp/portage/dev-python/pyarrow-14.0.2/work/apache-arrow-14.0.2/python/pyarrow/src/arrow/python/platform.h:27,
from /tmp/portage/dev-python/pyarrow-14.0.2/work/apache-arrow-14.0.2/python/pyarrow/src/arrow/python/pch.h:24,
from /tmp/portage/dev-python/pyarrow-14.0.2/work/apache-arrow-14.0.2/python-pypy3/build/temp.linux-x86_64-pypy310/CMakeFiles/lib.dir/cmake_pch.hxx:5,
from <command-line>:
/tmp/portage/dev-python/pyarrow-14.0.2/work/apache-arrow-14.0.2/python-pypy3/build/temp.linux-x86_64-pypy310/lib.cpp: In function 'PyObject* __pyx_pf_7pyarrow_3lib_17SignalStopHandler_6__exit__(__pyx_obj_7pyarrow_3lib_SignalStopHandler*, PyObject*, PyObject*, PyObject*)':
/tmp/portage/dev-python/pyarrow-14.0.2/work/apache-arrow-14.0.2/python-pypy3/build/temp.linux-x86_64-pypy310/lib.cpp:41444:7: error: 'PyPyErr_SetInterrupt' was not declared in this scope; did you mean 'PyErr_SetInterrupt'?
41444 |       PyErr_SetInterrupt();
      |       ^~~~~~~~~~~~~~~~~~
VirusTotal is not always happy with Pyarrow wheels... example on 15.0 https://www.virustotal.com/gui/file/17d53a9d1b2b5bd7d5e4cd84d018e2a45bc9baaa68f7e6e3ebed45649900ba99
+1 to making it easier to silence the warning. I have no opinion on the pyarrow dependency change, but the red warning text in notebook outputs is distracting when they're meant to be published or shared with colleagues.
VirusTotal is not always happy with Pyarrow wheels... example on 15.0 https://www.virustotal.com/gui/file/17d53a9d1b2b5bd7d5e4cd84d018e2a45bc9baaa68f7e6e3ebed45649900ba99
Wasn't aware of that, thanks - is it happy with the current pandas wheels as they are? Is this fixable on the VirusTotal side, and if so, could it be reported to them?
It's happy with latest pandas wheels
Trying to simply install pyarrow to silence the DeprecationWarning causes our tests to fail, e.g.:
FAILED tests/core/test_meta.py::test_run_meta[test_sqlite_mp] - pyarrow.lib.ArrowNotImplementedError: Function 'not_equal' has no kernel matching input types (large_string, double)
I'm not entirely sure why this happens, and it only does when pandas[feather] is installed, not with pandas itself. So I guess I'll keep the warning until a much-appreciated migration guide clarifies how to address this issue (if pyarrow ends up being required).
@glatterf42 could you copy paste the test content?
Sure :)
There is more than one test, but they all boil down to the same line:
Full traceback of one test
______________________________________________________ test_run_meta[test_sqlite_mp] _______________________________________________________
test_mp = <ixmp4.core.platform.Platform object at 0x7ffae19bd150>, request = <FixtureRequest for <Function test_run_meta[test_sqlite_mp]>>
@all_platforms
def test_run_meta(test_mp, request):
test_mp = request.getfixturevalue(test_mp)
run1 = test_mp.runs.create("Model 1", "Scenario 1")
run1.set_as_default()
# set and update different types of meta indicators
> run1.meta = {"mint": 13, "mfloat": 0.0, "mstr": "foo"}
tests/core/test_meta.py:18:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
ixmp4/core/run.py:52: in meta
self._meta._set(meta)
ixmp4/core/run.py:122: in _set
self.backend.meta.bulk_upsert(df)
ixmp4/core/decorators.py:15: in wrapper
return checked_func(*args, **kwargs)
.venv/lib/python3.10/site-packages/pandera/decorators.py:754: in _wrapper
out = wrapped_(*validated_pos.values(), **validated_kwd)
ixmp4/data/auth/decorators.py:37: in guarded_func
return func(self, *args, **kwargs)
ixmp4/data/db/meta/repository.py:194: in bulk_upsert
super().bulk_upsert(type_df)
ixmp4/data/db/base.py:339: in bulk_upsert
self.bulk_upsert_chunk(df)
ixmp4/data/db/base.py:357: in bulk_upsert_chunk
cond.append(df[col] != df[updated_col])
.venv/lib/python3.10/site-packages/pandas/core/ops/common.py:76: in new_method
return method(self, other)
.venv/lib/python3.10/site-packages/pandas/core/arraylike.py:44: in __ne__
return self._cmp_method(other, operator.ne)
.venv/lib/python3.10/site-packages/pandas/core/series.py:6099: in _cmp_method
res_values = ops.comparison_op(lvalues, rvalues, op)
.venv/lib/python3.10/site-packages/pandas/core/ops/array_ops.py:330: in comparison_op
res_values = op(lvalues, rvalues)
.venv/lib/python3.10/site-packages/pandas/core/ops/common.py:76: in new_method
return method(self, other)
.venv/lib/python3.10/site-packages/pandas/core/arraylike.py:44: in __ne__
return self._cmp_method(other, operator.ne)
.venv/lib/python3.10/site-packages/pandas/core/arrays/arrow/array.py:704: in _cmp_method
result = pc_func(self._pa_array, self._box_pa(other))
.venv/lib/python3.10/site-packages/pyarrow/compute.py:246: in wrapper
return func.call(args, None, memory_pool)
pyarrow/_compute.pyx:385: in pyarrow._compute.Function.call
???
pyarrow/error.pxi:154: in pyarrow.lib.pyarrow_internal_check_status
???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E pyarrow.lib.ArrowNotImplementedError: Function 'not_equal' has no kernel matching input types (large_string, double)
pyarrow/error.pxi:91: ArrowNotImplementedError
Verbose description
The test is defined here with the fixtures coming from here and here.
The line in question is in ixmp4/data/db/base.py in the bulk_upsert_chunk() function. It combines a pandas.DataFrame from an existing and a to-be-added one and then tries to figure out which of the columns were updated. There's a limited set of columns that may be updated. During the combination process, the to-be-added columns receive a _y suffix to be distinguishable. If such an updatable column is found in the combined dataframe, a bool should be added to a list if it's truly different from the existing one. And precisely this condition check, df[col] != df[updated_col], fails when pyarrow is present.
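For what it's worth, a minimal sketch of what the error seems to boil down to (hypothetical values, not the ixmp4 code; assuming pandas 2.x with pyarrow installed): comparing an Arrow-backed string column against a float column delegates to pyarrow.compute.not_equal, which has no (large_string, double) kernel.
import pandas as pd
import pyarrow as pa

s_str = pd.Series(["foo", "bar"], dtype=pd.ArrowDtype(pa.large_string()))
s_num = pd.Series([1.0, 2.0])
s_str != s_num  # expected to raise pyarrow.lib.ArrowNotImplementedError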
Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
I am getting this error after trying to import pandas
A little late to the party, but wanted to add an objection from me due to the hugely increased installation size from PyArrow.
Primarily, this relates to AWS Lambda. I use Pandas significantly in the AWS Lambda environment, and this would cause headaches. I think it is just possible to get Pandas and PyArrow into a Lambda package, but means there is very little room for anything else in there.
I tried to experiment with this recently, and couldn't get it small enough to leave room for the other things I wanted in the package. I believe the work-around is to use containers with Lambda instead, but this requires a whole shift in deployment methodology for a single package dependency. There would be a further trade-off from the increased start times due to having to load a significantly larger package (or container).
I realise that this environment-specific objection may not have much weight, but my other comment would be:
Pandas is generally one of the first, approachable ways for new users to start playing around with data and data-science tools; specifically, a tool that can then be scaled towards more advanced usage. My experience has been that installing PyArrow can be a complex process, filled with pitfalls, which can turn what is currently a relatively simple installation into a real headache. I think that this change could really harm the approachability of Pandas and put off future users.
I would strongly request that PyArrow remain an optional dependency that advanced users (who by definition would be able to handle any installation requirements), can install and configure if necessary.
Next to pyarrow and numpy, related (recent) literature https://pola.rs/posts/polars-string-type/
Whenever I am using pandas, this PyArrow warning shows up, and I get this problem every time I run pandas in Python. Please help.
Sorry if I'm missing this somewhere, but is there a way to silence this warning?
is there a way to silence this warning?
Install pyarrow!
Or if you still want to avoid doing that for now, you can silence the warning with the stdlib warnings.filterwarnings function:
>>> import warnings
>>> warnings.filterwarnings("ignore", "\nPyarrow", DeprecationWarning)
>>> import pandas
(unfortunately it currently doesn't work as a -W command line argument or pytest config option, see #57082)
Perfect! Thanks @jorisvandenbossche
Warning (from warnings module):
File "C:/Git/Work/Pyton/Pandas_ecel.py", line 1
import pandas as pd
DeprecationWarning:
Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at #54466
@jagoodhand I may have got it wrong but, from my understanding, by the time PyArrow becomes a mandatory dependency of Pandas 3.0.0, that dependency will be a new package that doesn't exist today, built basically around libarrow and called pyarrow-minimal, that will be much, much smaller (in size) and more portable (in terms of CPU architectures, which may narrow the gap to NumPy's current availability in that matter) and will be released with PyArrow 15.
@h-vetinari & devs, please correct me if I'm wrong...
@ZupoLlask - if it addresses the two issues I mentioned:
- Package Size
- Installation complexity / compatibility / portability i.e. easily being able to install on different platforms
Then my objections aren't objections any more, but it doesn't sound like this is the case. Would be good to have more detail or confirmation on what this would look like though.
by the time PyArrow becomes a mandatory dependency of Pandas 3.0.0, that dependency will be a new package that doesn't exist today, built basically around libarrow and called pyarrow-minimal, that will be much, much smaller (in size) and more portable (in terms of CPU architectures, which may narrow the gap to NumPy's current availability in that matter) and will be released with PyArrow 15.
This is not exactly the case. Let me expand a little on what is happening at the moment:
The Arrow team did release Arrow and pyarrow 15.0.0 a couple of weeks ago. There is some ongoing work and effort from the Arrow community in reducing the footprint of minimal builds of Arrow. At the moment there is an open PR on the conda feedstock for Arrow, which I am working on, to be able to have several different installations for pyarrow. Based on review and design discussions, it seems there will be pyarrow-core, pyarrow and pyarrow-all with different subsets of features and sizes.
There is no change about the current CPU architectures supported but please if your system is not supported you can always open an issue or a feature request to the Arrow repository.
We still have to plan and do the work for published wheels on PyPI but this still requires planning and contributors to actively work on. Some issues that are related: apache/arrow#24688
We still have to plan and do the work for published wheels on PyPI but this still requires planning and contributors to actively work on. Some issues that are related: apache/arrow#24688
For the purpose of being able to package PyArrow in smaller wheels, I had created https://github.com/amol-/consolidatewheels but it would require some real world testing. https://github.com/amol-/wheeldeps was created as an example, but the more testing we can get, the faster we will be able to split pyarrow wheels
Well, yes, some people worry about RAM; 16 GB is the optimum for doing solid work, but everyone gauges their scope with their client.
This is not feedback on the decision to include pyarrow as a dependency, but rather on the usage of deprecation warnings to solicit feedback. Raising deprecation warnings (especially in the main __init__.py
) adds a lot of noise to downstream projects. It also creates a development burden for packages whose CI treats warnings as errors (see for example bokeh/bokeh#13656 and zapata-engineering/orquestra-cirq#53). Ideally deprecation warnings would be reserved for changes to the pandas public API and be raised only in the code paths affected by the change.
This is not feedback on the decision to include pyarrow as a dependency, but rather on the usage of deprecation warnings to solicit feedback. [...] Ideally deprecation warnings would be reserved for changes to the pandas public API and be raised only in the code paths affected by the change.
Even by your own logic, including a warning was the right choice. The inclusion of PyArrow will come with a major change to the pandas public API: "Starting in pandas 3.0, the default type inferred for string data will be ArrowDtype with pyarrow.string instead of object." (quoted from the PDEP)
However, I think a FutureWarning, as originally was proposed in the PDEP, would have made more sense than the DeprecationWarning that was implemented.
Regardless, if the deprecation warning creates issues for you, you can just install PyArrow to make it go away. If installing PyArrow would create issues for you, that's what this issue is for. Considering the change can cause CI failures, the warning preemptively causing CI failures seems like the lesser of two bad options.
Bias disclosure: I'm impacted negatively by the upcoming change.
I want to add that there are some versions of Linux distributions, either extended support or LTS, on which it would be very hard to install pyarrow because it doesn't get packaged for them, such as CentOS 7 and Ubuntu 18.04 LTS.
The first thought that arose was to replace pandas with another similar tool.
I dislike the process here, and I don't mean the dep warning.
- Why are you doing this? What are the pros and cons? Surely you have discussed doing this and there are pros and cons for this. Please link to that discussion. ("If it's difficult to explain, it's a bad idea")
- Why is increasing the complexity of your package the default and correct way of providing this functionality?
I appreciate the message and asking for feedback, but it went out to everyone and that will include people like me who have no idea what's going on. It is generally your business how you run your project (Thank you for your work and software), but if you do want feedback and if you do want to be inclusive, please think about how you are onboarding to this issue.
Generally, complexity is bad and changing things is bad, because there is the risk of new errors. So you are starting at a negative score in my book, and this whole thing would require a significant gain and not just a neutral tradeoff between increased size and some performance.
(I think there is a general blindness in this respect from package maintainers, because you are working with this every day and you think some increase in complexity is acceptable for [reasons] and this continues for decades and then you have a bloated mess.)
Does it have to be done this way, can't you create a new package that uses the advantages of both packages and overrides the original function? Then if people want to they can use both and it leaves the original thing untouched. Maybe put a note into the docs pointing to the optimization.
- Surely you have discussed doing this and there are pros and cons for this. Please link to that discussion.
The discussion is linked in the PDEP itself - #52711
I know this isn't super relevant to the discussion, but I want to throw this out here anyway. Sometimes, even a harmless change like displaying a DeprecationWarning can have undesired repercussions.
I teach Python courses for programming beginners, and since the 2.2.0 release I've received many questions and messages from students confused by the warning. They are left wondering if they installed pandas correctly, if they need to install something called "arrow", or whether they can continue the course at all.
Yes, I know the students should eventually get used to warning messages, and this discussion is definitely relevant to the Data Science community. But realistically, 99% of the people who will ever import pandas as pd will never come remotely close to it.
As stated previously, if pyarrow ever becomes a dependency of pandas (disregarding whether that's a good or a bad thing), the vast majority of users shouldn't even notice any difference. Everything should "just work" when they type pip install pandas. As a result, I find the decision to display a DeprecationWarning to the entire user base upon importing pandas unfortunate.
Well, I think all these contributions for the discussion end up being useful for the community as a whole.
Maybe developers may consider another approach regarding communication of deprecation:
- including major pending deprecation warnings in the changelog / release notes for every new release;
- creating some kind of verbose deprecation mode so interested developers can check and test their code's future compatibility, while keeping this level of DeprecationWarning verbosity disabled for regular users.
There is no perfect solution to deal with the current situation, but I'm positive PyArrow will bring very good benefits for Pandas in the future!
I want to follow up on #54466 (comment) from above about a pyarrow extra. The message just says that you need to have "Pyarrow". It would be better if it suggested installing pandas[feather] (or pandas[pyarrow] if feather does not just mean pyarrow). Adding transitive dependencies to a project's dependency list should be avoided if possible. From the warning message, it seems that the suggested solution is to add pyarrow to your dependency list.
Also, since the warning directs users to this issue, it would be nice if the issue description were edited to include suggestions on how to avoid it -- both whether to add pyarrow to your dependencies or use pandas[feather], and also the filterwarnings solution.
This change is making a mess in CI jobs. Suppressing the warning as suggested in #54466 (comment) is not a viable solution, and I could not even find a robust way to code "exclude Pandas versions >=2.2 AND <3" as a requirement specifier in pyproject.toml.
This is not feedback on the decision to include pyarrow as a dependency, but rather on the usage of deprecation warnings to solicit feedback. [...] Ideally deprecation warnings would be reserved for changes to the pandas public API and be raised only in the code paths affected by the change.
Even by your own logic, including a warning was the right choice. [...]
I agree that including a warning for string type inference makes sense. However, I'm not sure that the main __init__.py is the best place for this warning, because it creates noise for projects that do not depend on string type inference and therefore may not be affected by the change.
Also I understand that the warning can be suppressed by installing PyArrow. The point is that any approach to suppressing the warning requires a certain amount of knowledge and effort. I'm thinking for example of the questions that @jfaccioni-asimov gets from confused students.
When switching to pyarrow for the string dtype, it would be good if some of the existing performance issues with the string dtype were addressed beforehand. Currently (pandas 2.2.0), string[pyarrow] is the slowest option for some tasks:
import pandas as pd
import timeit

points = 1000000
data = [f"data-{n}" for n in range(points)]
for dtype in ["object", "string", "string[pyarrow]"]:
    index = pd.Index([f"index-{n}" for n in range(points)], dtype=dtype)
    df = pd.DataFrame(data, index=index, dtype=dtype)
    print(dtype)
    %timeit df.loc['index-2000']
which returns
object
9.78 µs ± 18.9 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
string
15.7 µs ± 36.5 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
string[pyarrow]
17.6 µs ± 66.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
I'm a contributor to Panel by HoloViz.
Pandas is used extensively in the HoloViz ecosystem. It's a hard requirement of Panel.
Usage in pyodide and pyscript has really benefitted us a lot. It has made our docs interactive and enabled our users to share live Python GUI applications in the browser without having to fund and manage a server.
As far as I can see Pyarrow does not work with pyodide. I.e. Pandas would no longer work in Pyodide? I.e. Panel would no longer work in Pyodide?
Thinking outside of HoloViz Panel, I believe that making Pandas unusable in Pyodide, or increasing the download time, risks undermining all the gains of Python in the browser with Pyodide and PyScript.
Thanks for asking for feedback. Thanks for Pandas.
There is ongoing work on Pyarrow support in Pyodide, for example see pyodide/pyodide#2933. If I try to use my crystal ball, my guess is that the pandas developers have this in mind. Also, even if pandas 3.0 comes out requiring Pyarrow and Pyarrow support is still not there in Pyodide, you will always be able to use older pandas versions in Pyodide, so unless you need a pandas 3.0 feature, you will be fine.
Thanks @lesteve.
- Panel might not need the newest version of Pandas. But users will also be using Pandas when they develop their data-driven applications using Pandas and Panel, and they would expect to be on a recent version of Pandas.
- And the package size of pyarrow would also increase download time in pyodide considerably.
These issues are not limited to Panel. They will limit the entire PyData ecosystem using pyodide to make their docs interactive without spending huge amounts on servers. They will also limit Streamlit (Stlite), Gradio (Gradiolite), Jupyterlite, PyScript etc. running in the browser, which is where the next 10 million Python users are expected to come from.
Are there 3 distinct arrow string types in pandas?
- "string[pyarrow_numpy]"
- "string[pyarrow]"
- pd.ArrowDtype(pa.string())
Is the default going to be string[pyarrow_numpy]? What are the differences between the 3 string datatypes and when should 1 be used over the other? Do they all perform the same because they use the same arrow memory layout and compute kernels?
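For anyone unsure what these three look like in code, a small sketch (assuming pandas >= 2.1 with pyarrow installed; not an authoritative statement on which one will become the default):
import pandas as pd
import pyarrow as pa

s1 = pd.Series(["a", "b"], dtype="string[pyarrow_numpy]")     # Arrow storage with NumPy-like semantics
s2 = pd.Series(["a", "b"], dtype="string[pyarrow]")           # nullable StringDtype backed by Arrow
s3 = pd.Series(["a", "b"], dtype=pd.ArrowDtype(pa.string()))  # full ArrowDtype, Arrow semantics throughout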
is there a way to silence this warning?
You can do it with the stdlib warnings.filterwarnings function:
>>> import warnings
>>> warnings.filterwarnings("ignore", "\nPyarrow", DeprecationWarning)
>>> import pandas
(unfortunately it currently doesn't work as a -W command line argument or pytest config option, see #57082)
If you're using pytest and the warnings are polluting your CI pipelines, you can ignore this warning by editing your pytest.ini like so:
[pytest]
filterwarnings =
ignore:\nPyarrow:DeprecationWarning
FYI, I've added the pyarrow dep on 2024-01-20 to the Gentoo ebuild and requested testing on the architectures we support. So far it's looking grim: no success on ARM, AArch64, PowerPC, X86. I feel like I'm now being made responsible for fixing Arrow, that doesn't seem to be very portable in itself.
Arrow, that doesn't seem to be very portable in itself.
We build arrow and run the test suite successfully on all the mentioned architecture in conda-forge, though admittedly the stack of dependencies is pretty involved (grpc, protobuf, the major cloud SDKs, etc.). Feel free to check out our recipe if you need some inspiration, or open an issue on the feedstock if you have some questions.
Dear maintainers and core devs,
thank you for making Pandas available to the community. Since you ask for feedback, here's my humble opinion.
As a longtime user and developer of open-source libraries which depend on Pandas, I mostly deal with (possibly) large DataFrames with homogeneous dtype (np.float64), and I treat them (for the most part) as wrappers around the corresponding NumPy 2-dimensional arrays. The reason I use Pandas DataFrames as opposed to plain NumPy arrays is that I find Pandas indexing capabilities to be its "killer" feature: it's much safer from my point of view to keep track of indexing in Pandas rather than NumPy, especially when considering Datetime indexes or multi-indexes. The same applies to Series and NumPy 1-dimensional arrays.
I have no objections to using Arrow as back-end to store string, object dtypes, or in general non-homogeneous dtype Dataframes.
I would like, however, to hear whether you plan to switch away from NumPy as one of the core back-ends (in my use cases, the most important one). This is relevant for various reasons, including memory management. It would be great to know whether, in the future, one will have to worry that manipulating large 2-dimensional NumPy arrays of floats by casting them as DataFrames will involve a conversion into Arrow and back to NumPy (if I then want them back as such). That would be very problematic, since it involves a whole new layer of complexity.
Thanks,
Enzo