laughingman7743/PyAthena

Impl Polars cursor

laughingman7743 opened this issue · 7 comments

A Polars cursor would be a godsend!

Polars uses Arrow as its memory representation, so, as I understand it, supporting Polars in PyAthena is mostly just a syntactic shortcut, right? Polars' documentation for the from_arrow function says:

This operation will be zero copy for the most part. Types that are not supported by Polars may be cast to the closest supported type.

So except for that note about unsupported types, the following code should have basically no overhead already today:

import polars as pl
import pyathena
from pyathena.arrow.cursor import ArrowCursor

cursor = pyathena.connect(
    s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
    region_name="us-west-2",
    cursor_class=ArrowCursor,
).cursor()

# This should be zero-copy most of the time
polars_df = pl.from_arrow(cursor.execute("SELECT * FROM many_rows").as_arrow())

I actually tried out PyAthena → Arrow → Polars in this fashion the other day, so I can at least confirm that it works: it populates a usable Polars DataFrame. I didn't verify anything about copying or performance overhead.
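To make the "syntactic shortcut" idea concrete, here is a minimal sketch of what such a helper might look like. `query_to_polars` and its `converter` parameter are hypothetical names, not part of PyAthena; the only real calls assumed are `cursor.execute(...).as_arrow()` on an ArrowCursor and `pl.from_arrow`, both taken from the snippet above.

```python
def query_to_polars(cursor, sql, converter=None):
    """Run `sql` on an Arrow-capable cursor and return a Polars DataFrame.

    Hypothetical helper, not a PyAthena API. `cursor` is assumed to be a
    pyathena.arrow.cursor.ArrowCursor (anything whose execute() result has
    an as_arrow() method will do).
    """
    if converter is None:
        # Import lazily so Polars is only required when actually used.
        import polars as pl

        converter = pl.from_arrow

    # ArrowCursor.execute() returns the cursor itself, so as_arrow() can be
    # chained to get a pyarrow.Table of the result set.
    arrow_table = cursor.execute(sql).as_arrow()

    # Per the Polars docs quoted above, this is zero copy for the most part;
    # unsupported types may be cast to the closest supported type.
    return converter(arrow_table)


# Usage, assuming `cursor` was created as in the snippet above:
#   polars_df = query_to_polars(cursor, "SELECT * FROM many_rows")
```

The `converter` argument is only there so the conversion step can be swapped out or stubbed; by default it falls back to `pl.from_arrow`.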

Hi, may I ask which version of pyarrow you are using? I get an error with version 15.0.0:

OperationalError: When reading information for key 'test/670672a1-dab2-4635-ba3b-1c6a16dc0b6f.csv' in bucket '{BucketName}': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached

Or does another explanation for this error come to mind? Thank you.

This was many months ago, and I no longer recall which version I was using. But your error appears to be a network connectivity problem; it says so right there in the message.

Yes, I totally agree, but it's cryptic to me because it works with another cursor (PandasCursor, for example).

FAILED tests/pyathena/arrow/test_cursor.py::TestArrowCursor::test_executemany[arrow_cursor1] - pyathena.error.OperationalError: When getting information for key 'tmp/unload/20240316/a580fb77-99b1-49c8-8f70-cc3eaf663089' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_cursor.py::TestArrowCursor::test_fetchall[arrow_cursor0] - pyathena.error.OperationalError: When reading information for key 'tmp/93571118-03bb-4b01-9772-4b1f99dc9f61.csv' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_cursor.py::TestArrowCursor::test_executemany_fetch[arrow_cursor1] - pyathena.error.OperationalError: When getting information for key 'tmp/unload/20240316/f7a80ee7-26c4-4103-bf49-c94b29c6eea0' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_async_cursor.py::TestAsyncArrowCursor::test_fetchall[async_arrow_cursor0] - pyathena.error.OperationalError: When reading information for key 'tmp/573dd21d-a3fe-4bbe-a7b5-aa1807dfd2a6.csv' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_cursor.py::TestArrowCursor::test_complex_unload_as_arrow[arrow_cursor0] - pyathena.error.OperationalError: When getting information for key 'tmp/unload/20240316/f23455fe-6929-439d-864b-d52b55b7be7a' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_cursor.py::TestArrowCursor::test_iterator[arrow_cursor0] - pyathena.error.OperationalError: When reading information for key 'tmp/b94d398d-1cac-45b2-b5d3-9210897b6d5f.csv' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_async_cursor.py::TestAsyncArrowCursor::test_fetchall[async_arrow_cursor1] - pyathena.error.OperationalError: When getting information for key 'tmp/unload/20240316/c823bf59-cd6b-4e0c-9600-690de08d3f18' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_async_cursor.py::TestAsyncArrowCursor::test_iterator[async_arrow_cursor1] - pyathena.error.OperationalError: When getting information for key 'tmp/unload/20240316/c3433f82-a070-4044-be85-1d18786d1311' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_cursor.py::TestArrowCursor::test_arraysize[arrow_cursor1] - pyathena.error.OperationalError: When getting information for key 'tmp/unload/20240316/64c7bec3-8281-4807-ae95-fb19ca8d0159' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_cursor.py::TestArrowCursor::test_iceberg_table - pyathena.error.OperationalError: When reading information for key 'tmp/bbf31c91-e845-4ccf-8b8b-588147fcf4e7.csv' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_async_cursor.py::TestAsyncArrowCursor::test_arraysize[async_arrow_cursor0] - pyathena.error.OperationalError: When reading information for key 'tmp/76e1c3c4-ed18-4a18-ad6f-3fe9dc5db8a1.csv' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_async_cursor.py::TestAsyncArrowCursor::test_arraysize[async_arrow_cursor1] - pyathena.error.OperationalError: When getting information for key 'tmp/unload/20240316/4f059472-9026-49ac-892f-cfd47e5eac81' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_cursor.py::TestArrowCursor::test_complex_unload[arrow_cursor0] - pyathena.error.OperationalError: When getting information for key 'tmp/unload/20240316/170fbcee-1269-4367-8d3e-b8e43f838b79' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_async_cursor.py::TestAsyncArrowCursor::test_description[async_arrow_cursor1] - pyathena.error.OperationalError: When getting information for key 'tmp/unload/20240316/baf70311-3540-4dce-ae43-8aecc19566b1' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_async_cursor.py::TestAsyncArrowCursor::test_query_execution[async_arrow_cursor0] - pyathena.error.OperationalError: When reading information for key 'tmp/0723649b-bb5c-450b-baa2-d52ea3d8a7aa.csv' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached

An error occurred when I ran the tests in my local environment. 🤔
It does not occur in GitHub Actions.
#520
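In the failure list above, keys ending in .csv appear with "When reading information" and keys under tmp/unload appear with "When getting information", and both kinds end in the same HeadObject timeout. A rough sketch for tallying such a pytest summary follows. The regex, the `summarize` function, and the "csv"/"unload" labels are my own illustration rather than anything from PyAthena, and the two sample lines are copied from the output above.

```python
import re
from collections import Counter

# Matches the "FAILED <test id> - <exception>: When ... for key '...' in
# bucket '...': <detail>" shape seen in the pytest summary above.
FAILURE_RE = re.compile(
    r"^FAILED (?P<test>\S+) - (?P<exc>\S+): "
    r"When (?:reading|getting) information for key '(?P<key>[^']+)' "
    r"in bucket '(?P<bucket>[^']+)': (?P<detail>.*)$"
)


def summarize(lines):
    """Count failures per (result kind, error detail); skip other lines."""
    counts = Counter()
    for line in lines:
        match = FAILURE_RE.match(line.strip())
        if match is None:
            continue
        # Illustrative labels: plain query results are .csv objects,
        # everything else here lives under the tmp/unload prefix.
        kind = "csv" if match["key"].endswith(".csv") else "unload"
        counts[(kind, match["detail"])] += 1
    return counts


SAMPLE = [
    "FAILED tests/pyathena/arrow/test_cursor.py::TestArrowCursor::test_executemany[arrow_cursor1]"
    " - pyathena.error.OperationalError: When getting information for key"
    " 'tmp/unload/20240316/a580fb77-99b1-49c8-8f70-cc3eaf663089' in bucket 'laughingman7743-athena':"
    " AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached",
    "FAILED tests/pyathena/arrow/test_cursor.py::TestArrowCursor::test_fetchall[arrow_cursor0]"
    " - pyathena.error.OperationalError: When reading information for key"
    " 'tmp/93571118-03bb-4b01-9772-4b1f99dc9f61.csv' in bucket 'laughingman7743-athena':"
    " AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached",
]

if __name__ == "__main__":
    for (kind, detail), n in sorted(summarize(SAMPLE).items()):
        print(f"{n} x {kind}: {detail}")
```

If every failure collapses into the same detail string for both kinds, that would suggest the problem sits in the S3 access path from the local machine rather than in any single test.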