Impl Polars cursor
laughingman7743 opened this issue · 7 comments
A Polars cursor would be a godsend!
Polars uses Arrow as its memory representation, so, as I understand it, supporting Polars in PyAthena is mostly just a syntactic shortcut, right? Polars' documentation for the from_arrow method says:
This operation will be zero copy for the most part. Types that are not supported by Polars may be cast to the closest supported type.
So except for that note about unsupported types, the following code should have basically no overhead already today:
import polars as pl
import pyathena
from pyathena.arrow.cursor import ArrowCursor

cursor = pyathena.connect(
    s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
    region_name="us-west-2",
    cursor_class=ArrowCursor,
).cursor()

# This should be zero-copy most of the time
polars_df = pl.from_arrow(cursor.execute("SELECT * FROM many_rows").as_arrow())
I actually tried out PyAthena → Arrow → Polars in this fashion the other day, so I can at least confirm that it is functional: it populates a working Polars DataFrame. I didn't verify anything about copying or performance overhead.
Hi, may I ask which version of pyarrow you are using? I get an error with version 15.0.0:
OperationalError: When reading information for key 'test/670672a1-dab2-4635-ba3b-1c6a16dc0b6f.csv' in bucket '{BucketName}': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
Or, if anything else comes to mind that could explain this error, please let me know. Thank you.
This was many months ago, and I no longer recall which version I was using. But your error is, to all appearances, a network connectivity problem; it says so right there in the message.
Yes, I totally agree, but it's cryptic to me, since it works with other cursors (PandasCursor, for example).
FAILED tests/pyathena/arrow/test_cursor.py::TestArrowCursor::test_executemany[arrow_cursor1] - pyathena.error.OperationalError: When getting information for key 'tmp/unload/20240316/a580fb77-99b1-49c8-8f70-cc3eaf663089' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_cursor.py::TestArrowCursor::test_fetchall[arrow_cursor0] - pyathena.error.OperationalError: When reading information for key 'tmp/93571118-03bb-4b01-9772-4b1f99dc9f61.csv' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_cursor.py::TestArrowCursor::test_executemany_fetch[arrow_cursor1] - pyathena.error.OperationalError: When getting information for key 'tmp/unload/20240316/f7a80ee7-26c4-4103-bf49-c94b29c6eea0' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_async_cursor.py::TestAsyncArrowCursor::test_fetchall[async_arrow_cursor0] - pyathena.error.OperationalError: When reading information for key 'tmp/573dd21d-a3fe-4bbe-a7b5-aa1807dfd2a6.csv' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_cursor.py::TestArrowCursor::test_complex_unload_as_arrow[arrow_cursor0] - pyathena.error.OperationalError: When getting information for key 'tmp/unload/20240316/f23455fe-6929-439d-864b-d52b55b7be7a' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_cursor.py::TestArrowCursor::test_iterator[arrow_cursor0] - pyathena.error.OperationalError: When reading information for key 'tmp/b94d398d-1cac-45b2-b5d3-9210897b6d5f.csv' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_async_cursor.py::TestAsyncArrowCursor::test_fetchall[async_arrow_cursor1] - pyathena.error.OperationalError: When getting information for key 'tmp/unload/20240316/c823bf59-cd6b-4e0c-9600-690de08d3f18' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_async_cursor.py::TestAsyncArrowCursor::test_iterator[async_arrow_cursor1] - pyathena.error.OperationalError: When getting information for key 'tmp/unload/20240316/c3433f82-a070-4044-be85-1d18786d1311' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_cursor.py::TestArrowCursor::test_arraysize[arrow_cursor1] - pyathena.error.OperationalError: When getting information for key 'tmp/unload/20240316/64c7bec3-8281-4807-ae95-fb19ca8d0159' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_cursor.py::TestArrowCursor::test_iceberg_table - pyathena.error.OperationalError: When reading information for key 'tmp/bbf31c91-e845-4ccf-8b8b-588147fcf4e7.csv' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_async_cursor.py::TestAsyncArrowCursor::test_arraysize[async_arrow_cursor0] - pyathena.error.OperationalError: When reading information for key 'tmp/76e1c3c4-ed18-4a18-ad6f-3fe9dc5db8a1.csv' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_async_cursor.py::TestAsyncArrowCursor::test_arraysize[async_arrow_cursor1] - pyathena.error.OperationalError: When getting information for key 'tmp/unload/20240316/4f059472-9026-49ac-892f-cfd47e5eac81' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_cursor.py::TestArrowCursor::test_complex_unload[arrow_cursor0] - pyathena.error.OperationalError: When getting information for key 'tmp/unload/20240316/170fbcee-1269-4367-8d3e-b8e43f838b79' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_async_cursor.py::TestAsyncArrowCursor::test_description[async_arrow_cursor1] - pyathena.error.OperationalError: When getting information for key 'tmp/unload/20240316/baf70311-3540-4dce-ae43-8aecc19566b1' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_async_cursor.py::TestAsyncArrowCursor::test_query_execution[async_arrow_cursor0] - pyathena.error.OperationalError: When reading information for key 'tmp/0723649b-bb5c-450b-baa2-d52ea3d8a7aa.csv' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
The same error occurred when I ran the tests in my local environment. 🤔
It does not occur in GitHub Actions.
#520