Support Apache Arrow client
Arrow is primarily an in-memory columnar data format, but it also defines ways for applications to exchange such data efficiently. Many common data processing and analysis tools (e.g. Polars, DuckDB) can work directly on Arrow data.
At our company, we are seeing a pattern emerge where data scientists and analysts first generate modest amounts of data (hundreds of MB to tens of GB) with heavy Trino queries, and then load that into something like Polars or DuckDB for further analysis. We would like to better support this pattern. One current pain point is that users must convert the row-oriented Trino data into Arrow data. Arrow provides utilities for this, but the conversion is not very efficient.
There are a few ways Trino could return Arrow data to users. One would be to implement the Arrow Flight (and higher-level Arrow Flight SQL) gRPC protocol. This seems like it could be a pretty big lift, and would likely require some extra up-front design. It may not be possible to support an acceptable subset of Trino's features on the existing Arrow Flight protocol. In particular, fetching data in Arrow Flight goes through a gRPC endpoint, which doesn't fit well with Trino's spooling protocol.
An easier way might be to implement a Java ADBC connector for Trino. ADBC is just a Java interface and doesn't require any specific protocol. There are two ways we could implement an ADBC connector. We could modify the direct protocol to return data in the Arrow wire format, or we could enable Trino to spool data in the Arrow IPC format. Of course, these aren't mutually exclusive.
Is an Arrow-based Trino client desirable? If so, how should it be implemented?
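To make the conversion step concrete, here is a minimal sketch of the row-to-column pivot a client has to do today before the data is usable as columnar input. It is pure Python and does not call any Arrow library; the function name `rows_to_columns` and the sample rows are hypothetical, and real clients would hand the resulting columns to an Arrow builder rather than keep Python lists.

```python
from typing import Any, Dict, List, Sequence


def rows_to_columns(names: Sequence[str],
                    rows: Sequence[Sequence[Any]]) -> Dict[str, List[Any]]:
    """Pivot row-oriented results into one list per column.

    Assumes column names are unique; every value is touched once,
    which is the per-cell cost the issue is describing.
    """
    columns: Dict[str, List[Any]] = {name: [] for name in names}
    for row in rows:
        if len(row) != len(names):
            raise ValueError("row width does not match the column list")
        for name, value in zip(names, row):
            columns[name].append(value)
    return columns


# Hypothetical result set, shaped like rows coming back from a query.
names = ["region", "orders"]
rows = [["eu", 10], ["us", 25], ["apac", 7]]
columns = rows_to_columns(names, rows)
```

If the server produced Arrow data directly, this per-cell loop would disappear from the client.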
Duplicate of #18038, you could comment there if you have a different and/or more specific use case.
i think this is not exactly a duplicate, but i can move discussion there if you prefer
this ticket is more generally about having some kind of arrow-based client for trino, whereas #18038 is specifically about arrow flight. it's also not clear to me whether #18038 is about creating an arrow flight connector for trino so it can fetch data from an arrow flight server, or implementing the arrow flight endpoints in trino so it can serve as an arrow flight server.
Thanks, I see the distinction now. It might actually be a much smaller effort to use Arrow instead of JSON as the transport format than to implement a client compatible with Arrow Flight SQL.
I think the biggest issue with Arrow is how to handle Trino data types that don't map to Arrow types. This would require special logic on the client side. We should enumerate such types and check whether they invalidate the whole point of using Arrow, that is, whether it would be simpler to convert all data on the client side.
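One way to start that enumeration is a small lookup that reports which Trino types have a direct Arrow counterpart. The sketch below is only an illustration: the Arrow type names are real Arrow types, but the mapping table, the function name `arrow_type_for`, and the decision to treat every `with time zone` type as unmapped are assumptions, not the behavior of any existing Trino or Arrow code. It also covers only a handful of types.

```python
import re
from typing import Optional

# Assumed one-to-one mappings for a few common Trino types.
SIMPLE_TYPES = {
    "boolean": "bool",
    "tinyint": "int8",
    "smallint": "int16",
    "integer": "int32",
    "bigint": "int64",
    "real": "float32",
    "double": "float64",
    "varchar": "utf8",
    "varbinary": "binary",
    "date": "date32",
}

ARROW_MAX_PRECISION = 9  # Arrow temporal types stop at nanoseconds
_TEMPORAL = re.compile(r"^(time|timestamp)\((\d+)\)( with time zone)?$")


def arrow_type_for(trino_type: str) -> Optional[str]:
    """Return an Arrow type name, or None when there is no direct mapping."""
    name = trino_type.strip().lower()
    if name in SIMPLE_TYPES:
        return SIMPLE_TYPES[name]
    match = _TEMPORAL.match(name)
    if match is None:
        return None  # e.g. hyperloglog, or anything not listed above
    kind, precision, with_zone = match.group(1), int(match.group(2)), match.group(3)
    if with_zone or precision > ARROW_MAX_PRECISION:
        return None  # per-row zones and picoseconds need special handling
    if precision == 0:
        unit = "s"
    elif precision <= 3:
        unit = "ms"
    elif precision <= 6:
        unit = "us"
    else:
        unit = "ns"
    if kind == "timestamp":
        return "timestamp[" + unit + "]"
    # Arrow splits time-of-day into a 32-bit and a 64-bit type.
    return ("time32[" if unit in ("s", "ms") else "time64[") + unit + "]"


unmapped = [t for t in ["bigint", "timestamp(12)", "hyperloglog"]
            if arrow_type_for(t) is None]
```

Running the lookup over the full Trino type list would give the set of types that need client-side logic.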
Recent work on #22271 improved fetching data in the clients, so it should be possible now to benchmark the Arrow conversion done on both the client and server side.
On Slack you mentioned you already tried implementing an ADBC client. Do you have any code you could share?
i agree that no matter what protocol is used, type conversion will be tricky. i took a pass at converting trino types to arrow types, and mostly they map well. the gaps are (unsurprisingly) around time and timestamp types. arrow only supports times up to nanosecond precision, and only supports time zones scoped to a file or block of data, not to the row like trino does.
i handled these in my draft implementation by using arrow extension types. arrow extension types are a way to mark some arrow physical type as representing a nonstandard logical type. in this case, i encoded picosecond-precision times as a struct with a nano timestamp and pico adjustment, and times with time zone as a struct with a time and zone offset.
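For readers following along, the two struct encodings described above could look roughly like this. It is a sketch of the idea only, not the draft implementation: the field names (`epoch_nanos`, `pico_adjustment`, `nanos_of_day`, `offset_minutes`), the choice of minutes for the offset, and the helper functions are all assumptions, and it uses plain Python tuples rather than Arrow struct arrays.

```python
from typing import NamedTuple

PICOS_PER_NANO = 1000


class PicoTimestamp(NamedTuple):
    """Picosecond timestamp as (nanosecond timestamp, picosecond remainder)."""
    epoch_nanos: int
    pico_adjustment: int  # always in 0..999


class TimeWithOffset(NamedTuple):
    """Time of day plus a per-row zone offset (fields are assumed)."""
    nanos_of_day: int
    offset_minutes: int


def split_picos(epoch_picos: int) -> PicoTimestamp:
    # divmod floors, so the adjustment stays non-negative even before the epoch.
    nanos, adjustment = divmod(epoch_picos, PICOS_PER_NANO)
    return PicoTimestamp(nanos, adjustment)


def join_picos(value: PicoTimestamp) -> int:
    # Inverse of split_picos: rebuild the full picosecond count.
    return value.epoch_nanos * PICOS_PER_NANO + value.pico_adjustment


encoded = split_picos(1_234_567_890_123)
restored = join_picos(encoded)
# 09:30 at UTC+05:30, keeping the offset on the row itself.
half_past_nine = TimeWithOffset(nanos_of_day=34_200 * 10**9, offset_minutes=330)
```

A client that only understands the physical struct still gets a usable nanosecond timestamp from the first field, which is one argument for this layout.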
sure, i can make a draft PR. note, it's very much a draft, untested and in progress.
edit: oh, also hyperloglog, which i didn't know enough about to attempt. will have to figure out what to do for this.
@dysn maybe it is worth checking whether arrow itself could support these problematic types?
that's a good point, since migrating from extension types to natively supported arrow types would be a breaking change for clients. on the other hand, arrow is a big project, so adding new types may be slow.
would you prefer that i open an issue with arrow? or would it be better for a core trino contributor to start that discussion? FWIW, i think opening up that line of communication is worthwhile, especially if trino ever wants to support arrow flight.
I am not a Trino maintainer or anything like that, but I think it is worth opening an issue with Arrow to see whether adding these types would be acceptable.
At the same time, we can keep this conversation going in Trino and start the work here. If there is progress on the Arrow side, that will be great and will make things simpler.
But it is worth hearing more from the Trino maintainers, as they have more experience integrating with other open-source projects and know more about the Trino core.