pytorch/torcharrow

Why does torcharrow use Velox instead of Acero (Arrow Compute)?

UranusSeven opened this issue · 5 comments

Hi guys,

Since TorchArrow uses the Arrow in-memory format to achieve zero-copy for external readers, I assume using Acero would be more intuitive, and the conversion from the Velox format to the Arrow format could be avoided. So I am quite curious why you chose Velox instead of Acero as the CPU backend?

Thanks!

The conversion between Arrow and Velox is also zero-copy (a rough sketch of what that looks like is below, after the quote). Velox is also collaborating with the Arrow community and Voltron Data, as announced in https://www.globenewswire.com/news-release/2022/06/23/2468271/0/en/Voltron-Data-Announces-Commitment-to-Improve-Interoperability-Between-Apache-Arrow-and-the-Velox-Open-Source-Project-Created-by-Meta.html

Quoted

... this collaboration will enable Velox to be a ubiquitous data processing workhorse for the coming generation of large-scale Arrow-enabled systems.

cc @pedroerp
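
To make the zero-copy point more concrete: the exchange goes through the Arrow C Data Interface structs (ArrowSchema / ArrowArray), so buffers are shared rather than duplicated. Roughly something along these lines; the function names here are approximate, check velox/vector/arrow/Bridge.h for the exact current API:

```cpp
// Approximate sketch of a Velox <-> Arrow round trip through the
// Arrow C Data Interface; names follow velox/vector/arrow/Bridge.h
// but may not match the current signatures exactly.
#include "velox/vector/arrow/Bridge.h"

using namespace facebook::velox;

void roundTrip(const VectorPtr& veloxVector, memory::MemoryPool* pool) {
  ArrowSchema schema;
  ArrowArray array;

  // Export: the Arrow structs end up pointing at the existing Velox
  // buffers, so no data is duplicated.
  exportToArrow(veloxVector, schema);
  exportToArrow(veloxVector, array, pool);

  // Import: wrap the same Arrow buffers back as a Velox vector,
  // again without copying.
  VectorPtr imported = importFromArrowAsViewer(schema, array, pool);
}
```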

Thanks, but I'm actually curious about the reasons you chose Velox instead of Arrow Compute. Was it for better performance, for its extensibility, or for other reasons?

Outside of performance, Velox provides a much richer feature set. It is also the library into which all other data systems at Meta are being consolidated, so there's a strong argument for consistency with the code executed in engines like Presto, Spark, stream processing systems, and more.

As a more concrete example, all these engines can now provide the same set of functions (UDFs) to users, with consistent signatures and semantics (which is not possible if their implementations are siloed).
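
For illustration, a scalar UDF is written once against Velox's simple-function API and registered under a name; every engine that links that registration then exposes the same function with the same signature and null semantics. A rough sketch, based on the public scalar-function guide (the "safe_divide" name is just an example, and header paths and template details may differ in the current tree):

```cpp
// Sketch of a Velox simple (scalar) function, written once and shared
// across every Velox-based engine that registers it.
#include "velox/functions/Macros.h"
#include "velox/functions/Registerer.h"

namespace facebook::velox {

template <typename T>
struct SafeDivideFunction {
  VELOX_DEFINE_FUNCTION_TYPES(T);

  // Returning false signals a NULL result (here: division by zero).
  FOLLY_ALWAYS_INLINE bool call(double& result, const double& a, const double& b) {
    if (b == 0.0) {
      return false;
    }
    result = a / b;
    return true;
  }
};

void registerExampleFunctions() {
  // Each engine that calls this registration exposes an identical
  // "safe_divide" function to its users.
  registerFunction<SafeDivideFunction, double, double, double>({"safe_divide"});
}

} // namespace facebook::velox
```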

Thanks. Velox's feature set and its extensibility are what we are looking for, but performance is also quite important for us, so I'm running TPC-H on Velox and Arrow Compute this week.

May I ask if there are any available official performance reports?

Sorry about the delay. We have been running many ad-hoc experiments with TPC-H and TPC-DS and results are looking very promising, but we don't have official numbers to publish yet (of course it also depends on the system we're comparing against).

Happy to give you more context on some of this offline.