apache/datafusion-ballista

[EPIC] Add support for Substrait

andygrove opened this issue · 1 comments

[EDIT: Updated this on 2/25/23]

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

The Substrait standard is gaining adoption and I would like to add support to Ballista. There are three different areas where we could potentially support Substrait:

  • ExecuteQueryParams currently accepts either LogicalPlan or a SQL string. We could add Substrait here as well, represented as a byte array. This would allow clients such as Ibis to submit queries directly to Ballista's gRPC service.
  • The executor currently receives tasks containing DataFusion physical plans. These plans could be serialized to Substrait and passed to other execution engines, such as DuckDB, Polars, and cuDF, making Ballista a general-purpose distributed query scheduler.
  • We currently use a proprietary protobuf format for representing plans. We could adopt Substrait here as well, or maybe just add a wrapper for Substrait plans.
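
To make the first option more concrete, here is a minimal sketch of how the query payload could be modeled once Substrait is accepted alongside the existing two inputs. This is not Ballista's generated protobuf code: `QueryPayload`, `PlanSource`, and `classify` are hypothetical names used only for illustration, and the Substrait plan is assumed to arrive as opaque serialized bytes.

```rust
/// Hypothetical stand-in for the query field of `ExecuteQueryParams`.
/// The first two variants mirror the inputs accepted today; the third
/// is the proposed addition (a serialized Substrait plan as raw bytes).
#[derive(Debug, Clone, PartialEq)]
enum QueryPayload {
    LogicalPlan(Vec<u8>),
    Sql(String),
    SubstraitPlan(Vec<u8>),
}

/// Hypothetical tag telling the scheduler which decoder to use.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PlanSource {
    DataFusionProto,
    SqlText,
    Substrait,
}

/// Pick a decoding path for the payload, rejecting empty input.
/// Actual decoding (DataFusion protobuf, SQL planning, or a Substrait
/// consumer) is intentionally left out of this sketch.
fn classify(payload: &QueryPayload) -> Result<PlanSource, String> {
    match payload {
        QueryPayload::LogicalPlan(bytes) if !bytes.is_empty() => Ok(PlanSource::DataFusionProto),
        QueryPayload::Sql(text) if !text.trim().is_empty() => Ok(PlanSource::SqlText),
        QueryPayload::SubstraitPlan(bytes) if !bytes.is_empty() => Ok(PlanSource::Substrait),
        _ => Err("query payload is empty".to_string()),
    }
}
```

A client such as Ibis would populate only the Substrait variant, and the scheduler would branch on the result of `classify` before planning. Keeping the plan as opaque bytes at this layer means the gRPC message would not need to depend on Substrait's own protobuf definitions.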

Original description:

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Ballista (and DataFusion) has a proprietary protobuf-based format for serializing query plans. This ties Ballista to DataFusion and makes it hard to use other query engines and/or compute kernels.

Describe the solution you'd like
There is now an emerging standard for query plan serialization at https://substrait.io/, which is also protobuf-based. It would be good to move towards this over time.

Describe alternatives you've considered
None

Additional context
None

Substrait support is now in DataFusion, so I plan on working on this soon.