support transform expression
Closed this issue · 6 comments
We need the transform expression to support partition write.
At high level, we need interface like:
fn partition_records(record_batch,partition_spec) -> HashMap<StructValue,RecordBatch>;
and for implementation, it has the process to compute the transform looks like:
fn partition_records(record_batch,partition_spec) -> HashMap<StructValue,RecordBatch> {
...
for partition_field in partition_spec {
// 1. Get the transform expression
let expr = transform_expr(partition_field.transform);
// 2. Get source column
let column = record_batch.column[partition_field.source_field];
// 3. Transform compute
let res_column = expr.eval(column);
}
...
}
So we need the expression interface and implement it for kinds of transform, like
trait Expr {
fn eval(&self,array: ArrayRef) -> ArrayRef
}
impl Expr for Identity {
fn eval(&self,array: ArrayRef) {
array
}
}
...
For the implementation of expression, we can make use of the compute module of arrow. I'm taking investiage for it.
Transform types are limited, I prefer to implement as an enum instead.
I think using a trait would be better since transform has different input/output data type combinations, for example, bucket
and truncate
may have different types. But the name Expression
maybe too general, I think sth like TransformFunction
would be better.
BTW, arrow has computing kernel, which maybe reused when we implement this functions: https://docs.rs/arrow-arith/44.0.0/arrow_arith/aggregate/index.html
We can use pub type ArrayRef = Arc<dyn Array>;
from arrow so that it will not has the different type problem.
/// trait
trait TransformFunction {
fn transform(batch:&ArrayRef) -> &ArrayRef
}
I prefer trait because so that it can make the transform implementation be more maintainable. And seems it will not add too much complexity to code compare to enum.🤔
I think sth like
TransformFunction
would be better.
LGTM
I prefer trait because so that it can make the transform implementation be more maintainable. And seems it will not add too much complexity to code compare to enum.
Cool, let's do this.
All transform function have been implemented.