icelake-io/icelake

support transform expression

Closed this issue · 6 comments

We need transform expressions to support partitioned writes.

At a high level, we need an interface like:

fn partition_records(record_batch, partition_spec) -> HashMap<StructValue, RecordBatch>;

The implementation computes the transform for each partition field, roughly like this:

fn partition_records(record_batch, partition_spec) -> HashMap<StructValue, RecordBatch> {
    ...
    for partition_field in partition_spec {
        // 1. Get the transform expression 
        let expr = transform_expr(partition_field.transform);
        // 2. Get source column 
        let column = record_batch.column[partition_field.source_field];
        // 3. Transform compute 
        let res_column = expr.eval(column);
    }
    ...
}
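The loop above can be sketched with the standard library only. This is a minimal illustration, not the icelake implementation: `Column`, `PartitionField`, `partition_rows`, and `truncate_10` are hypothetical names, plain `Vec<i64>` columns stand in for Arrow arrays, and the function returns row indices per partition key rather than a `RecordBatch` per `StructValue`.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an Arrow column: a plain vector of i64 values.
type Column = Vec<i64>;

/// Hypothetical stand-in for one partition field: which source column to
/// read and which transform to apply to it.
struct PartitionField {
    source_field: usize,
    transform: fn(i64) -> i64,
}

/// Groups row indices by partition key (one transformed value per partition
/// field). A real implementation would build a RecordBatch per key instead.
fn partition_rows(columns: &[Column], spec: &[PartitionField]) -> HashMap<Vec<i64>, Vec<usize>> {
    let num_rows = columns.first().map_or(0, |c| c.len());

    // Steps 1-3 from the issue: for each partition field, take its source
    // column and evaluate the transform over every value.
    let transformed: Vec<Column> = spec
        .iter()
        .map(|field| {
            columns[field.source_field]
                .iter()
                .map(|&v| (field.transform)(v))
                .collect::<Column>()
        })
        .collect();

    // Build the key for each row and collect the rows that share it.
    let mut groups: HashMap<Vec<i64>, Vec<usize>> = HashMap::new();
    for row in 0..num_rows {
        let key: Vec<i64> = transformed.iter().map(|col| col[row]).collect();
        groups.entry(key).or_default().push(row);
    }
    groups
}

/// Example transform (illustrative): round down to a multiple of 10.
fn truncate_10(v: i64) -> i64 {
    v - v.rem_euclid(10)
}
```

Calling `partition_rows` with one field that applies `truncate_10` to column 0 puts rows whose values fall in the same block of ten under the same key.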

So we need an expression interface, implemented for each kind of transform, like:

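use arrow::array::ArrayRef;

trait Expr {
    fn eval(&self, array: ArrayRef) -> ArrayRef;
}

struct Identity;

impl Expr for Identity {
    // Identity returns the input column unchanged.
    fn eval(&self, array: ArrayRef) -> ArrayRef {
        array
    }
}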

...

For the implementation of the expressions, we can make use of arrow's compute module. I'm investigating it.

Xuanwo commented

The set of transform types is limited, so I'd prefer to implement them as an enum instead.
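For comparison, a minimal sketch of what the enum approach could look like. The `Transform` enum, its variants, and the `&[i64]` column type are assumptions made for illustration; nothing here is taken from the icelake code base.

```rust
/// Hypothetical enum covering two transforms; the variant set is illustrative.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Transform {
    /// Returns the column unchanged.
    Identity,
    /// Rounds each integer down to a multiple of the given width.
    Truncate(i64),
}

impl Transform {
    /// Applies the transform to every value of a simplified integer column.
    fn eval(&self, column: &[i64]) -> Vec<i64> {
        match *self {
            Transform::Identity => column.to_vec(),
            Transform::Truncate(width) => column
                .iter()
                .map(|&v| v - v.rem_euclid(width))
                .collect(),
        }
    }
}
```

With an enum, all transforms are dispatched through a single `match`, so adding a transform means touching that one function rather than adding a new type.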

I think using a trait would be better, since transforms have different input/output data type combinations; for example, bucket and truncate may have different types. But the name Expression may be too general; I think sth like TransformFunction would be better.
BTW, arrow has compute kernels, which may be reused when we implement these functions: https://docs.rs/arrow-arith/44.0.0/arrow_arith/aggregate/index.html

We can use pub type ArrayRef = Arc<dyn Array>; from arrow, so the differing input/output types will not be a problem.

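/// Trait implemented by each transform.
trait TransformFunction {
    // Takes and returns a type-erased array, so input and output types may differ.
    fn transform(&self, batch: ArrayRef) -> ArrayRef;
}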

I prefer a trait because it makes the transform implementations more maintainable. And it seems it will not add much complexity to the code compared to an enum.🤔
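A minimal sketch of how the trait handles differing input and output types, assuming a type-erased column. The `Column` enum is a hypothetical stand-in for arrow's ArrayRef, and `Hour`, the `Result<Column, String>` return type, and the error message are illustrative choices rather than the actual icelake API.

```rust
/// Hypothetical stand-in for Arrow's type-erased ArrayRef: one enum that can
/// hold columns of different element types.
#[derive(Debug, Clone, PartialEq)]
enum Column {
    Int32(Vec<i32>),
    TimestampMicros(Vec<i64>),
}

/// Trait in the spirit of the proposal; each transform is its own type.
trait TransformFunction {
    fn transform(&self, input: Column) -> Result<Column, String>;
}

/// Returns the input column unchanged, whatever its type.
struct Identity;

impl TransformFunction for Identity {
    fn transform(&self, input: Column) -> Result<Column, String> {
        Ok(input)
    }
}

/// Maps timestamps (microseconds since epoch) to whole hours since epoch,
/// so the output type (Int32) differs from the input type.
struct Hour;

impl TransformFunction for Hour {
    fn transform(&self, input: Column) -> Result<Column, String> {
        const MICROS_PER_HOUR: i64 = 3_600_000_000;
        match input {
            Column::TimestampMicros(values) => Ok(Column::Int32(
                values
                    .iter()
                    // div_euclid rounds toward negative infinity, so
                    // pre-epoch timestamps land in the correct hour.
                    .map(|&v| v.div_euclid(MICROS_PER_HOUR) as i32)
                    .collect(),
            )),
            other => Err(format!("hour transform does not support {:?}", other)),
        }
    }
}
```

Because every transform takes and returns the same erased `Column` type, callers can hold them as `Box<dyn TransformFunction>` while each implementation validates its own input type.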

Xuanwo commented

I think sth like TransformFunction would be better.

LGTM

I prefer a trait because it makes the transform implementations more maintainable. And it seems it will not add much complexity to the code compared to an enum.

Cool, let's do this.

All transform functions have been implemented.