Add support for Feature Transformation on Writes for On Demand Feature Views

Question

Add support for Feature Transformation on Writes for On Demand Feature Views

franciscojavierarceo opened this issue a year ago · 6 comments

franciscojavierarceo commented a year ago

The problem

As discussed in #4365, we should add the ability to write an On Demand Feature View (ODFV) to store the output of the calculation.

The solution

The ideal solution would require a boolean to the ODFV decorator as metadata to control the write behavior and another boolean in the get_online_features method to allow for users to force features to be recomputed.

The writes would be done by calling push() or write_to_online_store() with the underlying raw data (inputs) and storing the transformed feature values (outputs) into the online store. The ODFV would be called before executing the writes.

The change for the ODFV definition would be:

@on_demand_feature_view(
    sources=[
        driver_hourly_stats_view,
        input_request
    ],
    schema=[
        Field(name='conv_rate_plus_val1_python', dtype=Float64),
        Field(name='conv_rate_plus_val2_python', dtype=Float64),
    ],
    mode="python",
    write_to_online_store=True,                #  THIS IS THE FIRST NEW BOOLEAN
)
def transformed_conv_rate_python(inputs: Dict[str, Any]) -> Dict[str, Any]:
    output: Dict[str, Any] = {
        "conv_rate_plus_val1_python": [
            conv_rate + val_to_add
            for conv_rate, val_to_add in zip(
                inputs["conv_rate"], inputs["val_to_add"]
            )
        ],
        "conv_rate_plus_val2_python": [
            conv_rate + val_to_add
            for conv_rate, val_to_add in zip(
                inputs["conv_rate"], inputs["val_to_add_2"]
            )
        ]
    }
    return output

And the change for the get_online_features call would be:

entity_rows = [
    {
        "driver_id": 1001,
        "val_to_add": 1,
        "val_to_add_2": 2,
    }
]

online_response = store.get_online_features(
    entity_rows=entity_rows,
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "transformed_conv_rate_python:conv_rate_plus_val1_python",
        "transformed_conv_rate_python:conv_rate_plus_val2_python",
    ],
    force_compute=False,                         #  THIS IS THE SECOND NEW BOOLEAN
).to_dict()

Again the write_to_online_store: bool parameter would dictate whether this ODFV would write to the online store and the force_compute: bool would dictate whether the ODFV would always recalculate the features. There's an argument to be made that we could skip the write_to_online_store in the FeatureView declaration but this metadata would be useful to have in the registry for users.

The write call would be the standard:

store.push("transformed_conv_rate_python", entity_rows, to=PushMode.ONLINE)
# or alternative
store.write_to_online_store("transformed_conv_rate_python", entity_rows, to=PushMode.ONLINE)

Alternatives

We discussed creating a different feature view for this behavior altogether but using the existing ODFV benefits from reusing a lot of existing code and documentation. Moreover, the industry has adopted this language so adding on top of the language feels more natural than adding entirely new language.

Additional context

After the implementation it would be ideal to add this as an example in the local Credit Scoring tutorial.

Answer 1 · 2024-08-02T05:26:07.000Z

There's an argument to be made that we could skip the write_to_online_store in the FeatureView declaration but this metadata would be useful to have in the registry for users.

I was actually thinking the other way around. If we manage to figure out how to use precomputed values only for identical inputs (plus I'm assuming that transformation function itself is deterministic), do we need a force_recompute field? If a user is completely sure that recompute would yield the same result, what's the point of not using the precomputed value? force_recompute in this example is also a bit awkward because it's configured for the whole request which might contain more than one odfvs, not per an individual odfv.

A couple of other points:

We probably need some sort of a ttl for storage. The hypothetical feast cleanup command could be the one that will clean up stale values afterwards.
Another thing that came to my mind is how we would like to handle scenarios when not all features from a particular odfv are requested. I guess this is less of a problem for pandas and python transforms as they have to recompute the whole thing even if not all features are requested, but is a harder problem for substrait.

Answer 2 · 2024-08-02T12:31:40.000Z

I was actually thinking the other way around. If we manage to figure out how to use precomputed values only for identical inputs (plus I'm assuming that transformation function itself is deterministic), do we need a force_recompute field? If a user is completely sure that recompute would yield the same result, what's the point of not using the precomputed value? force_recompute in this example is also a bit awkward because it's configured for the whole request which might contain more than one odfvs, not per an individual odfv.

We could use an idempotency key for this, this was something we had planned to add in my last job for this exact use case.

We probably need some sort of a ttl for storage. The hypothetical feast cleanup command could be the one that will clean up stale values afterwards.

Yeah, implementing feast cleanup handles it in general.

Another thing that came to my mind is how we would like to handle scenarios when not all features from a particular odfv are requested. I guess this is less of a problem for pandas and python transforms as they have to recompute the whole thing even if not all features are requested, but is a harder problem for substrait.

Agreed.

Answer 3 · 2024-08-05T17:45:33.000Z

FYI @tokoko I had created a ticket for this before actually here: #4077

Answer 4 · 2024-08-14T12:35:26.000Z

@tokoko @HaoXuAI @shuchu once this is done I'll consider ODFVs to be out of beta and a complete feature.

Answer 5 · 2024-08-14T13:37:31.000Z

I think interface itself is pretty much stable at this point, so in that sense I agree. Still, from the overall ux perspective, there are still known limitations depending on the use case:

pandas is slow both for offline and online.
python doesn't yet support offline.
substrait/ibis is decent for online, but not ideal for offline unless you're using one of the ibis-backed offline stores. plus, it's itself pretty much still experimental.

Answer 6 · 2024-08-14T13:56:45.000Z

That's a good point.

I think adding Python offline is the only other necessary item. Substrait/ibis is broader scope and not as used by the community yet.