Add write support for On Demand Feature Views
franciscojavierarceo opened this issue · 6 comments
The problem
As discussed in #4365, we should add the ability to write an On Demand Feature View (ODFV) to store the output of the calculation.
The solution
The ideal solution would require a boolean to the ODFV decorator as metadata to control the write behavior and another boolean in the get_online_features
method to allow for users to force features to be recomputed.
The writes would be done by calling push()
or write_to_online_store()
with the underlying raw data (inputs) and storing the transformed feature values (outputs) into the online store. The ODFV would be called before executing the writes.
The change for the ODFV definition would be:
@on_demand_feature_view(
sources=[
driver_hourly_stats_view,
input_request
],
schema=[
Field(name='conv_rate_plus_val1_python', dtype=Float64),
Field(name='conv_rate_plus_val2_python', dtype=Float64),
],
mode="python",
write_to_online_store=True, # THIS IS THE FIRST NEW BOOLEAN
)
def transformed_conv_rate_python(inputs: Dict[str, Any]) -> Dict[str, Any]:
output: Dict[str, Any] = {
"conv_rate_plus_val1_python": [
conv_rate + val_to_add
for conv_rate, val_to_add in zip(
inputs["conv_rate"], inputs["val_to_add"]
)
],
"conv_rate_plus_val2_python": [
conv_rate + val_to_add
for conv_rate, val_to_add in zip(
inputs["conv_rate"], inputs["val_to_add_2"]
)
]
}
return output
And the change for the get_online_features
call would be:
entity_rows = [
{
"driver_id": 1001,
"val_to_add": 1,
"val_to_add_2": 2,
}
]
online_response = store.get_online_features(
entity_rows=entity_rows,
features=[
"driver_hourly_stats:conv_rate",
"driver_hourly_stats:acc_rate",
"transformed_conv_rate_python:conv_rate_plus_val1_python",
"transformed_conv_rate_python:conv_rate_plus_val2_python",
],
force_compute=False, # THIS IS THE SECOND NEW BOOLEAN
).to_dict()
Again the write_to_online_store: bool
parameter would dictate whether this ODFV would write to the online store and the force_compute: bool
would dictate whether the ODFV would always recalculate the features. There's an argument to be made that we could skip the write_to_online_store
in the FeatureView declaration but this metadata would be useful to have in the registry for users.
The write call would be the standard:
store.push("transformed_conv_rate_python", entity_rows, to=PushMode.ONLINE)
# or alternative
store.write_to_online_store("transformed_conv_rate_python", entity_rows, to=PushMode.ONLINE)
Alternatives
We discussed creating a different feature view for this behavior altogether but using the existing ODFV benefits from reusing a lot of existing code and documentation. Moreover, the industry has adopted this language so adding on top of the language feels more natural than adding entirely new language.
Additional context
After the implementation it would be ideal to add this as an example in the local Credit Scoring tutorial.
There's an argument to be made that we could skip the
write_to_online_store
in the FeatureView declaration but this metadata would be useful to have in the registry for users.
I was actually thinking the other way around. If we manage to figure out how to use precomputed values only for identical inputs (plus I'm assuming that transformation function itself is deterministic), do we need a force_recompute
field? If a user is completely sure that recompute would yield the same result, what's the point of not using the precomputed value? force_recompute
in this example is also a bit awkward because it's configured for the whole request which might contain more than one odfvs, not per an individual odfv.
A couple of other points:
-
We probably need some sort of a ttl for storage. The hypothetical
feast cleanup
command could be the one that will clean up stale values afterwards. -
Another thing that came to my mind is how we would like to handle scenarios when not all features from a particular odfv are requested. I guess this is less of a problem for
pandas
andpython
transforms as they have to recompute the whole thing even if not all features are requested, but is a harder problem forsubstrait
.
I was actually thinking the other way around. If we manage to figure out how to use precomputed values only for identical inputs (plus I'm assuming that transformation function itself is deterministic), do we need a force_recompute field? If a user is completely sure that recompute would yield the same result, what's the point of not using the precomputed value? force_recompute in this example is also a bit awkward because it's configured for the whole request which might contain more than one odfvs, not per an individual odfv.
We could use an idempotency key for this, this was something we had planned to add in my last job for this exact use case.
We probably need some sort of a ttl for storage. The hypothetical feast cleanup command could be the one that will clean up stale values afterwards.
Yeah, implementing feast cleanup
handles it in general.
Another thing that came to my mind is how we would like to handle scenarios when not all features from a particular odfv are requested. I guess this is less of a problem for pandas and python transforms as they have to recompute the whole thing even if not all features are requested, but is a harder problem for substrait.
Agreed.
I think interface itself is pretty much stable at this point, so in that sense I agree. Still, from the overall ux perspective, there are still known limitations depending on the use case:
- pandas is slow both for offline and online.
- python doesn't yet support offline.
- substrait/ibis is decent for online, but not ideal for offline unless you're using one of the ibis-backed offline stores. plus, it's itself pretty much still experimental.
That's a good point.
I think adding Python offline is the only other necessary item. Substrait/ibis is broader scope and not as used by the community yet.