/feathub

FeatHub - A stream-batch unified feature store for real-time machine learning

Primary LanguagePythonApache License 2.0Apache-2.0

Feathub

Feathub is a feature store that facilitates feature development and deployment to achieve the following objectives:

  • Reduce duplication of data engineering efforts by allowing new ML projects to reuse and share a library of curated production-ready features already registered by existing projects in the same organization.
  • Simplify feature management by allowing users to specify feature definitions and feature processing jobs as code using a declarative framework.
  • Facilitate feature development-to-deployment iteration by allowing users to use the same declarative feature definitions across training and serving, online and offline, without training-serving skew. Feathub takes care of compiling feature definitions into efficient processing jobs and executing those jobs in a distributed cluster.

Feathub provides SDK and infra that enable the following capabilities:

  • Define feature-view (a group of related features) as transformations and joins of the existing feature-views and data sources.
  • Register and retrieve feature-views by names from feature registry.
  • Transform and materialize features for the given time range and/or keys from the feature view into feature stores, by applying transformations on source dataset with point-in-time correctness.
  • Fetch online features by joining features from online feature store with on-demand transformations.

Architecture and Concepts

The figure above shows the key components and concepts in the FeatHub architecture.

The figure above shows how a developer can use FeatHub to facilitate feature engineering for model training and model inference. Please checkout Feathub architecture and concepts for more details.

Getting Started

Prerequisites

Prerequisites for building python packages:

  • Unix-like operating system (e.g. Linux, Mac OS X)
  • x86_64 architecture
  • Python 3.7
  • Java 8
  • Maven >= 3.1.1

Install Feathub

Install Nightly Build

Run the following command to install preview(nightly) version of Feathub.

$ python -m pip install feathub-nightly

Install From Source

Run the following command to install Feathub from source.

# Build Java dependencies for Feathub 
$ cd java
$ mvn clean package -DskipTests
$ cd ..

# Install Feathub
$ python -m pip install ./python

Supported Processor

Feathub provides different processors to compute features with different computation engine. Here is a list of supported processors and versions:

  • Local
  • Flink 1.15.2

Quickstart

Quickstart using Local Processor

Execute the following command to run the nyc_taxi.py demo which demonstrates the capabilities described above.

$ python python/feathub/examples/nyc_taxi.py

Quickstart using Flink Processor

If you are interested in computing the Feathub features with Flink processor in a local Flink cluster. You can see the following quickstart with different deployment modes:

Highlighted Capabilities

Define Features via Table Joins with Point-in-Time Correctness

f_price = Feature(
    name="price",
    dtype=types.Float32,
    transform=JoinTransform(
        table_name="price_update_events",
        feature_name="price"
    ),
    keys=["item_id"],
)

Define Over Window Aggregation Features

f_total_payment_last_two_minutes = Feature(
    name="total_payment_last_two_minutes",
    dtype=types.Float32,
    transform=OverWindowTransform(
        expr="item_count * price",
        agg_func="SUM",
        window_size=timedelta(minutes=2),
        group_by_keys=["user_id"]
    )
)

Define Sliding Window Aggregation Features

f_total_payment_last_two_minutes = Feature(
    name="total_payment_last_two_minutes",
    dtype=types.Float32,
    transform=SlidingWindowTransform(
        expr="item_count * price",
        agg_func="SUM",
        window_size=timedelta(minutes=2),
        step_size=timedelta(minutes=1),
        group_by_keys=["user_id"]
    )
)

Define Features via Built-in Functions

f_trip_time_duration = Feature(
    name="f_trip_time_duration",
    dtype=types.Int32,
    transform="UNIX_TIMESTAMP(taxi_dropoff_datetime) - UNIX_TIMESTAMP(taxi_pickup_datetime)",
)

Define Feature via Python UDF

f_lower_case_name = Feature(
    name="lower_case_name",
    dtype=types.String,
    transform=PythonUdfTransform(lambda row: row["name"].lower()),
)

Additional Resources

  • This tutorial provides more details on how to define, extract and serve features using Feathub.
  • This document explains the Feathub expression language.
  • This document introduces the Flink processor that computes the features with Flink.
  • feathub-examples provides additional FeatHub quickstarts that cover key FeatHub APIs and functionalities.

Developer Guidelines

Install Development Dependencies

$ python -m pip install -r python/dev-requirements.txt

Running All Tests

$ pytest -W ignore::DeprecationWarning

Code Formatting

Feathub uses Black to format Python code, flake8 to check Python code style, and mypy to check type annotation.

Run the following command to format codes, check code style, and check type annotation before uploading PRs for review.

# Format python code
$ python -m black python

# Check python code style
$ python -m flake8 --config=python/setup.cfg python

# Check python type annotation
$ python -m mypy --config-file python/setup.cfg python

Roadmap

Here is a list of key features that we plan to support. Stay tuned!

  • Support all FeatureView transformations with FlinkProcessor
  • Support all FeatureView transformations with LocalProcessor
  • Support common online and offline feature storages (e.g. MaxCompute, Redis, HDFS)
  • Support online transformation with feature service
  • Support integration with Notebook
  • Support feature metadata exploration (e.g. definition, lineage, metrics) with FeatHub UI
  • Support feature monitoring

Contact Us

Chinese-speaking users are recommended to join the following DingTalk group for questions and discussion.

English-speaking users can use this invitation link to join our Slack channel for questions and discussion.

We are actively looking for user feedback and contributors from the community. Please feel free to create pull requests and open Github issues for feedback and feature requests.

Come join us!