-
This project presents a feature store approach using Pyspark, within a simple, flexible and reliable structure.
-
All components are containerized, such as Spark, Cassandra, Hadoop, Hive, PostgreSQL, by docker-compose.
-
For the restricted, and limited development time, I do not inventing the wheel, so I'm using a framework that abstracts a good part of the Pyspark components, such as Kafka integration, DataFrames transformations, and saving it into data storages. From a pipeline structure, all Spark transformations can be organized and applied both for online and offline features. By the end, the only Spark action is to save the data into two data destinations, hive metastore for offline, and Cassandra for online features.
-
The Butterfree (https://github.com/quintoandar/butterfree) is an open-source framework, developed by QuintoAndar, and serves well for prototype and when the scope is small or medium in size.
Clone repository:
git clone https://github.com/rafaelsandroni/spark-based-feature-store
cd Spark-based-feature-store
Docker-compose:
docker-compose up
- spark-app will run the pipeline
Execution:
- When you run docker-compose, the spark-app container will run the pipeline and show, at the end, a query consuming data from online and offline features.
- But you can open the notebooks (in ./notebooks), and do it yourself.
Project structure (into spark-based-feature-store folder):
notebooks
src
conf
connectors
pipeline.py
extract.py
tranform.py
load.py
Ports:
TCP 4041, Spark UI