svoe: A Python repository from anovv

What is SVOE?

SVOE is a low-code declarative framework providing scalable and highly configurable pipelines for streaming and batch feature engineering, predictive model training, real-time inference and backtesting. Built on top of Ray, the framework lets you build and scale custom pipelines from a multi-core laptop to a cluster of thousands of nodes.

SVOE was originally built for a typical financial data research workflow (e.g. for quant researchers) with specific data models in mind (trades, quotes, order book updates, etc.), which is why some examples come from this domain. The framework itself is domain-agnostic, however, and its components can easily be generalised to other fields that rely on real-time, time-series-based data processing and simulation (anomaly detection, sales forecasting, etc.).

How does it work?

SVOE consists of three main components, each providing a set of tools for a typical Quant/ML engineer workflow:

Featurizer helps define, calculate and store real-time/offline (batch) features. It uses a custom stream processing engine (Ray Actors + ZeroMQ) and a Kappa architecture to calculate offline features using online pipelines
Trainer allows training predictive models in a distributed setting using popular ML libraries (XGBoost, PyTorch)
Backtester is used to validate and test predictive models along with user-defined logic (e.g. trading strategies if used in the financial domain)
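
To illustrate the Kappa-architecture idea (offline features come from the same code path as online ones), here is a minimal sketch. It does not use SVOE, Ray or ZeroMQ; `mid_price_stream`, `replay` and the event layout are hypothetical names chosen only for this illustration.

```python
from typing import Dict, Iterable, Iterator, List

Event = Dict[str, float]


def mid_price_stream(events: Iterable[Event]) -> Iterator[Dict[str, float]]:
    """Single feature pipeline: consumes events one by one, emits mid-prices."""
    for event in events:
        yield {
            "timestamp": event["timestamp"],
            "mid_price": (event["best_bid"] + event["best_ask"]) / 2,
        }


def replay(stored_events: List[Event]) -> Iterator[Event]:
    """Offline mode: replay stored events in timestamp order as if they were live."""
    yield from sorted(stored_events, key=lambda e: e["timestamp"])


# Made-up historical events, deliberately stored out of order.
history = [
    {"timestamp": 2.0, "best_bid": 101.0, "best_ask": 103.0},
    {"timestamp": 1.0, "best_bid": 100.0, "best_ask": 102.0},
]

# Batch features are produced by pushing replayed history through the same
# function that would consume a live feed.
offline_features = list(mid_price_stream(replay(history)))
```

Because there is only one feature function, a live feed and a historical replay cannot drift apart in how a feature is computed.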

You can read more in the docs

Why use SVOE?

Easy to use standardized and flexible data and computation models - seamlessly switch between real-time and historical data for feature engineering, ML training and backtesting
Low code, modularity and configurability - define reusable components such as FeatureDefinition, DataSourceDefinition, FeaturizerConfig, TrainerConfig, BacktesterConfig etc. to easily run your experiments
Avoid train-predict inconsistency - Featurizer uses same feature definition for real-time inference and batch training
No need for external data infra/DWH - Featurizer Storage lets you store and catalog computed features in any object storage while keeping the index in any SQL backend, and provides a Data Exploration API
Ray integration - SVOE runs wherever Ray runs (everywhere!)
MLFlow integration - store, retrieve and analyze your ML models with MLFlow API
Cloud / Kubernetes ready - use KubeRay or native Ray on AWS to scale out your workloads in the cloud
Easily integrates with orchestrators (Airflow, Luigi, Prefect) - SVOE provides basic Airflow Operators for each component to easily orchestrate your workflows
Real-time inference without MLOps burden - no need to maintain model containerization pipelines, FastAPI services and model registries. Deploy with a simple Python API or YAML using InferenceLoop
Designed for high-volume, fine-grained (tick-level) data - as an example, when used in the financial domain, unlike existing financial ML frameworks which use only OHLCV as a base data model, SVOE's Featurizer provides flexible tools to use and customize any data source (ticks, trades, book updates, etc.) and build streaming and historical features
Minimized number of external dependencies - SVOE is built using Ray Core primitives and has no heavyweight external dependencies (stream processors, distributed computing engines, storages, etc.), which allows for easy deployment and maintenance and minimizes costly data transfers. The only dependency is an SQL database of the user's choice. And it's all Python!
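
To make the "object storage plus SQL index" idea from the list above more concrete, here is a rough sketch using an in-memory SQLite database. The table layout, the `blocks_for` helper and the `s3://example-bucket/...` paths are all made up for illustration and are not Featurizer Storage's actual schema or API.

```python
import sqlite3

# Hypothetical index: one row per stored feature block, pointing at an object path.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE feature_blocks ("
    "feature_key TEXT, start_ts REAL, end_ts REAL, path TEXT)"
)
conn.executemany(
    "INSERT INTO feature_blocks VALUES (?, ?, ?, ?)",
    [
        ("mid_price", 0.0, 60.0, "s3://example-bucket/mid_price/block-0.parquet"),
        ("mid_price", 60.0, 120.0, "s3://example-bucket/mid_price/block-1.parquet"),
        ("volatility", 0.0, 60.0, "s3://example-bucket/volatility/block-0.parquet"),
    ],
)


def blocks_for(feature_key: str, start_ts: float, end_ts: float) -> list:
    """Return object-store paths of blocks overlapping [start_ts, end_ts)."""
    rows = conn.execute(
        "SELECT path FROM feature_blocks "
        "WHERE feature_key = ? AND start_ts < ? AND end_ts > ? "
        "ORDER BY start_ts",
        (feature_key, end_ts, start_ts),
    )
    return [path for (path,) in rows]


paths = blocks_for("mid_price", 30.0, 90.0)
```

The point of the split is that the SQL side only answers "which blocks cover this time range", while the bulk data stays in cheap object storage.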

Installation

Install from PyPI. Be aware that SVOE requires Python 3.10+.

pip install svoe

For a local environment, launch the standalone setup on your laptop. This starts a local Ray cluster, creates and populates an SQLite database, spins up an MLFlow tracking server and loads sample data from a remote store (S3). Make sure all necessary dependencies are present.

svoe standalone

For a distributed setting, please refer to Running on remote clusters

Quick Start

This example uses a scenario that is common in financial market simulation, but the framework is not limited to financial data and can be used with any scenario the user provides. The tutorial below builds a simple mid-price prediction model based on past price and volatility in three steps.

Run Featurizer to construct mid-price and volatility features from partial order book updates, with a 5-second lookahead label as the prediction target, using 1-second granularity data

Define featurizer-config.yaml

start_date: '2023-02-01 10:00:00'
end_date: '2023-02-01 11:00:00'
label_feature_index: 0
label_lookahead: '5s'
features_to_store: [0, 1]
feature_configs:
  - feature_definition: price.mid_price_fd.MidPriceFD
    name: mid_price
    params:
      data_source: &id001
        - exchange: BINANCE
          instrument_type: spot
          symbol: BTC-USDT
      feature:
        sampling: 1s
  - feature_definition: volatility.volatility_stddev_fd.VolatilityStddevFD
    params:
      data_source: *id001
      feature:
        sampling: 1s

See MidPriceFD and VolatilityStddevFD for implementation details
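
For intuition about what the config above asks for, here is a plain-Python sketch of a mid-price feature, a rolling standard-deviation volatility feature and a 5-second lookahead label on 1-second samples. This is not how MidPriceFD or VolatilityStddevFD are implemented; the quotes are made up, and the window size and the choice to take the standard deviation of prices (rather than returns) are assumptions.

```python
from statistics import pstdev
from typing import List, Optional

# Hypothetical 1-second samples of (best_bid, best_ask); values are made up.
quotes = [(100.0, 102.0), (101.0, 103.0), (102.0, 104.0), (101.0, 103.0),
          (100.0, 102.0), (103.0, 105.0), (104.0, 106.0), (105.0, 107.0)]

LOOKAHEAD_STEPS = 5  # '5s' lookahead at '1s' sampling
WINDOW = 3           # rolling window for volatility, an arbitrary choice

# Mid-price: average of best bid and best ask at each sample.
mid_price = [(bid + ask) / 2 for bid, ask in quotes]

# Volatility: population stddev of the last WINDOW mid-prices (None until warm).
volatility: List[Optional[float]] = [
    pstdev(mid_price[i - WINDOW + 1 : i + 1]) if i >= WINDOW - 1 else None
    for i in range(len(mid_price))
]

# Label: the mid-price LOOKAHEAD_STEPS samples in the future (None near the end).
label: List[Optional[float]] = [
    mid_price[i + LOOKAHEAD_STEPS] if i + LOOKAHEAD_STEPS < len(mid_price) else None
    for i in range(len(mid_price))
]
```

Rows where the label is None have no future value yet, so in this sketch they would be dropped before training.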

Run Featurizer
- CLI: svoe featurizer run <path_to_config> --ray-address <addr> --parallelism <num-workers>
- Python API: Featurizer.run(path=<path_to_config>, ray_address=<addr>, parallelism=<num_workers>)

Once calculation is finished, load sampled FeatureLabelSet dataframe to your local client

CLI: svoe featurizer get-data --every-n <every_nth_row>
Python API: Featurizer.get_materialized_data(pick_every_nth_row=<every_nth_row>)

      timestamp  receipt_timestamp  label_mid_price-mid_price  mid_price-mid_price  feature_VolatilityStddevFD_62271b09-volatility
0     1.675234e+09       1.675234e+09                  23084.800            23084.435                                        0.000547
1     1.675234e+09       1.675234e+09                  23083.760            23084.355                                        0.040003
2     1.675234e+09       1.675234e+09                  23083.505            23084.635                                        0.117757
3     1.675234e+09       1.675234e+09                  23084.610            23085.020                                        0.257091
4     1.675234e+09       1.675234e+09                  23084.725            23084.800                                        0.242034
...            ...                ...                        ...                  ...                                             ...
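
The every-nth-row sampling used above can be pictured with a small standalone helper. `every_nth` is a hypothetical name for this sketch and not part of the SVOE API.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def every_nth(rows: Iterable[T], n: int) -> Iterator[T]:
    """Lazily keep rows 0, n, 2n, ... so large results can be downsampled cheaply."""
    if n < 1:
        raise ValueError("n must be >= 1")
    return islice(rows, 0, None, n)


sampled = list(every_nth(range(10), 3))
```

Downsampling like this keeps the first row and then skips ahead, which is usually enough for a quick look or a plot on the client side.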

We can also visualize the results
- CLI: svoe featurizer plot --every-n <every_nth_row>

Once our FeatureLabelSet is calculated and loaded in cluster memory, let's use Trainer to train an XGBoost model to predict the mid-price 5 seconds ahead, validate the model, tune hyperparameters and pick the best model

Define config

xgboost:
  params:
    tree_method: 'approx'
    objective: 'reg:linear'
    eval_metric: [ 'logloss', 'error' ]
  num_boost_rounds: 10
  train_valid_test_split: [0.5, 0.3]
num_workers: 3
tuner_config:
  param_space:
    params:
      max_depth:
        randint:
          lower: 2
          upper: 8
      min_child_weight:
        randint:
          lower: 1
          upper: 10
  num_samples: 8
  metric: 'train-logloss'
  mode: 'min'
max_concurrent_trials: 3
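
One plausible reading of train_valid_test_split: [0.5, 0.3] is "50% train, 30% validation, the remainder test", applied without shuffling so that time order is preserved. The sketch below implements that reading; it is an assumption about the semantics, and `split_ordered` is a hypothetical helper, not Trainer code.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_ordered(
    rows: Sequence[T], fractions: Sequence[float]
) -> Tuple[List[T], List[T], List[T]]:
    """Split time-ordered rows into train/valid/test without shuffling."""
    train_frac, valid_frac = fractions
    if train_frac <= 0 or valid_frac <= 0 or train_frac + valid_frac >= 1:
        raise ValueError("fractions must be positive and sum to less than 1")
    n = len(rows)
    train_end = round(n * train_frac)
    valid_end = train_end + round(n * valid_frac)
    # Whatever is left after train and valid becomes the test set.
    return list(rows[:train_end]), list(rows[train_end:valid_end]), list(rows[valid_end:])


train, valid, test = split_ordered(list(range(10)), [0.5, 0.3])
```

Keeping the split chronological matters for time series: shuffling would let the model see data from after the period it is validated on.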

Run Trainer

CLI: svoe trainer run --config-path <config-path> --ray-address <addr>

Python API:

config = TrainerConfig.load_config(config_path)
trainer_manager = TrainerManager(config=config, ray_address=ray_address)
trainer_manager.run(trainer_run_id='sample-run-id', tags={})

Visualize predictions
- CLI: svoe trainer predictions --model-uri <model-uri>

Select best model

CLI: svoe trainer best-model --metric-name valid-logloss --mode min

Python API:

mlflow_client = SvoeMLFlowClient()
best_model_uri = mlflow_client.get_best_checkpoint_uri(metric_name=metric_name, experiment_name=experiment_name, mode=mode)
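
Conceptually, tuning plus best-model selection boils down to sampling parameter sets, scoring each one, and picking the best score for the given metric and mode. Here is a toy sketch of that loop with no Ray or MLFlow involved; `toy_loss` stands in for real training, the metric name is made up, and treating the upper bound of randint as exclusive is an assumption.

```python
import random
from typing import Dict, List

# (lower, upper) bounds mirroring the randint entries of a tuner param space.
param_space = {
    "max_depth": (2, 8),
    "min_child_weight": (1, 10),
}


def sample_params(rng: random.Random) -> Dict[str, int]:
    """Draw one integer per parameter, upper bound exclusive (an assumption)."""
    return {name: rng.randrange(lo, hi) for name, (lo, hi) in param_space.items()}


def toy_loss(params: Dict[str, int]) -> float:
    """Stand-in for a real validation metric; no model is trained here."""
    return abs(params["max_depth"] - 5) + 0.1 * params["min_child_weight"]


def best_trial(trials: List[dict], metric: str, mode: str) -> dict:
    """Pick the trial with the lowest ('min') or highest ('max') metric value."""
    if mode not in ("min", "max"):
        raise ValueError("mode must be 'min' or 'max'")
    pick = min if mode == "min" else max
    return pick(trials, key=lambda t: t[metric])


rng = random.Random(0)  # fixed seed so the sketch is reproducible
trials = []
for _ in range(8):  # corresponds to num_samples: 8
    params = sample_params(rng)
    trials.append({"params": params, "valid-loss": toy_loss(params)})

best = best_trial(trials, metric="valid-loss", mode="min")
```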

In this example, we use Backtester in the context of financial markets, so our user-defined logic is based on the notion of a trading strategy. This can be extended to any other scenario the user wants to emulate. Once we have our best model, we can plug it into a class derived from BaseStrategy and run Backtester to simulate our scenario

Define config

featurizer_config_path: featurizer-config.yaml
inference_config:
  model_uri: <your-best-model-uri>
  predictor_class_name: 'XGBoostPredictor'
  num_replicas: <number-of-predictor-replicas>
simulation_class_name: 'backtester.strategy.ml_strategy.MLStrategy'
simulation_params:
  buy_delta: 0
  sell_delta: 0
user_defined_params:
  portfolio_config: <portfolio_config>
  tradable_instruments_params:
    - exchange: 'BINANCE'
      instrument_type: 'spot'
      symbol: 'BTC-USDT'

See MLStrategy for example implementation
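
The exact meaning of buy_delta and sell_delta is defined by the strategy implementation. As a purely hypothetical reading, they could act as thresholds on the gap between the predicted and the current mid-price, as in the sketch below. `decide` is an illustrative helper and not MLStrategy's actual logic.

```python
def decide(current_mid: float, predicted_mid: float,
           buy_delta: float = 0.0, sell_delta: float = 0.0) -> str:
    """Return 'buy', 'sell' or 'hold' from a price prediction (hypothetical rule)."""
    # Buy when the model expects the price to rise by more than buy_delta.
    if predicted_mid - current_mid > buy_delta:
        return "buy"
    # Sell when the model expects the price to fall by more than sell_delta.
    if current_mid - predicted_mid > sell_delta:
        return "sell"
    return "hold"


signals = [
    decide(100.0, 100.5),                 # expected rise
    decide(100.0, 99.5),                  # expected fall
    decide(100.0, 100.0),                 # no expected move
    decide(100.0, 100.5, buy_delta=1.0),  # rise too small to act on
]
```

With both deltas at 0, as in the config above, any predicted move in either direction would trigger a trade under this reading; larger deltas would filter out small predicted moves.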

Run Backtester
- CLI: svoe backtester run --config-path <config-path> --ray-address <addr> --num-workers <num-workers>
- Python API:
```
config = BacktesterConfig.load_config(config_path)
backtester = Backtester.from_config(config)
backtester.run_remotely(ray_address=ray_address, num_workers=num_workers)
```
This will run a distributed event-driven backtest using features and models defined earlier
Get stats with backtester.get_stats()
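
For readers new to event-driven backtesting, the core idea is a single loop that processes events from several sources strictly in timestamp order. The sketch below shows that idea on one machine with made-up data; `run_event_loop` and the event tuples are hypothetical and say nothing about how Backtester distributes work on Ray.

```python
import heapq
from typing import Callable, Iterable, List, Tuple

Event = Tuple[float, str, float]  # (timestamp, kind, value)


def run_event_loop(streams: List[Iterable[Event]],
                   on_event: Callable[[Event], None]) -> None:
    """Merge already-sorted event streams by timestamp and dispatch each event."""
    for event in heapq.merge(*streams, key=lambda e: e[0]):
        on_event(event)


# Two made-up, individually time-ordered streams.
prices = [(1.0, "price", 100.0), (3.0, "price", 101.0)]
predictions = [(2.0, "prediction", 100.5), (4.0, "prediction", 100.8)]

seen: List[Event] = []
run_event_loop([prices, predictions], seen.append)
```

Processing events in one global time order is what prevents the simulated logic from reacting to information that would not have been available yet.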

Documentation

We try to keep the docs as up to date and detailed as possible. Please leave feedback if you have any questions.

Contributions

SVOE is an open-source first project and we would love to get feedback and contributions from the community! The project is in a very early stage and is still a work in progress, so any help would be greatly appreciated! Please feel free to open GitHub issues with questions/bugs or PRs with contributions!