Numerai Signals Pipeline

Downloads data from Yahoo Finance, generates features, trains a model and submits the predictions to the Numerai Signals tournament.

Running the pipeline

python launcher.py <your_properties_file>.json

The JSON properties file must contain three keys:

{
    "model_id": "xyz", 
    "public_id": "xyz",
    "secret_key": "xyz"
}

Corresponding to:

  • model_id: Numerai ID of the model we want to submit predictions to.
  • public_id: Our public Numerai key.
  • secret_key: Our private Numerai key.

Make sure your properties file is listed in .gitignore, because it contains sensitive data.
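As an illustration of the properties file described above, the following minimal sketch reads the JSON file and checks that the three keys are present. The helper name `load_properties` is hypothetical and is not taken from the pipeline's own launcher code.

```python
import json
from pathlib import Path

# The three keys the properties file must contain, as listed above.
REQUIRED_KEYS = ("model_id", "public_id", "secret_key")


def load_properties(path):
    """Read the JSON properties file and check the three required keys.

    Hypothetical helper: the real launcher may validate its input differently.
    """
    with Path(path).open(encoding="utf-8") as handle:
        properties = json.load(handle)
    # Treat absent or empty values as missing.
    missing = [key for key in REQUIRED_KEYS if not properties.get(key)]
    if missing:
        raise KeyError(f"Missing keys in properties file: {missing}")
    return properties
```

Failing early on a missing key avoids running the download and training steps only to fail at submission time. The returned `public_id` and `secret_key` would presumably be the values handed to numerapi when submitting.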

Output data

Once the pipeline finishes, there will be 3 folders with data files:

  • db_raw_downloaded: Contains data downloaded from our source (currently Yahoo Finance). We keep it as there is an option to run the pipeline with already downloaded data.
  • db_ml_csv: Contains data to train, validate and predict. We can use this file to improve training or try other models outside this pipeline.
  • db_predictions: Contains a file that will be submitted automatically to the indicated model. We can also manually upload it as a diagnostics file.

If we want to remove the data from these 3 folders, we will have to do it manually.
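Since the cleanup is manual, a small helper like the one below could do it. This is only a sketch: the folder names come from the list above, but their location (here, relative to a `base_dir` argument) and the function name `clear_data_folders` are assumptions.

```python
import shutil
from pathlib import Path

# Folder names taken from the list above; where they live on disk is an assumption.
DATA_FOLDERS = ("db_raw_downloaded", "db_ml_csv", "db_predictions")


def clear_data_folders(base_dir=".", folders=DATA_FOLDERS):
    """Delete everything inside each data folder, keeping the folder itself."""
    removed = []
    for name in folders:
        folder = Path(base_dir) / name
        if not folder.is_dir():
            continue  # nothing to clean if the pipeline has not created it yet
        for entry in folder.iterdir():
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)  # real sub-directory: remove recursively
            else:
                entry.unlink()  # regular file or symlink
            removed.append(entry)
    return removed
```

Calling `clear_data_folders()` from the repository root would empty all three folders. Passing `folders=("db_predictions",)` would leave the downloaded data in place, which matters if the pipeline is later run with the skip-download option.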

Configurations

The configuration Python file defines three configurations. These are the parameters we might want to adjust:

Submission Configuration:
  • skip_check_needs_submission: Skips running the pipeline if predictions for that model have already been submitted.
  • numerai_submit: Submits (or not) predictions using numerapi.
Indicators Configuration:
  • indicators_static: List of indicators with a non-configurable time interval.
  • indicators_dynamic: List of indicators with configurable time intervals.
  • static_lags: List of lags used to create features for the indicators_static. Each number is multiplied by the non-configurable time interval and cast to an integer; the indicator is then shifted by the resulting value.
  • dynamic_lags: List of lags used to create features for the indicators_dynamic. The values are used directly as shifts.
  • dynamic_windows: Time intervals for the indicators_dynamic.
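
The difference between the two kinds of lags can be sketched as follows. This is not the pipeline's code: the function names, the `base_interval` parameter and the use of plain lists (rather than whatever data frame the pipeline uses) are assumptions made for illustration.

```python
def lag_to_shift(lag, base_interval=None):
    """Turn a configured lag into a number of rows to shift.

    Static lags: multiplied by the indicator's fixed interval, then cast to int.
    Dynamic lags (base_interval=None): used directly as the shift.
    """
    if base_interval is None:
        return int(lag)
    return int(lag * base_interval)


def shift(values, periods):
    """Shift a series forward by `periods` rows, padding the start with None."""
    if periods < 0:
        raise ValueError("this sketch only handles non-negative shifts")
    padded = [None] * periods + list(values)
    return padded[: len(values)]


# Hypothetical indicator values for one ticker, oldest first.
indicator = [50, 55, 60, 65, 70]

# Static lag 0.5 on an indicator with a fixed interval of 4 rows -> shift by 2.
static_feature = shift(indicator, lag_to_shift(0.5, base_interval=4))

# Dynamic lag 1 -> shift by 1.
dynamic_feature = shift(indicator, lag_to_shift(1))
```

Here `static_feature` is `[None, None, 50, 55, 60]` and `dynamic_feature` is `[None, 50, 55, 60, 65]`. Note that the cast to integer truncates, so a lag of 1.5 on an interval of 3 becomes a shift of 4.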
Data Configuration:
  • skip_download: Runs the pipeline without downloading the data from the provider (useful if we already did that).
  • static_data: Uses a folder with already downloaded data (train, validation and targets), which must be placed there manually. Useful for training and validating at any time with the same data. With this option, the number of live tickers to submit will be 0, unless the static data is the same as what a regular download would return during that week.
  • target_name: Name of the target column.
  • raw_data: Data provider name (currently only 'yahoo' is accepted).
  • transformation_type: Used to build features from the indicators. Currently accepts binning, zscoring and ranking.
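
To make the three transformation types concrete, here is a minimal sketch of each applied to one indicator across a group of tickers. The exact definitions used by the pipeline are not documented here, so the formulas (equal-count bins, population standard deviation, average rank for ties), the default of 5 bins, the function names and the option strings in `TRANSFORMATIONS` are all assumptions.

```python
from statistics import mean, pstdev


def rank_pct(values):
    """Percentile rank in (0, 1]; tied values share their average rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank of the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank / n
        i = j + 1
    return ranks


def zscore(values):
    """Subtract the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant input: no spread to scale by
    return [(v - mu) / sigma for v in values]


def bin_equal_count(values, n_bins=5):
    """Assign each value to one of `n_bins` bins of roughly equal size."""
    n = len(values)
    bins = []
    for r in rank_pct(values):
        position = r - 0.5 / n  # centre of the value's rank slot, in (0, 1)
        bins.append(min(int(position * n_bins), n_bins - 1))
    return bins


# Assumed mapping from the transformation_type option to a function.
TRANSFORMATIONS = {
    "binning": bin_equal_count,
    "zscoring": zscore,
    "ranking": rank_pct,
}


def transform(values, transformation_type):
    """Apply the transformation selected by name to a non-empty list."""
    if transformation_type not in TRANSFORMATIONS:
        raise ValueError(f"Unknown transformation_type: {transformation_type!r}")
    return TRANSFORMATIONS[transformation_type](values)
```

For example, `transform([10.0, 30.0, 20.0, 40.0], "ranking")` gives `[0.25, 0.75, 0.5, 1.0]`. All three put indicators with different scales on a comparable footing; ranking and binning discard the size of the gaps between values, while z-scoring keeps it.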

Targets

We are creating our own custom target, but the pipeline currently uses the one provided by Numerai.

Future work

Some ideas that can be integrated into the pipeline or implemented using the data it generates:

  • Mean Decrease Accuracy for feature selection.
  • Era-wise Time Series Purged Cross Validation for hyper-parameter tuning.
  • Add regularization to Z-Score and Rank transformations.
  • Try paid sources of data to generate features based on fundamentals or improve the quality of the ones based on price and volume.
  • Remove last x train eras or first x validation ones to avoid data leakage on the validation metrics.
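
As a rough illustration of the era-wise purged cross-validation and era-removal ideas above, the sketch below splits eras into contiguous validation folds and drops a gap of `purge` eras on each side of the validation block from the training set. None of this exists in the pipeline; the function name, the parameters and the assumption that eras are sortable labels are all hypothetical.

```python
def purged_era_splits(eras, n_splits=3, purge=1):
    """Yield (train_eras, validation_eras) pairs with a purge gap between them.

    `eras` is any iterable of sortable era labels (duplicates are ignored).
    Training eras closer than `purge` eras to the validation block are dropped,
    so overlapping targets cannot leak into the validation metrics.
    """
    unique = sorted(set(eras))
    fold_size = len(unique) // n_splits
    if fold_size == 0:
        raise ValueError("need at least as many eras as splits")
    for fold in range(n_splits):
        start = fold * fold_size
        # The last fold absorbs any remainder eras.
        stop = len(unique) if fold == n_splits - 1 else start + fold_size
        validation = unique[start:stop]
        train = [
            era
            for index, era in enumerate(unique)
            if index < start - purge or index >= stop + purge
        ]
        yield train, validation


# Example with nine eras labelled 1..9.
splits = list(purged_era_splits(range(1, 10), n_splits=3, purge=1))
```

With nine eras, the middle fold validates on eras 4 to 6 and trains on eras 1, 2, 8 and 9: eras 3 and 7 are purged. The right value for `purge` depends on how many eras the target horizon spans, which is left open here.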