
███╗   ██╗██╗   ██╗███╗   ███╗███████╗██████╗  █████╗ ██╗
████╗  ██║██║   ██║████╗ ████║██╔════╝██╔══██╗██╔══██╗██║
██╔██╗ ██║██║   ██║██╔████╔██║█████╗  ██████╔╝███████║██║
██║╚██╗██║██║   ██║██║╚██╔╝██║██╔══╝  ██╔══██╗██╔══██║██║
██║ ╚████║╚██████╔╝██║ ╚═╝ ██║███████╗██║  ██║██║  ██║██║
╚═╝  ╚═══╝ ╚═════╝ ╚═╝     ╚═╝╚══════╝╚═╝  ╚═╝╚═╝  ╚═╝╚═╝

The official example scripts for the Numerai Data Science Tournament.

Quick Start

pip install -U pip && pip install -r requirements.txt
python example_model.py

Running the example script produces a validation_predictions.csv file, which you can upload at https://numer.ai/tournament to get model diagnostics.

TIP: The example_model.py script takes ~45-60 minutes to run. If you don't want to wait, you can upload example_diagnostic_predictions.csv to get diagnostics immediately.


If the current round is open (Saturday 18:00 UTC through Monday 14:30 UTC), you can submit your predictions and start getting results on live tournament data. You can create your submission by uploading the example_predictions.csv or your generated tournament_predictions.csv file at https://numer.ai/tournament.


Datasets

numerai_training_data

  • Description: Labeled training data

  • Dimensions: ~2M rows x ~1K columns

  • Size: ~10GB CSV (float32 features), ~5GB CSV (int8 features), ~1GB Parquet (float32/int8 features)

  • Columns:

    • "id": string labels of obfuscated stock IDs

    • "era": string labels of points in time for a block of IDs

    • "data_type": string label "train"

    • "feature_...": integer or floating-point numbers, obfuscated features for each stock ID

    • "target_": floating-point numbers, various measures of returns for each stock ID

  • Notes: Check out the analysis_and_tips notebook for a detailed walkthrough of this dataset.
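
For reference, here is a minimal loading sketch using pandas and the int8 parquet file (the exact filename is an assumption; use whichever version you downloaded):

import pandas as pd

# parquet + int8 keeps memory usage manageable on a 16GB machine
training_data = pd.read_parquet("numerai_training_data_int8.parquet")

feature_cols = [c for c in training_data.columns if c.startswith("feature_")]
print(f"{len(training_data):,} rows, {len(feature_cols)} features")

# eras are blocks of time; most analysis should be grouped by era
print(training_data.groupby("era").size().head())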

numerai_validation_data

  • Description: Labeled holdout set used to generate validation predictions and for computing validation metrics

  • Dimensions: ~540K rows x ~1K columns

  • Size: ~2.5GB CSV (float32 features), ~1.1GB CSV (int8 features), ~210MB Parquet (float32/int8 features)

  • Columns:

    • "id": string labels of obfuscated stock IDs

    • "era": string labels of points in time for a block of IDs

    • "data_type": string label "validation"

    • "feature_...": floating-point numbers, obfuscated features for each stock ID

    • "target_": floating-point numbers, various measures of returns for each stock ID

  • Notes: It is highly recommended that you do not train on the validation set. This dataset is used to generate all validation metrics in the diagnostics API.
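
As a rough local check before uploading to diagnostics, you can score validation predictions per era in the spirit of the example scripts (a sketch only, not the exact diagnostics computation; the "prediction" and "target" column names are assumptions):

import pandas as pd

def per_era_corr(df, pred_col="prediction", target_col="target"):
    # rank predictions within each era, then correlate with the target
    return df.groupby("era").apply(
        lambda era: era[pred_col].rank(pct=True).corr(era[target_col])
    )

# validation_data must already contain a column with your model's predictions
# corrs = per_era_corr(validation_data)
# print(corrs.mean(), corrs.std())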

numerai_tournament_data

  • Description: Unlabeled feature data used to generate tournament predictions (updated weekly)

  • Dimensions: ~1.4M rows x ~1K columns

  • Size: ~6GB CSV (float32 features), ~2.1GB CSV (int8 features), ~550MB Parquet (float32/int8 features)

  • Columns:

    • "id": string labels of obfuscated stock IDs

    • "era": string labels of points in time for a block of IDs

    • "data_type": string labels "test" and "live"

    • "feature_...": floating-point numbers, obfuscated features for each stock ID

    • "target_": NaN (not-a-number), intentionally left blank

  • Notes: Use this file to generate your tournament submission. This file changes every week, so make sure to download the most recent version of this file each round.
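
A minimal sketch of turning this file into a submission (the filename, the fitted model object, and the id-as-index parquet layout are assumptions; adjust to your own pipeline):

import pandas as pd

tournament_data = pd.read_parquet("numerai_tournament_data_int8.parquet")
feature_cols = [c for c in tournament_data.columns if c.startswith("feature_")]

# model is assumed to be an estimator you already trained on numerai_training_data
tournament_data["prediction"] = model.predict(tournament_data[feature_cols])

# submissions need an "id" column and a "prediction" column
submission = tournament_data.reset_index()[["id", "prediction"]]
submission.to_csv("tournament_predictions.csv", index=False)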

numerai_live_data

  • Description: Unlabeled feature data used to generate live predictions only (updated weekly)

  • Dimensions: ~5.3K rows x ~1K columns

  • Size: ~24MB CSV (float32 features), ~11MB CSV (int8 features), ~3MB Parquet (float32/int8 features)

  • Columns:

    • "id": string labels of obfuscated stock IDs

    • "era": string labels of points in time for a block of IDs

    • "data_type": string label "live"

    • "feature_...": floating-point numbers, obfuscated features for each stock ID

    • "target_": NaN (not-a-number), intentionally left blank

  • Notes: Use this file to generate the live portion of your tournament submission when your test predictions are unchanged from week to week and already saved (see the sketch below). This file changes every week, so make sure to download the most recent version each round.
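
One way to splice fresh live predictions into saved test predictions (a sketch; the file names, the fitted model object, and the id-as-index parquet layout are assumptions):

import pandas as pd

live_data = pd.read_parquet("numerai_live_data_int8.parquet")
feature_cols = [c for c in live_data.columns if c.startswith("feature_")]

# previously saved predictions for the (unchanged) test rows
test_preds = pd.read_csv("test_predictions.csv", index_col="id")["prediction"]

# fresh predictions for this week's live rows only; model is your trained estimator
live_preds = pd.Series(
    model.predict(live_data[feature_cols]),
    index=live_data.index,
    name="prediction",
)

# a full submission is the saved test portion plus the new live portion
submission = pd.concat([test_preds, live_preds]).rename_axis("id").reset_index()
submission.to_csv("tournament_predictions.csv", index=False)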

example_validation_predictions

  • Description: The predictions generated by the example_model on the numerai_validation_data

  • Dimensions: ~540K rows x 1 column

  • Size: ~14MB CSV

  • Columns:

    • "id": string labels of obfuscated stock IDs

    • "prediction": floating-point numbers between 0 and 1 (exclusive)

  • Notes: Useful for ensuring you can get diagnostics and debugging your prediction file if you receive an error from the diagnostics API. This is what your uploads to diagnostics should look like (same ids and data types).

example_predictions

  • Description: The predictions generated by the example_model on the numerai_tournament_data

  • Dimensions: ~1.4M rows x 1 column

  • Size: ~37MB CSV

  • Columns:

    • "id": string labels of obfuscated stock IDs

    • "prediction": floating-point numbers between 0 and 1 (exclusive)

  • Notes: Useful for ensuring you can make a submission and debugging your prediction file if you receive an error from the submissions API. This is what your submissions should look like (same ids and data types).
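
Before uploading, you can sanity-check your own file against example_predictions (a sketch; assumes both CSVs are in the working directory and your file is named tournament_predictions.csv):

import pandas as pd

example = pd.read_csv("example_predictions.csv", index_col="id")
mine = pd.read_csv("tournament_predictions.csv", index_col="id")

# same set of ids, no extras, no gaps
assert set(mine.index) == set(example.index), "id mismatch"

# predictions must be floating-point numbers strictly between 0 and 1
assert ((mine["prediction"] > 0) & (mine["prediction"] < 1)).all(), "predictions out of range"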

old_data_new_val

  • Description: The legacy validation data mapped onto the new validation period

  • Dimensions: ~540K rows x ~310 columns

  • Size: ~69MB Parquet

  • Columns:

    • "id": string labels of obfuscated stock IDs

    • "era": string labels of points in time for a block of IDs

    • "data_type": string label "validation"

    • "feature_...": floating-point numbers, obfuscated features for each stock ID

    • "target_": floating-point numbers, various measures of returns for each stock ID

  • Notes: Run your legacy models (models trained on the legacy dataset) against this file to generate validation predictions that are comparable to your new models (models trained on the new dataset).

Next Steps

Research

The example model is a good baseline, but we can do much better. Check out example_model_advanced for the best model made by Numerai's internal research team (it takes ~2-3 hours to run!), and learn more about the underlying concepts used to construct it in the analysis_and_tips notebook.

Check out the forums for in-depth discussions on model research.

Staking

Once you have a model you are happy with, you can stake NMR on it to start earning rewards.

Head over to the website to get started or read more about staking in our official rules and getting started guide.

Automation

You can upload your predictions directly to our GraphQL API or through the Python client.

To access the API, you must first create your API keys in your account page and provide them to the client:

import numerapi

# replace these with the API keys generated on your account page
example_public_id = "somepublicid"
example_secret_key = "somesecretkey"
napi = numerapi.NumerAPI(example_public_id, example_secret_key)

After instantiating the NumerAPI client with API keys, you can then upload your submissions programmatically:

# upload predictions
model_id = napi.get_models()['your_model_name']
napi.upload_predictions("tournament_predictions.csv", model_id=model_id)
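
Putting it together, a sketch of a weekly automated run (assumes numerapi's download_dataset helper, available in recent numerapi releases; the parquet filename and the generate_predictions function are placeholders for your own pipeline):

# fetch this week's tournament data before predicting
napi.download_dataset(
    "numerai_tournament_data_int8.parquet", "numerai_tournament_data_int8.parquet"
)

# your own code that reads the data, predicts, and writes tournament_predictions.csv
generate_predictions("numerai_tournament_data_int8.parquet")

# upload the fresh predictions under the chosen model
model_id = napi.get_models()["your_model_name"]
napi.upload_predictions("tournament_predictions.csv", model_id=model_id)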

The recommended setup for a fully automated submission process is to use Numerai Compute. Please see the Numerai CLI documentation for instructions on how to deploy your models to AWS.

FAQ

What are the system requirements?

  • Minimum: 16GB RAM and 4 cores (Intel i5) / 6 cores (AMD Ryzen 5)
  • Recommended: 32GB RAM and 8 cores (Intel i7/i9) / 12 cores (AMD Ryzen 7/9)

What exactly is in the Numerai dataset?

The Numerai Dataset contains decades of historical data on the global stock market. Each era represents a time period and each id represents a stock. The features are made from market and fundamental measures of the companies, and the targets are a measure of return.

The stock ids, features, and targets are intentionally obfuscated.

How often is the dataset updated?

The historical portions of the dataset (training_data, validation_data) are relatively static and are updated about every 3-6 months, usually just by adding more rows.

The live portion of the dataset (tournament_data) is updated every week and represents the latest state of the global stock market.

What is Parquet?

Parquet is an efficient and performant file format that is IO optimized for reading in subsets of columns at a time.

Use the parquet versions (instead of the standard CSV) of the dataset files to minimize time spent on IO (downloading and reading the file into memory).

Use the int8 version (features are stored as int8 instead of the standard float32) of the parquet file to further minimize memory usage.
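
For example, pandas can read just the columns you need from a parquet file instead of the whole ~1K-column table (a sketch; the filename and column selection are assumptions):

import pandas as pd

# only the listed columns are read from disk; everything else is skipped
cols = ["era", "target"]  # add whichever feature_ columns your model uses
df = pd.read_parquet("numerai_training_data_int8.parquet", columns=cols)
print(df.memory_usage(deep=True).sum() / 1e6, "MB in memory")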

What is the "new" vs "legacy" dataset?

In September of 2021, Numerai released a new version of the dataset. Read more about it here.

Models trained on the legacy dataset will continue to work, but it is highly recommended that everyone upgrade to the new dataset because of the major performance improvements.

All example code in this repo has been updated to work with the new dataset only.

Where can I find this legacy dataset?

You can continue to download the legacy dataset from the website and the API, but it will eventually be deprecated.

Use the dataset query in the GraphQL API without passing any round number to download the legacy dataset zip.

How should I migrate my legacy models to the new dataset?

The easiest way to get started with the new dataset is to check out the new example models and analysis and tips notebook in this repo.

Also check out this deep dive on the new dataset in the forum.

Support

If you need help or have any questions, please connect with us on our community chat or forums.

If something in this repo doesn't work, please file an issue.