description |
---|
Everything you need to know to get started in under 5 minutes |
Overview
Introduction
The Numerai Data Science tournament is where data scientists from around the world build machine learning models on Numerai's obfuscated financial dataset to predict the stock market.
If you are just getting started, this is the overview for you!
Data
The Numerai dataset is a tabular dataset that describes the global stock market over time.
Each row represents a stock at a specific point in time, where `id` is the stock id and the `era` is the date. The `features` describe the known attributes of the stock at the time (e.g. P/E ratio) and the `target` represents a measure of future returns (e.g. after 20 days).
Here is how to download the data in Python using NumerAPI:
{% code lineNumbers="true" %}
import pandas as pd
# NumerAPI is the official Python client
from numerapi import NumerAPI
napi = NumerAPI()
# Download training data
napi.download_dataset("v4.1/train.parquet")
training_data = pd.read_parquet("v4.1/train.parquet")
{% endcode %}
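To make the row layout concrete, here is a small hand-made frame shaped like the description above. It is only a sketch: the ids, eras, the column names `feature_a` and `feature_b`, and all values are invented for illustration, and the real dataset has far more rows and features with its own naming.

```python
import pandas as pd

# Toy data only: ids, eras, feature names and values are invented.
toy = pd.DataFrame(
    {
        "era": ["0001", "0001", "0002", "0002"],
        "feature_a": [0.25, 0.75, 0.50, 1.00],
        "feature_b": [1.00, 0.00, 0.25, 0.50],
        "target": [0.50, 0.25, 0.75, 0.50],
    },
    index=pd.Index(["id_1", "id_2", "id_3", "id_4"], name="id"),
)

# Pick out the feature columns by name.
feature_cols = [c for c in toy.columns if "feature" in c]

# Count how many stocks (rows) appear in each era.
rows_per_era = toy.groupby("era").size()
```

Grouping by `era` like this is a common first step, since each era is a separate snapshot of the market.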
See the Data section for more details and examples.
Modeling
Your objective is to build machine learning models to predict the target.
Here is an example model in Python using LightGBM:
import lightgbm as lgb
model = lgb.LGBMRegressor(
    n_estimators=2000,
    learning_rate=0.01,
    max_depth=5,
    num_leaves=2 ** 5,
    colsample_bytree=0.1
)
model.fit(
    training_data[[f for f in training_data.columns if "feature" in f]],
    training_data["target"]
)
See the Models section for more advanced example models.
Submissions
The tournament is organized into rounds starting Saturday, Tuesday, Wednesday, Thursday and Friday every week. Each round goes through 4 stages over the span of a month:
- Open: live features released and submission window open
- Closed: submission window closed
- Scoring: submissions begin scoring
- Resolved: scoring complete and payouts resolved
To compete in the tournament you must submit live predictions in every round. Here is an example in Python using NumerAPI:
import pandas as pd
from numerapi import NumerAPI
# Authenticate
napi = NumerAPI("api-public-id", "api-secret-key")
# Get current round
current_round = napi.get_current_round()
# Download latest live features
napi.download_dataset(f"v4.1/live_{current_round}.parquet")
live_data = pd.read_parquet(f"v4.1/live_{current_round}.parquet")
live_features = live_data[[f for f in live_data.columns if "feature" in f]]
# Generate live predictions (model.predict returns a NumPy array, so wrap it before saving)
live_predictions = model.predict(live_features)
submission = pd.Series(live_predictions, index=live_features.index).to_frame("prediction")
submission.to_csv(f"prediction_{current_round}.csv")
# Submit predictions
napi.upload_predictions(f"prediction_{current_round}.csv", model_id="your-model-id")
See the Submissions section for more details and examples.
Scoring
There are two main scores:
- Correlation (`CORR`): Your prediction's correlation to the target
- True contribution (`TC`): Your prediction's contribution to the hedge fund's returns
Here are the `CORR` and `TC` scores of our example model over the past 1 year of submissions.
https://numer.ai/integration_test
See the Scoring section for more details.
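As a rough illustration of what a correlation score measures, the sketch below computes a plain Spearman rank correlation per era and averages it. This is an assumption made for illustration, not Numerai's official `CORR` formula, and it says nothing about how `TC` is computed. The data and the helper name `rank_corr` are invented.

```python
import pandas as pd

# Toy predictions and targets; every value here is invented for illustration.
df = pd.DataFrame(
    {
        "era": ["0001"] * 4 + ["0002"] * 4,
        "prediction": [0.1, 0.2, 0.3, 0.4, 0.4, 0.3, 0.2, 0.1],
        "target": [0.0, 0.25, 0.5, 0.75, 0.0, 0.25, 0.5, 0.75],
    }
)


def rank_corr(prediction: pd.Series, target: pd.Series) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return prediction.rank().corr(target.rank())


# Score each era separately, then average across eras.
per_era = pd.Series(
    {era: rank_corr(g["prediction"], g["target"]) for era, g in df.groupby("era")}
)
mean_corr = per_era.mean()
```

In this toy data the first era is ranked perfectly and the second is ranked in exactly the wrong order, so the per-era scores cancel out in the average. Scoring per era rather than over all rows at once keeps one unusual date from dominating the result.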
Staking
You can stake NMR on your model to earn payouts based on performance.
Your payout is primarily a function of your scores. If you have a positive score you will get a payout. If you have a negative score a portion of your stake will burn.
The maximum payout or burn per round is capped at ±5% of your stake:
payout = stake * clip(payout_factor * (corr * corr_mult + tc * tc_mult), -0.05, 0.05)
- `stake` is your model's stake value at the close of the round
- `payout_factor` is a dynamic value that scales inversely with total NMR staked
- `corr_mult` and `tc_mult` are configured by you to control your exposure to each score
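The payout formula above can be written out as a small function. This is a minimal sketch that transcribes the formula directly; the stake, `payout_factor`, scores and multipliers used in the examples are hypothetical values chosen to show the cap, not real tournament numbers.

```python
def clip(x: float, lo: float, hi: float) -> float:
    """Limit x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def payout(stake, payout_factor, corr, tc, corr_mult, tc_mult):
    """Payout (positive) or burn (negative) in NMR for one round."""
    score = corr * corr_mult + tc * tc_mult
    return stake * clip(payout_factor * score, -0.05, 0.05)


# Hypothetical inputs: 100 NMR staked, payout_factor of 0.5.
modest_gain = payout(100, 0.5, corr=0.02, tc=0.01, corr_mult=1.0, tc_mult=2.0)

# A very large score hits the +5% cap, a very negative one hits the -5% cap.
capped_gain = payout(100, 0.5, corr=0.5, tc=0.0, corr_mult=1.0, tc_mult=2.0)
capped_burn = payout(100, 0.5, corr=-0.5, tc=0.0, corr_mult=1.0, tc_mult=2.0)
```

With these inputs the first call stays well inside the cap, while the other two are limited to 5% of the stake in either direction.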
See the Staking section for more details.
Leaderboard
The 1 year average score is also called `reputation`, and your rank on the leaderboard is based on your model's 1 year average `TC` score.
numer.ai/leaderboard
Support
Find us on Discord for questions, support, and feedback!