
Causality Benchmark Data

Showcasing Causality Group's benchmark data through a data loading library and a signal backtesting example.

Report Bug · Request Feature

Table of Contents
  1. About the Project
  2. Getting Started
  3. Backtesting and Data Layout
  4. License
  5. Contact

About the Project

Have you ever found yourself struggling to prepare clean financial data for analysis or to align data from various sources?

With this repository, you can explore Causality Group's curated historical dataset for academic and non-commercial use, covering the 1500 most liquid stocks in the US equities markets.

Features include:

  • Liquid universe of 1500 stocks, updated monthly
  • Free from survivorship bias
  • Daily Open, High, Low, Close, VWAP, and Volume
  • Overnight returns adjusted for splits, dividends, mergers, and acquisitions
  • Intraday 5-minute VWAP, spread, and volume snapshots
  • SPY ETF data for hedging
  • CAPM betas and residuals for market-neutral analysis

Please contact us on LinkedIn for access to the dataset!

More details here

(back to top)

Built With

  • Python
  • Poetry
  • Jupyter
  • Matplotlib
  • NumPy
  • pandas
  • scikit-learn
  • SciPy

(back to top)

Getting Started

Follow these steps to set up the project on your local machine for development and testing purposes.

Prerequisites

Ensure you have the following installed on your local setup:

  • Python
  • Poetry

Installation

  1. Clone the repository.
  2. Install the dependencies:
poetry install

Optional: If you want to use the Jupyter kernel, install the optional jupyter group of dependencies with poetry install --with jupyter.

  3. Install the pre-commit hooks:
poetry run pre-commit install

You're all set! Pre-commit hooks will run on git commit (more information in pre-commit docs). Ensure your changes pass all checks before pushing.

Available Scripts

  • poetry run black ./causalitydata: Runs the code formatter.
  • poetry run pylint ./causalitydata: Runs the linter.
  • poetry run install-ipykernel: Installs the causality kernel for Jupyter.
  • poetry run uninstall-ipykernel: Uninstalls the causality kernel for Jupyter.

Note: To run the ipykernel scripts you need to install the optional jupyter group of dependencies. Use poetry install --with jupyter.

(back to top)

Backtesting and Data Layout

Backtesting

01-Backtesting-Signals.ipynb serves as a minimal example of utilizing the dataset and library for quantitative analysis, alpha signal research, and backtesting.

The example showcases a daily backtest, relying on close-to-close adjusted returns of the 1500 most liquid companies in the US since 2007. Since the most liquid companies change constantly, we update our liquid universe at the start of each month. This dynamic universe is already pre-calculated in the universe.csv data file.

Assuming trading at the 16:00 close auction in the US, our example uses only features for alpha creation that are observable by 15:45. We plot the performance of some well-known alpha factors and invite you to experiment with building your own quantitative investment model from there!
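To give a feel for the mechanics before opening the notebook, the snippet below is a minimal sketch of such a daily close-to-close backtest. It is not the notebook's actual code: the data/ directory and the 5-day reversal signal are illustrative assumptions, while the file names follow the Data Layout section below.

import pandas as pd

DATA_DIR = "data"  # assumed location of the benchmark .csv files

def load(name):
    # Every benchmark file: dates on the index, one column per asset.
    return pd.read_csv(f"{DATA_DIR}/{name}.csv", index_col=0, parse_dates=True)

universe = load("universe")  # tradable-universe mask (assumed 1/0), rebalanced monthly
ret_cc = load("ret_cc")      # adjusted close-to-close returns

# Illustrative alpha: 5-day short-term reversal, restricted to the universe
# and demeaned cross-sectionally so the portfolio is roughly dollar-neutral.
signal = -ret_cc.rolling(5).sum().where(universe == 1)
weights = signal.sub(signal.mean(axis=1), axis=0)
weights = weights.div(weights.abs().sum(axis=1), axis=0)

# Positions formed at today's close earn the next close-to-close return.
pnl = (weights.shift(1) * ret_cc).sum(axis=1)
print(pnl.cumsum().tail())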

Data Layout

All data files in the benchmark dataset have the same structure:

  • Data files are in .csv format.
  • The first row contains the header.
  • Rows represent different dates in increasing order. There is only one row per date, i.e., there is no intraday granularity within the files.
  • The first column is the index and contains the date at which the given value is observable:
    • Date format: YYYY-MM-DD.
  • Every other column represents an individual asset in the universe:
    • Asset identifier format: <ticker>_<exchange>_<CFI>. E.g., AAPL_XNAS_ESXXXX.
  • All files have the same number of rows and columns.

There are two types of files in the dataset: daily and intraday. Daily files contain data with at most one datapoint per day, e.g., open auction price, daily volume, GICS sector information, etc. Intraday files contain information about market movements during the US trading session, e.g., intraday prices and volumes, accumulated in 5-minute bars. Intraday file names start with an integer identifying the bar time.
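Because every file shares this layout, any of them can be read directly into a pandas DataFrame with the dates as the index and one column per asset. A small sketch (the relative path is an assumption):

import pandas as pd

# The first column becomes a DatetimeIndex of observation dates;
# the remaining columns are the asset identifiers.
universe = pd.read_csv("universe.csv", index_col=0, parse_dates=True)

print(universe.index[:3])    # dates in increasing order, one row per date
print(universe.columns[:3])  # asset identifiers such as 'AAPL_XNAS_ESXXXX'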

File Description

Here we detail the contents of files whose purpose may not be obvious from their name.

  • Daily
    • universe.csv: Mask of the tradable universe at each date. The universe is rebalanced at the beginning of each month.
    • ret_<cc, co, oc, oo>.csv: Adjusted asset returns calculated on different time periods:
      • cc: Close-to-Close, the position is entered at the close auction and exited at the following day's close auction.
      • co: Close-to-Open, the position is entered at the close auction and exited at the following day's open auction.
      • oc: Open-to-Close, the position is entered at the open auction and exited at the same day's close auction.
      • oo: Open-to-Open, the position is entered at the open auction and exited at the following day's open auction.
    • SPY_ret_<cc, co, oc, oo>.csv: SPY ETF return. The SPY time series is placed in all asset columns for convenience.
    • beta_<cc, co, oc, oo>.csv: CAPM betas between assets and the SPY ETF for different time periods.
    • resid_<cc, co, oc, oo>.csv: CAPM residual returns on different time periods, computed as resid = ret - beta * SPY_ret (see the sketch after this list).
  • Intraday
    • <hhmmss>_<close, cost, return, volume, vwas, vwap>_5m.csv: Intraday market data snapshots at the <hhmmss> bar. These backward-looking bars are calculated over the time range [t-5min, t).
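As a sketch of how these files fit together, the close-to-close residuals can be reconstructed from the corresponding return, beta, and SPY files. The file paths and the 154500 bar time below are illustrative assumptions:

import pandas as pd

def load(name):
    return pd.read_csv(f"{name}.csv", index_col=0, parse_dates=True)

ret_cc = load("ret_cc")      # adjusted close-to-close returns
beta_cc = load("beta_cc")    # CAPM betas vs. SPY
spy_cc = load("SPY_ret_cc")  # SPY return, repeated in every asset column

# resid = ret - beta * SPY_ret; all three frames share the same dates and
# columns, so this is a plain element-wise operation.
resid_cc = ret_cc - beta_cc * spy_cc

# Intraday files follow <hhmmss>_<field>_5m.csv; for example, a 15:45 VWAP
# bar (illustrative bar time) would load the same way:
vwap_1545 = load("154500_vwap_5m")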

(back to top)

License

Distributed under the BSD 3-Clause License. See LICENSE for more information.

(back to top)

Contact

Please reach out to us on LinkedIn or visit our website!

(back to top)