facvae
How can stock returns be predicted from cross-sectional information without fixing the number of stocks in advance? This is a PyTorch implementation of FactorVAE with some improvements, based on "FactorVAE: A Probabilistic Dynamic Factor Model Based on Variational Autoencoder for Predicting Cross-Sectional Stock Returns".
1. Introduction
1.1. Abstract
As an asset pricing model in economics and finance, factor model has been widely used in quantitative investment. Towards building more effective factor models, recent years have witnessed the paradigm shift from linear models to more flexible nonlinear data-driven machine learning models. However, due to low signal-to-noise ratio of the financial data, it is quite challenging to learn effective factor models. In this paper, we propose a novel factor model, FactorVAE, as a probabilistic model with inherent randomness for noise modeling. Essentially, our model integrates the dynamic factor model (DFM) with the variational autoencoder (VAE) in machine learning, and we propose a prior-posterior learning method based on VAE, which can effectively guide the learning of model by approximating an optimal posterior factor model with future information. Particularly, considering that risk modeling is important for the noisy stock data, FactorVAE can estimate the variances from the distribution over the latent space of VAE, in addition to predicting returns. The experiments on the real stock market data demonstrate the effectiveness of FactorVAE, which outperforms various baseline methods.
1.2. Visualization
1.2.1. Brief illustration
1.2.2. Overall framework
1.2.3. Encoder-decoder architecture
1.2.4. Factor predictor with multi-head global attention mechanism
2. Notation
2.1. Scalar (constant)
- E: number of epochs (arbitrary)
- B: batch size (arbitrary)
- N: number of stocks (arbitrary)
- T: number of time periods (arbitrary)
- C: number of characteristics
- H: number of hidden features
- M: number of portfolios
- K: number of factors
2.2. Tensor (variable)
- x: characteristics, B*N*T*C
- y: stock returns, B*N
- e: hidden features, B*N*H
- y_p: portfolio returns, B*M
- z_post: posterior latent factor returns, B*K
- z_prior: prior latent factor returns, B*K
- alpha: idiosyncratic returns, B*N
- beta: factor exposures, B*N*K
- y_hat: reconstructed stock returns, B*N
- mu_post: mean vector of z_post, B*K
- sigma_post: std vector of z_post, B*K
- mu_prior: mean vector of z_prior, B*K
- sigma_prior: std vector of z_prior, B*K
- mu_alpha: mean vector of alpha, B*N
- sigma_alpha: std vector of alpha, B*N
- mu_y: mean vector of y_hat, B*N
- Sigma_y: cov matrix of y_hat, B*N*N
2.3. Distribution
- $p_{\theta}(y|x)$: true label, likelihood
- $q_{\phi}(z|x,y)$: encoder output, posterior distribution
- $q_{\phi}(z|x)$: predictor output, prior distribution
- $p_{\theta}(y|x,z)$: decoder output, conditional likelihood
- $f_{\phi,\theta}(y|x)$: predicted label, predicted likelihood
3. Module
3.1. __init__.py
- FactorVAE (top-level encapsulated class) extracts effective factors from noisy market data. It first obtains optimal factors through an encoder-decoder architecture with access to future data, and then trains a factor predictor with a prior-posterior learning method, so that the predicted factors approximate the optimal ones.
- PipelineFactorVAE, a subclass of Pipeline, automates the training, validation, and testing process of FactorVAE.
- loss_fn_vae() computes the loss value of the model.
- bcorr() calculates the batch correlation between two vectors.
- gaussian_kld() calculates the KL divergence between two multivariate independent Gaussian distributions.
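As a rough illustration of what the two helper functions describe, here is a minimal sketch of a KL divergence between two independent (diagonal) Gaussians and a row-wise Pearson correlation. The names `gaussian_kld_sketch` and `bcorr_sketch`, their signatures, and the `eps` term are assumptions made for this example and may differ from the functions shipped in the package.

```python
import torch


def gaussian_kld_sketch(mu1, sigma1, mu2, sigma2):
    """KL(N(mu1, diag(sigma1^2)) || N(mu2, diag(sigma2^2))), summed over the last axis."""
    var1, var2 = sigma1.pow(2), sigma2.pow(2)
    kld = torch.log(sigma2 / sigma1) + (var1 + (mu1 - mu2).pow(2)) / (2 * var2) - 0.5
    return kld.sum(dim=-1)


def bcorr_sketch(a, b, eps=1e-8):
    """Pearson correlation along the last axis, one value per batch row."""
    a = a - a.mean(dim=-1, keepdim=True)
    b = b - b.mean(dim=-1, keepdim=True)
    # eps (an assumption of this sketch) guards against division by zero
    return (a * b).sum(dim=-1) / (a.norm(dim=-1) * b.norm(dim=-1) + eps)


torch.manual_seed(0)
B, K, N = 4, 8, 74
mu_post, sigma_post = torch.randn(B, K), torch.rand(B, K) + 0.1
mu_prior, sigma_prior = torch.randn(B, K), torch.rand(B, K) + 0.1

kld = gaussian_kld_sketch(mu_post, sigma_post, mu_prior, sigma_prior)  # shape (B,)
y = torch.randn(B, N)
corr = bcorr_sketch(y, y)  # a vector correlated with itself
```

In a prior-posterior setup such as the one described above, the first pair of arguments would typically be the posterior parameters and the second pair the prior parameters.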
3.2. data.py
- RollingDataset yields characteristics x in R^{N*T*C} and future stock returns y in R^{N} in each iteration.
- change_freq() changes the frequency of the panel data.
- shift_ret() shifts returns to the previous period, then drops NaN values.
- wins_ret() winsorizes returns.
- assign_label() assigns labels based on different quantiles of the returns.
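The return preprocessing steps can be pictured with a small pandas sketch. It assumes a panel indexed by a (date, code) MultiIndex with a `ret` column, and it assumes winsorizing means clipping at a symmetric absolute threshold. Both are guesses made for illustration, and `shift_ret_sketch` and `wins_ret_sketch` are hypothetical stand-ins rather than the package's own functions.

```python
import pandas as pd

# Toy panel: 4 dates x 2 stocks (assumed layout, not the package's required format)
dates = pd.date_range("2022-01-03", periods=4, freq="D")
idx = pd.MultiIndex.from_product([dates, ["A", "B"]], names=["date", "code"])
df = pd.DataFrame(
    {"ret": [0.01, -0.02, 0.40, 0.03, -0.30, 0.00, 0.02, 0.01]}, index=idx
)


def shift_ret_sketch(df, col="ret"):
    """Move each stock's next-period return onto the current row, then drop NaN rows."""
    out = df.copy()
    out[col] = out.groupby(level="code")[col].shift(-1)
    return out.dropna(subset=[col])


def wins_ret_sketch(df, thresh, col="ret"):
    """Clip returns to [-thresh, thresh] (assumed meaning of the threshold)."""
    out = df.copy()
    out[col] = out[col].clip(lower=-thresh, upper=thresh)
    return out


shifted = shift_ret_sketch(df)           # last date of each stock is dropped
clipped = wins_ret_sketch(shifted, 0.25)  # extreme returns are capped
```

After the shift, the features observed at a given date line up with the return realized over the following period, which is what makes the return a prediction target.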
3.3. pipeline.py
- Pipeline provides a general machine learning pipeline that automates the model training, validation, and testing process.
- set_seeds() sets random seeds for all random processes.
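For reference, seeding "all random processes" in a PyTorch project usually means seeding Python, NumPy, and PyTorch together. The sketch below shows one common way to do this; `set_seeds_sketch` is a hypothetical name and the package's set_seeds() may cover more or fewer sources of randomness.

```python
import random

import numpy as np
import torch


def set_seeds_sketch(seed):
    """Seed the Python, NumPy, and PyTorch random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)


# Re-seeding with the same value reproduces the same random draws
set_seeds_sketch(0)
a = torch.rand(3)
set_seeds_sketch(0)
b = torch.rand(3)
```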
3.4. backtesting.py
- Backtester backtests cross-sectional strategies in three steps: 1) factor $\rightarrow$ pos; 2) pos + ret $\rightarrow$ strat_ret; 3) strat_ret $\rightarrow$ nv.
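The three backtesting steps can be sketched with pandas as follows. This example assumes a long-only, equal-weighted strategy that holds the top `top_pct` of stocks by factor value on each date, and a (date, code) MultiIndex panel. These choices are assumptions for illustration, and the actual Backtester may build positions differently.

```python
import pandas as pd

# Toy panel: 3 dates x 4 stocks, with "ret" already aligned to the holding period
dates = pd.date_range("2022-01-03", periods=3, freq="D")
idx = pd.MultiIndex.from_product([dates, list("ABCD")], names=["date", "code"])
df = pd.DataFrame(
    {
        "factor": [4, 3, 2, 1, 1, 2, 3, 4, 2, 4, 1, 3],
        "ret": [0.10, 0.0, 0.0, -0.10, 0.05, 0.0, 0.0, 0.20, 0.0, -0.50, 0.0, 0.0],
    },
    index=idx,
)
top_pct = 0.25

# 1) factor -> pos: equal weights on the top `top_pct` of stocks per date
rank = df["factor"].groupby(level="date").rank(ascending=False, method="first")
n = df["factor"].groupby(level="date").transform("count")
top = (rank <= (n * top_pct).clip(lower=1)).astype(float)
pos = top / top.groupby(level="date").transform("sum")

# 2) pos + ret -> strat_ret: weighted sum of stock returns per date
strat_ret = (pos * df["ret"]).groupby(level="date").sum()

# 3) strat_ret -> nv: net value as the cumulative product of gross returns
nv = (1 + strat_ret).cumprod()
```

In this toy panel the strategy holds stock A, then D, then B, so the per-period strategy returns are 0.10, 0.20, and -0.50.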
3.5. feature_extractor.py
- FeatureExtractor extracts the stocks' hidden features e from the historical sequential characteristics x.
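One way to map a B*N*T*C tensor of characteristics to B*N*H hidden features is to project each time step and run a GRU over the time axis, keeping the last hidden state. The sketch below follows that idea. The class name `FeatureExtractorSketch` and the specific layer choices are assumptions for illustration, not necessarily the module's actual architecture.

```python
import torch
from torch import nn


class FeatureExtractorSketch(nn.Module):
    """x (B, N, T, C) -> e (B, N, H) via a per-step projection followed by a GRU."""

    def __init__(self, C, H):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(C, H), nn.LeakyReLU())
        self.gru = nn.GRU(H, H, batch_first=True)

    def forward(self, x):
        B, N, T, C = x.shape
        # Treat every (batch, stock) pair as an independent sequence of length T
        h = self.proj(x.reshape(B * N, T, C))
        _, h_last = self.gru(h)  # h_last: (1, B*N, H)
        return h_last.squeeze(0).reshape(B, N, -1)


torch.manual_seed(0)
extractor = FeatureExtractorSketch(C=28, H=16)
e = extractor(torch.randn(2, 74, 5, 28))       # N = 74 stocks
e_small = extractor(torch.randn(2, 10, 5, 28))  # the same module with N = 10
```

Because stocks are folded into the batch dimension, the module has no parameter that depends on N, which is what allows the number of stocks to vary.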
3.6. factor_encoder.py
- FactorEncoder extracts posterior factors z_post from hidden features e and stock returns y. z_post is a random vector following an independent Gaussian distribution, described by the mean mu_post and the standard deviation sigma_post.
- PortfolioLayer dynamically re-weights the portfolios on the basis of the stocks' hidden features e.
- MappingLayer maps the portfolio returns y_p to the distribution of posterior factor returns z_post.
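The encoder's data flow (e and y to y_p, then y_p to mu_post and sigma_post) can be sketched as below. The softmax over stocks for portfolio weights and the softplus for the standard deviation are assumptions chosen for this example, and `FactorEncoderSketch` is a hypothetical class, not the module's actual code.

```python
import torch
from torch import nn


class FactorEncoderSketch(nn.Module):
    """(e, y) -> (mu_post, sigma_post), each of shape (B, K)."""

    def __init__(self, H, M, K):
        super().__init__()
        self.portfolio = nn.Linear(H, M)  # plays the role of a portfolio layer
        self.mu = nn.Linear(M, K)         # mapping to the posterior mean
        self.sigma = nn.Sequential(nn.Linear(M, K), nn.Softplus())  # positive std

    def forward(self, e, y):
        # Portfolio weights: softmax over the stock axis, so each portfolio sums to 1
        w = torch.softmax(self.portfolio(e), dim=1)  # (B, N, M)
        y_p = torch.einsum("bnm,bn->bm", w, y)       # portfolio returns (B, M)
        return self.mu(y_p), self.sigma(y_p)


torch.manual_seed(0)
B, N, H, M, K = 2, 74, 16, 24, 8
encoder = FactorEncoderSketch(H, M, K)
mu_post, sigma_post = encoder(torch.randn(B, N, H), torch.randn(B, N))
```

Aggregating stocks into a fixed number M of portfolios is what turns a variable-length cross-section into a fixed-size input for the mapping step.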
3.7. factor_decoder.py
- FactorDecoder calculates predicted stock returns y_hat from the distribution parameters of factor returns z (either z_post or z_prior) and hidden features e. y_hat is a random vector following a Gaussian distribution, described by the mean mu_y and the covariance matrix Sigma_y.
- AlphaLayer outputs idiosyncratic returns alpha from the hidden features e.
- BetaLayer calculates factor exposures beta from the hidden features e.
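Under the assumption that y_hat = alpha + beta z, with independent factors and idiosyncratic returns independent of the factors, the mean is mu_alpha + beta mu_z and the covariance is beta diag(sigma_z^2) beta^T + diag(sigma_alpha^2). The sketch below computes both. `FactorDecoderSketch` and its layer layout are hypothetical and only illustrate the shapes listed in the notation section.

```python
import torch
from torch import nn


class FactorDecoderSketch(nn.Module):
    """(e, mu_z, sigma_z) -> (mu_y (B, N), Sigma_y (B, N, N))."""

    def __init__(self, H, K, h_alpha_size):
        super().__init__()
        self.alpha_hidden = nn.Sequential(nn.Linear(H, h_alpha_size), nn.LeakyReLU())
        self.alpha_mu = nn.Linear(h_alpha_size, 1)
        self.alpha_sigma = nn.Sequential(nn.Linear(h_alpha_size, 1), nn.Softplus())
        self.beta = nn.Linear(H, K)

    def forward(self, e, mu_z, sigma_z):
        h = self.alpha_hidden(e)
        mu_alpha = self.alpha_mu(h).squeeze(-1)        # (B, N)
        sigma_alpha = self.alpha_sigma(h).squeeze(-1)  # (B, N)
        beta = self.beta(e)                            # (B, N, K)
        mu_y = mu_alpha + torch.einsum("bnk,bk->bn", beta, mu_z)
        # Assumes independent factors and alpha independent of z
        factor_cov = torch.einsum("bik,bk,bjk->bij", beta, sigma_z.pow(2), beta)
        Sigma_y = factor_cov + torch.diag_embed(sigma_alpha.pow(2))
        return mu_y, Sigma_y


torch.manual_seed(0)
B, N, H, K = 2, 74, 16, 8
decoder = FactorDecoderSketch(H, K, h_alpha_size=16)
mu_y, Sigma_y = decoder(torch.randn(B, N, H), torch.randn(B, K), torch.rand(B, K) + 0.1)
```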
3.8. factor_predictor.py
- FactorPredictor extracts prior factor returns z_prior from hidden features e. z_prior is a random vector following an independent Gaussian distribution, described by the mean mu_prior and the standard deviation sigma_prior.
- MultiheadGlobalAttention implements a specific type of multi-head global attention.
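To give a feel for attention that pools over all stocks, the sketch below uses one learnable query per factor, scores it against keys computed from e, and takes a softmax over the stock axis. This is a generic variant written for illustration. The scoring function, the one-head-per-factor layout, and the name `GlobalAttentionSketch` are assumptions, and the module's own attention may be defined differently.

```python
import torch
from torch import nn


class GlobalAttentionSketch(nn.Module):
    """e (B, N, H) -> (mu_prior, sigma_prior), each (B, K), with one head per factor."""

    def __init__(self, H, K, h_size):
        super().__init__()
        self.query = nn.Parameter(torch.randn(K, h_size))  # one learnable query per head
        self.key = nn.Linear(H, h_size)
        self.value = nn.Linear(H, h_size)
        self.mu = nn.Linear(h_size, 1)
        self.sigma = nn.Sequential(nn.Linear(h_size, 1), nn.Softplus())

    def forward(self, e):
        k, v = self.key(e), self.value(e)  # (B, N, h_size)
        scores = torch.einsum("kh,bnh->bkn", self.query, k) / k.shape[-1] ** 0.5
        attn = torch.softmax(scores, dim=-1)         # weights over stocks
        ctx = torch.einsum("bkn,bnh->bkh", attn, v)  # one context vector per head
        return self.mu(ctx).squeeze(-1), self.sigma(ctx).squeeze(-1)


torch.manual_seed(0)
B, N, H, K = 2, 74, 16, 8
predictor = GlobalAttentionSketch(H, K, h_size=16)
mu_prior, sigma_prior = predictor(torch.randn(B, N, H))
```

Since the queries are parameters rather than functions of the input, the attention summarizes the whole cross-section into K fixed-size vectors regardless of N.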
4. Example
import numpy as np
import pandas as pd
import torch
from facvae import FactorVAE, PipelineFactorVAE
from facvae.backtesting import Backtester
from facvae.data import RollingDataset, change_freq, shift_ret, wins_ret
from facvae.pipeline import set_seeds
from torch.utils.data import DataLoader

if __name__ == "__main__":
    # constants (see section 2.1 for E, B, N, T, C, H, M, K)
    E = 20
    B = 32
    N = 74
    T = 5
    C = 28
    H = 16
    M = 24
    K = 8
    h_prior_size = 32
    h_alpha_size = 16
    h_prior_size = 16  # reassigned: this value (16) overrides the 32 above
    partition = [0.8, 0.1, 0.1]
    lr = 0.01
    gamma = 2.0
    lmd = 0.5
    max_grad = None
    freq = "d"
    start_date = "2015-01-01"
    end_date = "2023-01-01"
    top_pct = 0.1
    wins_thresh = 0.25
    verbose_freq = None
    # placeholder output directory (not defined in the original snippet); adjust as needed
    dir_result = "result/"

    # data
    df = pd.read_pickle("df.pickle")
    df = df.loc[start_date:end_date]
    df = change_freq(df, freq)
    df = shift_ret(df)
    df = wins_ret(df, wins_thresh)

    # pipeline
    ds = RollingDataset(df, "ret", T)
    loss_kwargs = {"gamma": gamma, "lmd": lmd}
    eval_kwargs = {"df": df, "top_pct": top_pct}
    pl = PipelineFactorVAE(ds, partition, B, loss_kwargs, eval_kwargs)

    # search over random seeds, keeping models that pass both thresholds
    for i in range(2000):
        set_seeds(i)
        print("seed:", i)
        fv = FactorVAE(C, H, M, K, h_prior_size, h_alpha_size, h_prior_size)
        pl.train(fv, lr, E, max_grad, verbose_freq=verbose_freq)
        sr_valid = pl.validate(fv)
        sr_test = pl.test(fv)
        print(sr_valid)
        print(sr_test)
        if sr_valid > 1.5 and sr_test > 1.5:
            torch.save(fv, dir_result + f"model_{i}")

    # check: load a saved model, predict on the full dataset, and backtest
    model = torch.load(dir_result + "model_xxx")
    dl = DataLoader(ds, len(ds))
    x, y = next(iter(dl))
    mu_y, Sigma_y = model.predict(x)
    mu_y = mu_y.flatten().cpu().numpy()
    df["factor"] = np.nan
    df.iloc[-len(mu_y):, -1] = mu_y  # fill the last len(mu_y) rows of "factor"
    print(df)
    bt = Backtester("factor", top_pct=top_pct).feed(df).run()
    bt.report()