Implementation of code snippets and exercises from Machine Learning for Asset Managers (Elements in Quantitative Finance) written by Prof. Marcos López de Prado.
The project is for my own learning. If you want to use the concepts from the book, you should head over to Hudson & Thames: they have implemented these concepts and many more in mlfinlab. Edit: it seems some of their work - like the Jupyter notebooks - has gone behind a paywall.
For practical application see: Real world example.
Marcenko-Pastur theoretical probability density function, and empirical density function:
Figure 2.1: Marcenko-Pastur theoretical probability density function, and empirical density function
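As a reminder of what is being plotted, here is a minimal sketch of the theoretical density, in the spirit of code snippet 2.1 (variable names are illustrative; q = T/N and var is the variance of the underlying noise process):

```python
import numpy as np

def mp_pdf(var, q, pts):
    # Marcenko-Pastur pdf; q = T/N, var = variance of the underlying (noise) process
    eMin = var * (1 - (1. / q) ** .5) ** 2  # smallest eigenvalue expected from pure noise
    eMax = var * (1 + (1. / q) ** .5) ** 2  # largest eigenvalue expected from pure noise
    eVal = np.linspace(eMin, eMax, pts)
    pdf = q / (2 * np.pi * var * eVal) * ((eMax - eVal) * (eVal - eMin)) ** .5
    return eVal, pdf

eVal, pdf = mp_pdf(var=1., q=10., pts=1000)  # e.g. T = 10 * N observations per variable
```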
Denoising a random matrix with signal using the constant residual eigenvalue method. This is done by fixing the random (noise) eigenvalues. See code snippet 2.5.
Figure 2.2: A comparison of eigenvalues before and after applying the residual eigenvalue method
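A rough sketch of the denoising step, along the lines of snippet 2.5: the eigenvalues beyond the first nFacts signal eigenvalues are replaced by their average, and the matrix is rebuilt and rescaled back to a correlation matrix.

```python
import numpy as np

def denoised_corr(eVal, eVec, nFacts):
    # eVal: eigenvalues sorted in descending order (1-D array); eVec: matching eigenvectors.
    # Eigenvalues after the first nFacts are treated as noise and set to their average.
    eVal_ = eVal.copy()
    eVal_[nFacts:] = eVal_[nFacts:].sum() / float(eVal_.shape[0] - nFacts)
    corr1 = eVec @ np.diag(eVal_) @ eVec.T
    # rescale so the result has a unit diagonal (a proper correlation matrix)
    d = np.sqrt(np.diag(corr1))
    return corr1 / np.outer(d, d)
```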
A detoned covariance matrix can be used to calculate the minimum variance portfolio. The efficient frontier is the upper portion of the minimum variance frontier, starting at the minimum variance portfolio. A denoised covariance matrix is more stable under small changes in the data.
Note: Exercise 2.7: "Extend function fitKDE in code snippet 2.2, so that it estimates through cross-validation the optimal value of bWidth (bandwidth)".
The script ch2_fitKDE_find_bandwidth.py implements this procedure and produces the (green) KDE in figure 2.3:
Figure 2.3: Calculated bandwidth (green line) together with histogram and pdf. The green line is smoother. Bandwidth found: 0.03511191734215131
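One way to cross-validate the bandwidth (the script may differ in its details; this sketch uses scikit-learn's GridSearchCV to maximize the out-of-fold log-likelihood of a Gaussian KDE):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

def find_bandwidth(obs, bandwidths=np.linspace(.005, .25, 50), cv=10):
    # obs: 1-D array of observations (e.g. eigenvalues); returns the best bandwidth
    obs = np.asarray(obs).reshape(-1, 1)
    grid = GridSearchCV(KernelDensity(kernel='gaussian'),
                        {'bandwidth': bandwidths}, cv=cv)
    grid.fit(obs)
    return grid.best_params_['bandwidth']
```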
From code snippet 2.3 - with a random matrix with signal: the histogram shows how the eigenvalues of a random matrix with signal are distributed. The variance of the theoretical probability density function is then calculated by fitting the Marcenko-Pastur pdf to the empirical distribution.
Figure 2.4: Histogram and pdf of eigenvalues with signal
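A sketch of that fitting step, assuming it is done by minimizing the squared error between the theoretical pdf and a KDE of the observed eigenvalues (function and parameter names are illustrative):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import gaussian_kde

def mp_density(var, q, eVal):
    # theoretical Marcenko-Pastur density evaluated at the observed eigenvalues
    eMin, eMax = var * (1 - (1. / q) ** .5) ** 2, var * (1 + (1. / q) ** .5) ** 2
    return q / (2 * np.pi * var * eVal) * np.clip((eMax - eVal) * (eVal - eMin), 0, None) ** .5

def fit_mp_variance(eVal, q):
    # find the implied variance that best matches the empirical eigenvalue distribution
    kde = gaussian_kde(eVal)
    err = lambda var: np.sum((mp_density(var[0], q, eVal) - kde(eVal)) ** 2)
    return minimize(err, x0=[.5], bounds=[(1e-5, 1 - 1e-5)]).x[0]
```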
- definition of a metric:
- identity of indiscernibles: d(x,y) = 0 <=> x = y
- symmetry: d(x,y) = d(y,x)
- triangle inequality: d(x,z) <= d(x,y) + d(y,z)
- 1, 2, 3 => non-negativity: d(x,y) >= 0
- Pearson correlation
- distance correlation
- angular distance
- Information-theoretic codependence/entropy dependence
- entropy: H[X] = -Σ_{x∈S_X} p[x] log(p[x])
- Kullback-Leibler divergence: D_KL[p||q] = -Σ_{x∈S_X} p[x] log(q[x]/p[x]) = Σ_{x∈S_X} p[x] log(p[x]/q[x])
- Cross-entropy: H_C[p||q] = H[X] + D_KL[p||q]
- Mutual information: the decrease in uncertainty in X from knowing Y: I[X,Y] = H[X] - H[X|Y] = H[X] + H[Y] - H[X,Y] = E_X[D_KL[p[y|x]||p[y]]]
- Variation of information: VI[X,Y] = H[X|Y] + H[Y|X] = H[X,Y] - I[X,Y]. It is the uncertainty we expect in one variable given knowledge of the other: VI[X,Y] = 0 <=> X = Y
- The Kullback-Leibler divergence is not a metric, while the variation of information is.
>>> import scipy.stats as ss
>>> ss.entropy([1./2,1./2], base=2)
1.0
>>> ss.entropy([1,0], base=2)
0.0
>>> ss.entropy([1./3,2./3], base=2)
0.9182958340544894
- 1 bit of information in coin toss
- 0 bit of information in deterministic outcome
- less than 1 bit of information in unfair coin toss
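These quantities can be estimated from a 2-D histogram; a minimal sketch (the number of bins is an illustrative choice):

```python
import numpy as np
import scipy.stats as ss
from sklearn.metrics import mutual_info_score

def variation_of_information(x, y, bins=10, norm=False):
    # discretize x and y, then combine H[X], H[Y] and I[X,Y] into VI[X,Y]
    cXY = np.histogram2d(x, y, bins)[0]
    iXY = mutual_info_score(None, None, contingency=cXY)  # I[X,Y]
    hX = ss.entropy(np.histogram(x, bins)[0])             # H[X]
    hY = ss.entropy(np.histogram(y, bins)[0])             # H[Y]
    vXY = hX + hY - 2 * iXY                               # VI[X,Y] = H[X,Y] - I[X,Y]
    if norm:
        vXY /= hX + hY - iXY                              # normalize by the joint entropy H[X,Y]
    return vXY
```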
- Angular distance: d = sqrt(1/2 * (1 - rho(X,Y)))
- Absolute angular distance: d = sqrt(1 - |rho(X,Y)|)
- Squared angular distance: d = sqrt(1 - rho(X,Y)^2)
The standard angular distance is better suited for long-only portfolio applications; the squared and absolute angular distances suit long-short portfolios.
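The three variants in code (rho can be a scalar or a full correlation matrix):

```python
import numpy as np

def angular_distance(rho):
    # standard angular distance; suited to long-only applications
    return np.sqrt(0.5 * (1 - rho))

def absolute_angular_distance(rho):
    # treats rho and -rho as equally close; suited to long-short applications
    return np.sqrt(1 - np.abs(rho))

def squared_angular_distance(rho):
    # also ignores the sign of the correlation
    return np.sqrt(1 - rho ** 2)
```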
Unsupervised learning is used to maximize intragroup similarity and minimize intergroup similarity. Consider a matrix X of shape N x F: N objects and F features. The features are used to compute a proximity (e.g. correlation, mutual information) between the N objects, giving an N x N matrix.
There are two main types of clustering algorithms, partitional and hierarchical. Common approaches include:
- Connectivity: hierarchical clustering
- Centroids: like k-means
- Distribution: e.g. Gaussian mixtures
- Density: search for connected, dense regions, like DBSCAN and OPTICS
- Subspace: clusters are modeled on two dimensions, features and observations
Generating random block correlation matrices is used to simulate instruments with correlation. The utility for doing this is in code snippet 4.3, and it uses the Optimal Number of Clusters (ONC) clustering algorithm defined in snippets 4.1 and 4.2, which does not need a predefined number of clusters (unlike k-means) but uses an 'elbow method' to stop adding clusters. The optimal number of clusters is achieved when there is high intra-cluster correlation and low inter-cluster correlation. The silhouette score is used to minimize within-group distance and maximize between-group distance.
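A simplified sketch of the base clustering inside ONC (the real snippets 4.1/4.2 also repeat the k-means initialisation several times and re-cluster low-quality groups): scan the number of clusters and keep the clustering with the best silhouette quality.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

def base_cluster(corr, max_clusters=10, n_init=10):
    # cluster on the distance implied by the correlation matrix
    dist = np.sqrt(0.5 * (1 - np.asarray(corr)))
    best_q, best_labels = -np.inf, None
    for k in range(2, max_clusters + 1):
        kmeans = KMeans(n_clusters=k, n_init=n_init).fit(dist)
        silh = silhouette_samples(dist, kmeans.labels_)
        q = silh.mean() / silh.std()  # quality: high intra-cluster, low inter-cluster similarity
        if q > best_q:
            best_q, best_labels = q, kmeans.labels_
    return best_labels, best_q
```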
- Fixed-Horizon method
- Time-bar method
- Volume-bar method
The Triple-Barrier Method involves holding a position until
- the unrealized profit target is achieved,
- the unrealized loss limit is reached, or
- the position has been held beyond a maximum number of bars.
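A simplified sketch of triple-barrier labelling on a price series (fixed return barriers instead of the volatility-scaled barriers typically used; names and thresholds are illustrative):

```python
import pandas as pd

def triple_barrier_labels(close, pt=0.02, sl=0.02, max_bars=20):
    # close: pd.Series of prices. Label +1 if the profit target is hit first,
    # -1 if the loss limit is hit first, 0 if the vertical (time) barrier is hit.
    labels = pd.Series(0, index=close.index, dtype=int)
    for i in range(len(close)):
        path = close.iloc[i:i + max_bars + 1] / close.iloc[i] - 1  # returns along the holding path
        hit_pt = path[path >= pt].index.min()   # first time the profit barrier is touched
        hit_sl = path[path <= -sl].index.min()  # first time the loss barrier is touched
        if pd.notna(hit_pt) and (pd.isna(hit_sl) or hit_pt <= hit_sl):
            labels.iloc[i] = 1
        elif pd.notna(hit_sl):
            labels.iloc[i] = -1
    return labels
```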
Trend-scanning method: the idea is to identify trends and let them run for as long and as far as they may persist, without setting any barriers.
Example of trend-scanning labels on a sine wave with Gaussian noise.
Trend-scanning with t-values, which show the confidence in the trend: 1 is high confidence going up and -1 is high confidence going down.
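A rough sketch of trend-scanning labels, in the spirit of the book's snippets: for each observation, fit an OLS trend over a range of forward horizons and keep the t-value of the slope with the largest absolute value; the sign of that t-value is the label.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def t_value_lin_reg(close):
    # t-value of the slope of a linear trend fitted to a price segment
    x = sm.add_constant(np.arange(close.shape[0]))
    return sm.OLS(close, x).fit().tvalues[1]

def trend_scanning_labels(close, spans=range(3, 20)):
    # for each start point, pick the forward window whose trend has the largest |t-value|
    out = pd.DataFrame(index=close.index, columns=['t_val', 'bin'], dtype=float)
    for i in range(len(close)):
        t_values = pd.Series(dtype=float)
        for span in spans:
            segment = close.iloc[i:i + span]
            if segment.shape[0] < span:
                continue  # not enough observations left for this horizon
            t_values.loc[span] = t_value_lin_reg(segment.values)
        if t_values.empty:
            continue
        best = t_values.loc[t_values.abs().idxmax()]
        out.loc[close.index[i], ['t_val', 'bin']] = best, np.sign(best)
    return out.dropna()
```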
"p-value does not measure the probability that neither the null nor the alternative hypothesis is true, or the significance of a result."
p-Values computed on a set of informative, redundant, and noisy explanatory variables. The informative explanatory variables do not necessarily receive the best (most significant) p-values.
The MDI algorithm deals with 3 out of 4 problems with p-values:
- MDI is computed on trees, which do not impose an algebraic specification or rely on stochastic or distributional characteristics of the residuals (e.g. y = b0 + b1*x_i + ε)
- While betas are estimated from a single sample, MDI relies on bootstrapping, so its variance can be reduced by increasing the number of trees in the random forest ensemble
- In MDI the goal is not to estimate a coefficient of a given algebraic equation (b_hat_0, b_hat_1), and hence not the probability of a null hypothesis
- However, MDI does not correct for the fact that the calculation is in-sample, as there is no cross-validation
MDI algorithm example
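MDI is what scikit-learn exposes as feature_importances_ on tree ensembles; a small illustration on a synthetic dataset (all parameters here are illustrative, not the book's exact set-up):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic dataset: a few informative features, some redundant ones, the rest noise
X, y = make_classification(n_samples=5000, n_features=20, n_informative=5,
                           n_redundant=5, random_state=0)
# max_features=1 limits masking effects between correlated features
clf = RandomForestClassifier(n_estimators=100, max_features=1, random_state=0).fit(X, y)

# mean decrease impurity (MDI), averaged over trees; importances sum to 1
mdi = pd.Series(clf.feature_importances_,
                index=[f'x_{i}' for i in range(X.shape[1])]).sort_values(ascending=False)
print(mdi.head())
```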
Figure 6.4 shows that ONC correctly recognizes that there are six relevant clusters (one cluster for each informative feature, plus one cluster of noise features), and it assigns the redundant features to the cluster that contains the informative feature from which the redundant features were derived. Given the low correlation across clusters, there is no need to replace the features with their residuals.
Next, apply the clustered MDI method to the clustered data:
Figure 6.5: Clustered MDI
Clustered MDI works better than non-clustered MDI. Finally, apply the clustered MDA method to this data:
Figure 6.6: Clustered MDA
Conclusion: C_5, which is associated with the noisy features, is not important, and all the other clusters have similar importance.
Convex portfolio optimization can be used to calculate the minimum variance portfolio and the maximum Sharpe ratio portfolio.
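Under the standard unconstrained, fully invested convex set-up, both solutions have a closed form; a minimal sketch:

```python
import numpy as np

def min_var_portfolio(cov):
    # w proportional to C^{-1} 1, normalized so the weights sum to one
    w = np.linalg.inv(cov) @ np.ones(cov.shape[0])
    return w / w.sum()

def max_sharpe_portfolio(cov, mu):
    # w proportional to C^{-1} mu, normalized so the weights sum to one
    w = np.linalg.inv(cov) @ mu
    return w / w.sum()
```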
Definition, condition number: the absolute value of the ratio between the maximal and minimal (by moduli) eigenvalues of a matrix. The condition number quantifies the numerical instability caused by the covariance structure. Definition, trace: trace(A) = sum(diag(A)), the sum of the diagonal elements.
Highly correlated time series imply a high condition number of the correlation matrix. The correlation matrix C is numerically stable only when the correlations are not too strong, i.e., when its condition number is low.
Hierarchical risk parity (HRP) outperforms Markowitz in out-of-sample Monte Carlo experiments, but is sub-optimal in-sample.
Code-snippet 7.1 illustrates the signal-induced instability of the correlation matrix.
>>> corr0 = mc.formBlockMatrix(2, 2, .5)
>>> corr0
array([[1. , 0.5, 0. , 0. ],
[0.5, 1. , 0. , 0. ],
[0. , 0. , 1. , 0.5],
[0. , 0. , 0.5, 1. ]])
>>> eVal, eVec = np.linalg.eigh(corr0)
>>> print(max(eVal)/min(eVal))
3.0
Figure 7.1: Heatmap of a block-diagonal correlation matrix
Code snippet 7.2 creates the same block-diagonal matrix, but with one dominant block. However, the condition number is the same.
>>> from scipy.linalg import block_diag
>>> corr0 = block_diag(mc.formBlockMatrix(1,2, .5))
>>> corr1 = mc.formBlockMatrix(1,2, .0)
>>> corr0 = block_diag(corr0, corr1)
>>> corr0
array([[1. , 0.5, 0. , 0. ],
[0.5, 1. , 0. , 0. ],
[0. , 0. , 1. , 0. ],
[0. , 0. , 0. , 1. ]])
>>> eVal, eVec = np.linalg.eigh(corr0)
>>> matrix_condition_number = max(eVal)/min(eVal)
>>> print(matrix_condition_number)
3.0
This demonstrates that bringing down the intrablock correlation in only one of the two blocks does not reduce the condition number. It shows that the instability in Markowitz's solution can be traced back to the dominant blocks.
Figure 7.2: Heatmap of a dominant block-diagonal correlation matrix
NCO provides a strategy for addressing the effect of Markowitz's curse on an existing mean-variance allocation method.
- Step 1: cluster the correlation matrix
- Step 2: compute the optimal intracluster allocations, using the denoised covariance matrix
- Step 3: compute the optimal intercluster allocations, using the reduced covariance matrix, which is close to a diagonal matrix, so the optimization problem is close to the ideal case
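A compact sketch of those steps, assuming the clustering has already been done (cluster_labels maps each cluster id to its member columns) and reusing the min_var_portfolio helper sketched above as the convex optimizer:

```python
import pandas as pd

def nco_weights(cov, cluster_labels):
    # cov: covariance as a pd.DataFrame; cluster_labels: {cluster_id: [column names]}
    # step 2: optimal intracluster allocations (one minimum variance portfolio per cluster)
    w_intra = pd.DataFrame(0., index=cov.index, columns=list(cluster_labels.keys()))
    for c, members in cluster_labels.items():
        w_intra.loc[members, c] = min_var_portfolio(cov.loc[members, members].values)
    # step 3: reduced covariance matrix across clusters, then intercluster allocation
    cov_reduced = w_intra.T @ cov @ w_intra
    w_inter = pd.Series(min_var_portfolio(cov_reduced.values), index=cov_reduced.index)
    # final weights: intracluster weights scaled by their cluster's intercluster weight
    return (w_intra * w_inter).sum(axis=1)
```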
The Markowitz case when ρ = 0.
Backtesting is a historical simulation of how an investment strategy would have performed in the past. Backtesting suffers from selection bias under multiple testing, as researchers run millions of tests on historical data and present only the best ones (which are overfit). This chapter studies how to measure the effect of this selection bias.
Sharpe ratio = mu/sigma
A researcher may run many historical simulations and report only the best one (the maximum Sharpe ratio). The distribution of the maximum Sharpe ratio is not the same as the distribution of the Sharpe ratio of a single trial; hence selection bias under multiple testing (SBuMT).
A Monte Carlo experiment shows that the expected maximum Sharpe ratio grows (E[max(SR)] = 3.26 in the experiment) even when the true expected Sharpe ratio is 0 (E[SR] = 0). So an investment strategy can look promising even when no genuinely good strategy exists.
The distribution of the maximum Sharpe ratio can be estimated either from resampling or from Monte Carlo simulations.
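A tiny Monte Carlo illustration of the effect (the exact E[max SR] depends on the number of trials and the variance across trials; these parameters are purely illustrative):

```python
import numpy as np

def expected_max_sharpe(n_trials=1000, n_experiments=10000, seed=0):
    # each experiment draws the estimated Sharpe ratios of n_trials strategies with true SR = 0
    rng = np.random.default_rng(seed)
    sr = rng.standard_normal((n_experiments, n_trials))
    return sr.max(axis=1).mean()  # E[max SR] across experiments

print(expected_max_sharpe())  # well above 0, even though no strategy has genuine skill
```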