jmrichardson/tuneta

selection bias under multiple testing (SBuMT)

Reed-Schimmel opened this issue · 4 comments

If I understand correctly, feature selection is out of scope, as there are many feature-selection tools available that are easily applied as a subsequent step. TuneTA does have a simple prune (feature selection) capability, which removes indicators whose correlation to one another exceeds a maximum threshold.
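A minimal sketch of that kind of correlation-based prune (the function name and default threshold here are illustrative, not TuneTA's actual API):

```python
import numpy as np
import pandas as pd

def prune_correlated(features: pd.DataFrame, max_corr: float = 0.7) -> pd.DataFrame:
    """Drop features whose absolute pairwise correlation exceeds max_corr.

    Greedy: walks the upper triangle of the correlation matrix and drops
    the later column of each too-correlated pair.
    """
    corr = features.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > max_corr).any()]
    return features.drop(columns=to_drop)
```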

I'll try to clarify. Given a large enough number of trials, the target metric is almost guaranteed to be overfit to the data. (This is just my best understanding; I could be wrong.) So when searching for the best correlation value, does TuneTA deflate the score to adjust for multiple testing?

I'm getting this from books that discuss over-inflated Sharpe ratios caused by mass backtesting. I don't know whether it applies 100% to correlation metrics, but I think it does.
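The inflation being described is easy to demonstrate with a toy simulation (not TuneTA code): with purely random features, the best observed correlation to the target keeps climbing as more trials are run, even though every true correlation is zero.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 250  # number of observations per trial
target = rng.normal(size=n)

# |correlation| of 1000 independent random features against the target
corrs = [abs(np.corrcoef(rng.normal(size=n), target)[0, 1]) for _ in range(1000)]

best_after_10 = max(corrs[:10])
best_after_1000 = max(corrs)
# The running maximum can only grow with more trials, so the "winner"
# inflates with the number of trials -- the selection bias in question.
```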

Thank you for the clarification. Agreed: if you simply choose the best parameter set from a number of trials, you will almost certainly overfit. However, TuneTA selects the best parameter set with the following methodology:

  1. Each trial is assigned to its nearest-neighbor cluster based on its distance correlation to the target. The optimal number of clusters is determined using the elbow method. The cluster with the highest average correlation is then selected with respect to its membership; in other words, a weighted score selects the cluster with the highest correlation but also the most trials.
  2. After the best correlation cluster is selected, the parameters of the trials within that cluster are themselves clustered. Again, the best cluster of indicator parameters is selected with respect to its membership.
  3. Finally, the trial closest to the center of the best parameter cluster is selected.

Step 1 tries to ensure that many different parameter sets achieve the correlation found in that (highest-correlation) cluster.
Step 2 clusters the parameters used in the cluster from step 1. This helps select a parameter set that falls within the same region that achieved the high correlation.
Step 3 selects the trial at the center of the cluster from step 2.

So hopefully you can see that adding more trials actually helps create clusters in close proximity, thereby avoiding lucky parameter guesses. The trade-off is that the highest-correlation cluster often has lower correlation than many of the original trials. The goal is to choose robust parameters: given minor changes to the parameters, the correlation should stay roughly the same as the rest of the cluster.
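The three steps above can be sketched roughly as follows. This is a simplified stand-in, not TuneTA's actual implementation: it uses plain scikit-learn KMeans with fixed cluster counts instead of the elbow method, and a naive mean-times-membership weighting.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_robust_trial(params, scores, k_corr=3, k_param=3, seed=0):
    """Return the index of the trial at the center of the best parameter cluster.

    params: (n_trials, n_params) array of trial parameter sets.
    scores: (n_trials,) array of each trial's correlation to the target.
    """
    params = np.asarray(params, dtype=float)
    scores = np.asarray(scores, dtype=float)

    # Step 1: cluster trials by score; pick the cluster with the best
    # membership-weighted score (high correlation AND many trials).
    labels = KMeans(n_clusters=k_corr, n_init=10, random_state=seed).fit_predict(
        scores.reshape(-1, 1)
    )
    best_c = max(range(k_corr),
                 key=lambda c: scores[labels == c].mean() * (labels == c).sum())
    idx = np.flatnonzero(labels == best_c)

    # Step 2: cluster the parameter sets of those trials; keep the
    # cluster with the most members.
    k = min(k_param, len(idx))
    plabels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(params[idx])
    best_p = max(range(k), key=lambda c: (plabels == c).sum())

    # Step 3: return the member trial nearest the cluster center.
    members = idx[plabels == best_p]
    center = params[members].mean(axis=0)
    return members[np.argmin(np.linalg.norm(params[members] - center, axis=1))]
```

With synthetic trials this picks a trial from a broad plateau of similarly scoring parameter sets rather than a single lucky high-scoring outlier, which is the robustness property described above.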

BTW, I found this link of someone who used tuneta in their experiments that you may find useful:

https://www.finlab.tw/python-machine-learning-bitcoin-feature-engineering/

I don't know this person, and you may need to translate the article to English, but it gives a good review.