CORRELATION PVALUE has performance issue
fsaad commented
- Running
  ESTIMATE CORRELATION FROM PAIRWISE VARIABLES OF p
  takes 3.4 seconds on a population with 100 vars.
- Running
  ESTIMATE CORRELATION PVALUE FROM PAIRWISE VARIABLES OF p
  has been running for over 10 minutes on the same table.
The cause is that the p-value of the t statistic is estimated by Monte Carlo sampling, and a new RNG is created on every call, so the sampling cost is paid again for each pair of variables.
https://github.com/probcomp/bayeslite/blob/master/src/stats.py#L123-L136
- Replacing our local version of t_cdf with scipy.stats.pearsonr brings the second query (CORRELATION PVALUE) down to 12.4 seconds.
Consider caching the MC samples.
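
A rough sketch of what caching the MC samples could look like, using only the standard library. This is not the code in src/stats.py: the names `sorted_t_samples`, `t_cdf_mc`, and `correlation_pvalue`, the sample count, and the fixed seed are all hypothetical choices for illustration. The idea is to draw and sort the Student-t samples once per degrees of freedom, then answer each CDF query with a binary search.

```python
import bisect
import functools
import math
import random


@functools.lru_cache(maxsize=None)
def sorted_t_samples(df, n_samples, seed):
    """Draw Student-t samples once per (df, n_samples, seed) and sort them.

    Hypothetical helper: the cache means the RNG is built and the samples
    are drawn only on the first call for a given degrees of freedom.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        z = rng.gauss(0.0, 1.0)
        # Gamma(shape=df/2, scale=2) is a chi-square with df degrees of freedom.
        chi2 = rng.gammavariate(df / 2.0, 2.0)
        samples.append(z / math.sqrt(chi2 / df))
    samples.sort()
    return tuple(samples)


def t_cdf_mc(x, df, n_samples=10000, seed=0):
    """Monte Carlo estimate of P(T <= x) from the cached, sorted samples."""
    samples = sorted_t_samples(df, n_samples, seed)
    return bisect.bisect_right(samples, x) / len(samples)


def correlation_pvalue(r, n):
    """Two-sided p-value for a Pearson correlation r computed over n rows."""
    if n <= 2:
        raise ValueError("need more than two rows")
    if abs(r) >= 1.0:
        return 0.0
    df = n - 2
    t = r * math.sqrt(df / (1.0 - r * r))
    # Clamp because the MC estimate of the CDF at 0 can fall slightly below 0.5.
    return min(1.0, 2.0 * (1.0 - t_cdf_mc(abs(t), df)))
```

In a pairwise query over one table every pair has the same number of rows, so all calls share the same degrees of freedom and only the first pays for sampling; the rest cost one binary search each. A fixed seed makes the estimates reproducible, which may or may not be what the real implementation wants.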