probcomp/bayeslite

CORRELATION PVALUE has performance issue

Closed this issue · 0 comments

fsaad commented
  1. Running ESTIMATE CORRELATION FROM PAIRWISE VARIABLES OF p takes 3.4 seconds on a population with 100 variables.
  2. Running ESTIMATE CORRELATION PVALUE FROM PAIRWISE VARIABLES OF p has been running for over 10 minutes on the same table and has not finished.

The cause is that the p-value of the t statistic is estimated by Monte Carlo sampling, and a new RNG is created on every call, so each pairwise p-value pays the full sampling cost again.
https://github.com/probcomp/bayeslite/blob/master/src/stats.py#L123-L136

  • Replacing our local version of t_cdf with scipy.stats.pearsonr results in query 2 running in 12.4 seconds.
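
For reference, a minimal sketch of what the scipy route computes. This is not the bayeslite code: the data are synthetic and the variable names are made up for illustration. scipy.stats.pearsonr returns the correlation together with an analytic two-sided p-value, which matches the p-value obtained from the t statistic with n - 2 degrees of freedom, so no sampling is needed.

```python
import numpy as np
from scipy import stats

# Synthetic data, purely for illustration (not from the issue).
rng = np.random.RandomState(0)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(size=50)

# pearsonr returns the correlation coefficient and a two-sided p-value.
r, pvalue = stats.pearsonr(x, y)

# The same p-value via the t statistic with n - 2 degrees of freedom,
# using the analytic survival function instead of Monte Carlo samples.
n = len(x)
t = r * np.sqrt((n - 2) / (1.0 - r * r))
pvalue_t = 2.0 * stats.t.sf(abs(t), n - 2)

print(r, pvalue, pvalue_t)
```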

Consider caching the Monte Carlo samples so they are drawn once and reused across calls instead of being regenerated each time.
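
A rough sketch of what caching could look like, using only the standard library. The names cached_t_samples, mc_t_cdf and correlation_pvalue are hypothetical and this is not how src/stats.py is written. It assumes the samples can be keyed on (degrees of freedom, sample count, seed): they are drawn and sorted once per key, and every later CDF lookup is a binary search.

```python
import bisect
import functools
import math
import random


@functools.lru_cache(maxsize=None)
def cached_t_samples(df, nsamples=10000, seed=0):
    """Draw and sort Student-t samples once per (df, nsamples, seed)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(nsamples):
        z = rng.gauss(0.0, 1.0)
        # Chi-square with df degrees of freedom is Gamma(df/2, scale=2).
        chi2 = rng.gammavariate(df / 2.0, 2.0)
        samples.append(z / math.sqrt(chi2 / df))
    samples.sort()
    return tuple(samples)


def mc_t_cdf(x, df, nsamples=10000, seed=0):
    """Monte Carlo estimate of the t CDF at x, reusing cached samples."""
    samples = cached_t_samples(df, nsamples, seed)
    return bisect.bisect_right(samples, x) / len(samples)


def correlation_pvalue(r, n):
    """Two-sided p-value for a Pearson correlation r over n rows."""
    df = n - 2
    if abs(r) >= 1.0:
        return 0.0
    t = abs(r) * math.sqrt(df / (1.0 - r * r))
    return 2.0 * (1.0 - mc_t_cdf(t, df))


# All pairs in one population share n, so the samples are drawn only once.
p_weak = correlation_pvalue(0.0, 12)
p_strong = correlation_pvalue(0.99, 12)
```

With 100 variables there are thousands of pairs but typically one value of n, so the sampling cost is paid once instead of once per pair. Fixing the seed also makes repeated queries return the same estimate.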