scipy/scipy

stats.truncnorm.rvs 100x slower in v1.4.x than v1.3.3

martinforsythe opened this issue · 10 comments

The 1.4.x releases of scipy seem to have resulted in a greater-than-100x slowdown of stats.truncnorm.rvs sampling relative to the 1.3.x versions:

Reproducing code example:

With scipy==1.3.3

$ pip install scipy==1.3.3
$ python -m timeit "from scipy.stats import truncnorm; truncnorm.rvs(-2, 2, size=(50, 50, 50))"
1 loop, best of 5: 6.45 msec per loop

With scipy==1.4.0

$ pip install scipy==1.4.0
$ python -m timeit "from scipy.stats import truncnorm; truncnorm.rvs(-2, 2, size=(50, 50, 50))"
1 loop, best of 5: 885 msec per loop

A similar slowdown has also been observed with scipy==1.4.1.

Scipy/Numpy/Python version information:

To reproduce the slow sampling behavior:

1.4.0 1.16.4 sys.version_info(major=3, minor=7, micro=3, releaselevel='final', serial=0)

To reproduce the fast sampling behavior:

1.3.3 1.16.4 sys.version_info(major=3, minor=7, micro=3, releaselevel='final', serial=0)

Other details of my environment:
- python=3.7.3 (compiled with Clang 8.0.0)
- platform=darwin_x86_64
- cpu=Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz

Probably due to gh-10104. @pvanmulbregt is this expected?

The new version uses np.vectorize and loses standard vectorization AFAICS.

It might be better to split the distribution into a standard use-case version (that prefers speed over good tail behavior) and a slow, tail-accurate use-case version, if the slowdown is widespread across methods.
#2477 (comment)

Apologies if this question should be obvious from the code-base, but are there currently any unit tests that check the speed of random sampling? I was a little incredulous at first when a colleague brought this issue to my attention because a 100x slowdown is pretty substantial.

I'm not aware of any unit tests that test for speed. Scipy does have an airspeed velocity (asv) benchmark framework that is run on a semi-regular basis, see https://pv.github.io/scipy-bench/, but it's not comprehensive by any means. A benchmark for any speed regression could be added there, but I'm not sure what the formal process is for noticing negative changes. This issue is certainly critical for those who want to do calculations with large amounts of random numbers.
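For reference, such a benchmark might look roughly like the sketch below (a minimal sketch assuming asv's conventions; the class and method names are illustrative, not an existing scipy benchmark):

import numpy as np
from scipy.stats import truncnorm

class TruncnormRVS:
    # Hypothetical asv-style benchmark: asv times methods whose names
    # start with "time_", calling setup() before each measurement.
    def setup(self):
        np.random.seed(12345)  # fixed seed so every run times the same workload

    def time_rvs_moderate_bounds(self):
        truncnorm.rvs(-2, 2, size=(50, 50, 50))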

IIRC the issue was that truncnorm was broken and was therefore rewritten completely. That's a case where you review for correctness rather than speed (at least, that's what I did). Would be good to look into whether the vectorized approach can still be used - I suspect np.vectorize was used for a good reason.

For my specific use case I'm interested in large numbers of random variates from truncnorm with the tails chopped off, i.e. tail behaviour isn't important. Thus, a two-state behaviour, as suggested by @josef-pkt, would be useful.

Some comments:
- I believe that I've slowed down truncnorm twice in the past year, though each time it was a known side-effect, rather than the goal, of the change.
- gh-9900 fixed thread-safety issues arising from distributions setting member variables inside _argcheck().
- gh-10104 was a rewrite to fix some asymmetry and overflow issues.
- Both impact the speed difference observed here.
truncnorm needs to determine the probability in the truncated tails. Prior to gh-9900, truncnorm calculated this once and stored it in the class. It used scipy.special.ndtr to do that.
After gh-9900, the result is not stored but computed each time. It still uses scipy.special.ndtr if the tail endpoints are in the range [-40, 40], otherwise it does a calculation using logs.
For the ppfs, the calculation may be able to use scipy.special.ndtri as before, or it may have to do a root solving operation.
One other slowdown is from using np.vectorize. IIRC, this turned out to be needed because during a call to _ppf(a, b, x), the arguments a, b and x may all still be arrays. I.e. it is not the case that dist.ppf(a, b, q) handles the arrays and dist._ppf(a, b, q) computes the scalar inverse ppf; instead, _ppf() may have to do both. It is not arrays of q that are the issue, it is arrays of a and/or b.
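To illustrate why that hurts (a sketch only; scalar_ppf below is a hypothetical stand-in for _truncnorm_ppf restricted to the moderate-tail regime, not the actual scipy code):

import numpy as np
from scipy.special import ndtr, ndtri

def scalar_ppf(q, a, b):
    # Stand-in for _truncnorm_ppf in the easy regime: the tail
    # probabilities are recomputed on every single call.
    phi_a, phi_b = ndtr(a), ndtr(b)
    return ndtri(phi_a + q * (phi_b - phi_a))

# np.vectorize is essentially a Python-level loop: the wrapped function is
# invoked once per element of the broadcast inputs, so a (50, 50, 50) draw
# means 125000 Python calls, each redoing ndtr(a) and ndtr(b).
vectorized_ppf = np.vectorize(scalar_ppf)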

In the case above, for a,b = (-2, 2), the new code should be using the same underlying scipy.special calls, but it has to compute the probability in the tails for every value of q.

Freezing the distribution doesn't help either, as that doesn't allow for precomputation of the tail probabilities; it merely remembers the a, b parameters.
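For users who, like the reporters here, only need moderate bounds, one possible user-side workaround (not part of scipy, and only valid where ndtr/ndtri keep full precision; far-tail accuracy is exactly what the rewrite addresses) is to precompute the tail probabilities once and invert directly:

import numpy as np
from scipy.special import ndtr, ndtri

def fast_truncnorm_rvs(a, b, size, seed=None):
    # Inverse-CDF sampling with Phi(a), Phi(b) computed once per call.
    # Suitable only for moderate a, b (not far out in the tails).
    rng = np.random.default_rng(seed)
    phi_a, phi_b = ndtr(a), ndtr(b)
    u = rng.uniform(phi_a, phi_b, size=size)  # uniform on (Phi(a), Phi(b))
    return ndtri(u)                           # map back through the normal ppf

samples = fast_truncnorm_rvs(-2, 2, size=(50, 50, 50))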

To be a bit more explicit on the need for the np.vectorize:
The call stats.truncnorm.rvs([-10, 10], [-5, 11], size=(5, 2)) generates a call truncnorm._ppf(U, a, b), where U has size=(5, 2) and contains random samples from Uniform(0, 1). E.g.

U= [[0.19151945 0.62210877]
    [0.43772774 0.78535858]
    [0.77997581 0.27259261]
    [0.27646426 0.80187218]
    [0.95813935 0.87593263]]
a = [-10 10]
b = [-5 11]

and is expected to return a (5, 2) array such as

[[-5.30971183 10.09591425]
 [-5.15699038 10.15125262]
 [-5.04769914 10.03146819]
 [-5.24240404 10.15906068]
 [-5.00823854 10.20459498]]

truncnorm._ppf calls _truncnorm_ppf to do the work. The np.vectorize wraps _truncnorm_ppf and turns this one call with arrays into 5*2=10 calls to _truncnorm_ppf(u_, a_, b_), passing in scalars.
Is there a standard way to turn this into just 2 calls: one call for a_, b_ = -10, -5 with the first column of U, and one for a_, b_ = 10, 11 with the 2nd column of U? E.g.

_truncnorm_ppf([0.19151945, 0.43772774, 0.77997581, 0.27646426, 0.95813935], a=-10, b=-5)
_truncnorm_ppf([0.62210877, 0.78535858, 0.27259261, 0.80187218, 0.87593263], a=10, b=11)

That would enable computing the tail probabilities just once for each distinct pair of shape values.
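One possible way to get there (a sketch only; ppf_by_shape_group and vector_ppf are hypothetical helpers, not existing scipy internals) is to broadcast the arguments and loop over the unique (a, b) pairs instead of over every element:

import numpy as np

def ppf_by_shape_group(q, a, b, vector_ppf):
    # Call vector_ppf(q_subset, a_, b_) once per distinct (a_, b_) pair,
    # where vector_ppf accepts a 1-D array of q for scalar a_, b_, so the
    # tail probabilities are computed once per pair rather than per element.
    q, a, b = np.broadcast_arrays(q, a, b)
    out = np.empty(q.shape, dtype=float)
    pairs = np.column_stack([a.ravel(), b.ravel()])
    uniq, inverse = np.unique(pairs, axis=0, return_inverse=True)
    inverse = inverse.reshape(q.shape)
    for i, (a_, b_) in enumerate(uniq):
        mask = inverse == i
        out[mask] = vector_ppf(q[mask], a_, b_)
    return out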

Added links to gh-2069 and gh-5526, which have relevant discussions.

I'm still getting crazy slowdowns using the new truncnorm from v1.5.0rc1. A sampler that uses truncnorm in a joblib.Parallel process takes over 50 min to produce 2000 samples, whereas in v1.2.1 it took just over 1 min. @pvanmulbregt