astrofrog/fast-histogram

GIL release

robbmcleod opened this issue · 8 comments

Nominally in C code that takes some time one can release the GIL with:

Py_BEGIN_ALLOW_THREADS
<Non-Python calling code>
Py_END_ALLOW_THREADS

It looks like it would be pretty trivial to enclose your for loops with these macros, is this something you're interested in?

@robbmcleod - good idea, do you want to give it a try and see if it leads to any improvements in performance?

Initially it doesn't help when I tried multi-threading with concurrent.futures as too much time is spent in the Python wrapper. It won't hurt anything though, and if the Python overhead can be reduced it will help anyone doing asycnhronous work (like me with a live-view in pyqtgraph).

Those np.ascontiguousarray calls are really quite expensive. What I can try then is only conditionally applying those calls by checking x.flags.c_continguous first. The vast majority of NumPy arrays are continguous. Similarly we don't want to cast. NumPy assumes you want a copy in those calls and you're not modifying the input array, so if it's not needed, it should be skipped.

Also we could explicitly create npy_float32 and npy_float64 branches. The easiest way to do this would be to have two copies of each function. I suspect that's not too much maintenance overhead? And then map everything from npy_int32 to npy_float64 and all the smaller dtypes to npy_float32.

Releasing the GIL isn't useful as the histogram calculation is too fast for context switches to be worthwhile. My experience with NumExpr is that even with p-threads you're looking at about 2-5 us per-thread to pass the thread barrier.

Bypassing the Python checks to call the C-API directly is the best speed-up I found:

Python call histogram average: 7.928e-06 +/- 4.195e-08 s
C-API call histogram f32 average: 5.724e-06 +/- 2.548e-08 s
C-API call histogram f64 average: 5.795e-06 +/- 4.494e-08 s

The only other thing that strikes me as a possible optimization would be SIMD vectorizing the loops. Neither MSVC or GCC are vectorizing the loops as they are now (which is sort of obvious both from the uncompelling float32 benchmarks and the compiler messages). The calls to ix = (int)((x[i] - xmin) * normx * fnx); could be vectorized, whereas the accumulations would have to be unrolled manually for the SIMD vector length. I don't know that it's worth the bother to implement.

Test code:

import numpy as np
import fast_histogram
from time import perf_counter as pc

tries = 50
stackDepth = 10000
nBins = 256
imSize = 4096
data = np.random.normal(size=(imSize**2)).reshape(imSize, imSize)
data32 = data.astype('float32')
minVal = data.min();  maxVal = data.max()
print(data.flags)

times_direct = np.zeros(tries)
times_direct32 = np.zeros(tries)
times_single = np.zeros(tries)
times_overhead = np.zeros(tries)
# Call C-API directly:
for I in range(tries):
    t0 = pc()
    for chunk in range(stackDepth):
        _ = fast_histogram._histogram_core._histogram1d(data, nBins, minVal, maxVal)
    times_direct[I] = pc() - t0

# Call C-API 32-bit version directly:
for I in range(tries):
    t0 = pc()
    for chunk in range(stackDepth):
        _ = fast_histogram._histogram_core._histogram1d_f32(data32, nBins, minVal, maxVal)
    times_direct32[I] = pc() - t0    

# Call Python API:
for I in range(tries):
    t0 = pc()
    for chunk in range(stackDepth):
        _ = fast_histogram.histogram1d(data, nBins, (minVal, maxVal))
    times_single[I] = pc() - t0

times_single /= stackDepth
times_overhead /= stackDepth
times_direct /= stackDepth
times_direct32 /= stackDepth

print( f'Python call histogram average: {times_single.mean():.3e} +/- {times_single.std():.3e} s' )
print( f'C-API call histogram f32 average: {times_direct32.mean():.3e} +/- {times_direct32.std():.3e} s')
print( f'C-API call histogram f64 average: {times_direct.mean():.3e} +/- {times_direct.std():.3e} s')

Changes can be seen at my fork:

https://github.com/robbmcleod/fast-histogram

Shall we close this?

@robbmcleod - thanks for investigating this! I'd like to keep this open as I want to take some time to over the changes and make sure there's nothing we want to include.

@robbmcleod - one of the changes here that does make a big difference for large arrays is the 32-bit version you implemented for 32-bit arrays (to avoid the increased memory usage), so I was wondering if you would be happy to open a pull request to add those changes in? (if you don't have time I can pull in the commits you made).

I'll take a look when I have time.

I'm going to close this since 32-bit floating point arrays should be faster now (#23)