IntelPython/mkl_fft

fftn on multiple arrays faster using python loop

mvandeput opened this issue · 2 comments

Running mkl_fft.fftn on an array of shape (70, 196, 150, 24) over only the last three axes is more than twice as slow as running the transform on each of the 70 sub-arrays individually in a Python loop.

This is unexpected: one would assume a loop running inside MKL should be faster than a Python loop.

Simple example:

import numpy as np
import mkl_fft
import time

shape = (70, 196, 150, 24)
array = np.random.random(shape) + 1j * np.random.random(shape)

def transform(array):
    result = np.empty(array.shape, dtype=complex)
    for ii, arr in enumerate(array):
        result[ii] = mkl_fft.fftn(arr)
    return result

t0 = time.time()
a = mkl_fft.fftn(array, axes=(1, 2, 3))
t1 = time.time()
b = transform(array)
t2 = time.time()

print('fftn on full array: {:.0f} ms'.format(1000*(t1 - t0)))
print('loop of fftn on subarray: {:.0f} ms'.format(1000*(t2 - t1)))
print(np.allclose(a, b))

On my machine this returns:

fftn on full array: 1359 ms
loop of fftn on subarray: 619 ms
True

I also verified the timings with timeit instead of time.time.
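The equivalence of the two approaches (though not the timing gap) can be checked without mkl_fft installed; here is a small sketch using np.fft.fftn as a stand-in, since a transform over axes (1, 2, 3) is independent along axis 0:

```python
import numpy as np

shape = (4, 8, 6, 5)
x = np.random.random(shape) + 1j * np.random.random(shape)

# Transform over the last three axes in one call...
full = np.fft.fftn(x, axes=(1, 2, 3))
# ...and as a loop of full fftn calls, one per sub-array along axis 0.
looped = np.stack([np.fft.fftn(sub) for sub in x])

print(np.allclose(full, looped))  # → True
```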

@mvandeput Thank you for the report. This is a shortcoming of the mkl_fft implementation of fftn: when the fftn call is not a transform over the full array, it naively iterates over 1D transforms.

One can write a function that iteratively calls fftn on the sub-slices complementary to the given axes.

import numpy as np

def flat_to_multi(ind, shape):
    """Maps flat iter index into multi_index"""
    nd = len(shape)
    m_ind = [-1] * nd
    j = ind
    for i in range(nd):
        si = shape[nd-1-i]
        q = j // si
        r = j - si * q
        m_ind[nd-1-i] = r
        j = q
    return m_ind

def iter_complementary(x, axes, func, kwargs, result):
    if axes is None:
        return func(x, **kwargs)
    x_shape = x.shape
    nd = x.ndim
    r = list(range(nd))
    sl = [slice(None, None, None)] * nd
    if not isinstance(axes, tuple):
        axes = (axes,)
    for ai in axes:
        r[ai] = None
    size = 1
    sub_shape = []
    dual_ind = []
    for ri in r:
        if ri is not None:
            size *= x_shape[ri]
            sub_shape.append(x_shape[ri])
            dual_ind.append(ri)

    for ind in range(size):
        m_ind = flat_to_multi(ind, sub_shape)
        for k1, k2 in zip(dual_ind, m_ind):
            sl[k1] = k2
        # N.B.: A copy is being made here, which can be avoided
        # mkl_fft could write directly into memory of result
        result[tuple(sl)] = func(x[tuple(sl)], **kwargs)

    return result
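As a quick sanity check (my addition, not part of the original comment), flat_to_multi computes the same C-order multi-index that np.unravel_index would return:

```python
import numpy as np

def flat_to_multi(ind, shape):
    """Maps flat iter index into multi_index (C order)."""
    nd = len(shape)
    m_ind = [-1] * nd
    j = ind
    for i in range(nd):
        si = shape[nd - 1 - i]
        q = j // si
        m_ind[nd - 1 - i] = j - si * q
        j = q
    return m_ind

shape = (4, 5, 6)
# Verify against numpy's built-in for every flat index.
for ind in range(int(np.prod(shape))):
    assert tuple(flat_to_multi(ind, shape)) == np.unravel_index(ind, shape)
print("flat_to_multi matches np.unravel_index")
```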

Then one can speed up the transform:

shape = (70, 196, 150, 24)
array = np.random.random(shape) + 1j * np.random.random(shape)

t0 = time.time()
res_direct = mkl_fft.fftn(array, axes=(1, 2, 3))
t1 = time.time()

t2 = time.time()
tmp = np.empty(array.shape, dtype='D')
res_new = iter_complementary(array, (1, 2, 3), mkl_fft.fftn, dict(), tmp)
t3 = time.time()

print((t1-t0, t3-t2, np.allclose(res_new, res_direct)))
# outputs (1.9125349521636963, 0.6240551471710205, True)

This could be done more efficiently in C: one can allocate the result once and have MKL write directly into the preallocated array, avoiding the superfluous copies.

It would be really nice if this logic could be added to irfftn_numpy as well...
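For the irfftn case there is one extra piece of bookkeeping the wrapper would need: the output is real and the last transformed axis changes length (2*(n-1) for an input of length n), so the result must be preallocated with the transformed shape rather than x.shape. A sketch of the idea, using np.fft.irfftn as a stand-in for mkl_fft.irfftn_numpy:

```python
import numpy as np

shape = (8, 16, 12, 10)   # complex half-spectrum input
x = np.random.random(shape) + 1j * np.random.random(shape)

# Output of irfftn over axes (1, 2, 3): real dtype, and the last
# transformed axis grows from n to 2*(n - 1).
out_shape = (shape[0], shape[1], shape[2], 2 * (shape[3] - 1))
result = np.empty(out_shape, dtype='d')

# Loop over the complementary axis 0, transforming each sub-array.
for ii in range(shape[0]):
    result[ii] = np.fft.irfftn(x[ii])

ref = np.fft.irfftn(x, axes=(1, 2, 3))
print(np.allclose(result, ref))  # → True
```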