GridTools/gt4py

over-sized allocation of arrays for dace backend

xyuan opened this issue · 1 comments

The issue shows up in the C++ code generated by the backend, for example:

__state->__0_w1 = new double DACE_ALIGN(64)[((((__J * __K) * (__I - 1)) + (__K * (__J - 1))) + __K)];
__state->__0_g_rat = new double DACE_ALIGN(64)[((((__J * __K) * (__I - 1)) + (__K * (__J - 1))) + __K)];
__state->__0_bb = new double DACE_ALIGN(64)[((((__J * __K) * (__I - 1)) + (__K * (__J - 1))) + __K)];
__state->__0_dd = new double DACE_ALIGN(64)[((((__J * __K) * (__I - 1)) + (__K * (__J - 1))) + __K)];
__state->__0_bet = new double DACE_ALIGN(64)[((((__J * __K) * (__I - 1)) + (__K * (__J - 1))) + __K)];
__state->__0_pp = new double DACE_ALIGN(64)[((((__J * __K) * (__I - 1)) + (__K * (__J - 1))) + __K)];
__state->__0_gam = new double DACE_ALIGN(64)[((((__J * __K) * (__I - 1)) + (__K * (__J - 1))) + __K)];
__state->__0_aa = new double DACE_ALIGN(64)[((((__J * __K) * (__I - 1)) + (__K * (__J - 1))) + __K)];
__state->__0_p1 = new double DACE_ALIGN(64)[((((__J * __K) * (__I - 1)) + (__K * (__J - 1))) + __K)];

This bug allocates a larger chunk of memory than needed for an array of shape (__I, __J, __K), where __I and __J are the horizontal dimensions and __K is the vertical dimension. When we run problems with a large grid size, this leads to very poor performance.

The root cause is in the initialization of the DaCe data array, which takes shape, strides, and total_size as arguments. If strides and shape are given but no total_size is provided, the code calculates the total size of the array from the strides, as follows:

    if strides is not None and shape is not None and total_size is None:
        # Compute the minimal total_size that could be used with strides and shape
        self.total_size = sum(((shp - 1) * s for shp, s in zip(shape, strides))) + 1
    else:
        self.total_size = total_size or _prod(shape)
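
To make the branch above easier to reason about, here is a minimal standalone sketch of the same logic in plain Python. It does not import DaCe. The helper names `stride_based_size` and `resolve_total_size` and the concrete shape and stride values are hypothetical, chosen only for illustration. It shows that the stride-derived size depends on the strides that are passed in, and that an explicit total_size bypasses the calculation.

```python
from math import prod


def stride_based_size(shape, strides):
    """Smallest buffer length (in elements) that covers every index
    reachable with the given shape and strides."""
    return sum((shp - 1) * s for shp, s in zip(shape, strides)) + 1


def resolve_total_size(shape, strides=None, total_size=None):
    """Mirror of the quoted branch (hypothetical helper, not DaCe API)."""
    if strides is not None and shape is not None and total_size is None:
        return stride_based_size(shape, strides)
    return total_size or prod(shape)


shape = (4, 3, 2)  # stand-in values for (__I, __J, __K)

# Dense row-major strides: the stride-based size equals the element count.
dense = resolve_total_size(shape, strides=(6, 2, 1))

# Padded leading stride (assumed for illustration): the size grows.
padded = resolve_total_size(shape, strides=(8, 2, 1))

# Explicit total_size: the stride-based branch is skipped entirely.
explicit = resolve_total_size(shape, strides=(6, 2, 1), total_size=prod(shape))

print(dense, padded, explicit)
```

Passing total_size explicitly, as in the last call, is the plain-Python analogue of the first fix proposed in this issue.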

To fix this issue, we either have to provide total_size when we call the sdfg.add_array function, or we have to change the data::array constructor. Here is the generated C++ code with this fix:

__state->__0_w1 = new double DACE_ALIGN(64)[((__I * __J) * __K)];
__state->__0_g_rat = new double DACE_ALIGN(64)[((__I * __J) * __K)];
__state->__0_bb = new double DACE_ALIGN(64)[((__I * __J) * __K)];
__state->__0_dd = new double DACE_ALIGN(64)[((__I * __J) * __K)];
__state->__0_bet = new double DACE_ALIGN(64)[((__I * __J) * __K)];
__state->__0_pp = new double DACE_ALIGN(64)[((__I * __J) * __K)];
__state->__0_gam = new double DACE_ALIGN(64)[((__I * __J) * __K)];
__state->__0_aa = new double DACE_ALIGN(64)[((__I * __J) * __K)];
__state->__0_p1 = new double DACE_ALIGN(64)[((__I * __J) * __K)];

After fixing this bug, we see about a 20% performance improvement on a Frontier CPU run.