inducer/loopy

Conditionals may include inappropriate parallel inames

kaushikcfd opened this issue · 2 comments

Consider the fairly straightforward transformation:

import loopy as lp
import numpy as np
import pyopencl as cl


t_unit = lp.make_kernel(
    "{[i,j,k]: 0<=i,j<72 and 0<=k<32}",
    """
    C[i,j] = sum(k, A[i,k] * B[k,j])
    """,
    [lp.GlobalArg("A,B", dtype=np.float64, shape=lp.auto),
     ...],
    lang_version=(2018, 2),
)
ref_t_unit = t_unit

Tx = 8
Ty = 23
Tk = 11

t_unit = lp.split_iname(t_unit, "i", Tx, inner_tag="l.0", outer_tag="g.0")
t_unit = lp.split_iname(t_unit, "j", Ty, inner_tag="l.1", outer_tag="g.1")
t_unit = lp.split_iname(t_unit, "k", Tk)
t_unit = lp.add_prefetch(
    t_unit, "A",
    sweep_inames=["i_inner", "k_inner"],
    temporary_address_space=lp.AddressSpace.LOCAL,
    fetch_outer_inames=frozenset({"i_outer", "j_outer", "k_outer"}),
    dim_arg_names=["iprftch_A", "kprftch_A"],
    default_tag=None,
)

t_unit = lp.add_prefetch(
    t_unit, "B",
    sweep_inames=["k_inner", "j_inner"],
    temporary_address_space=lp.AddressSpace.LOCAL,
    fetch_outer_inames=frozenset({"i_outer", "j_outer", "k_outer"}),
    dim_arg_names=["kprftch_B", "jprftch_B"],
    default_tag=None,
)

t_unit = lp.split_iname(t_unit, "kprftch_A", Tx, inner_tag="l.0")
t_unit = lp.split_iname(t_unit, "iprftch_A", Ty, inner_tag="l.1")
t_unit = lp.split_iname(t_unit, "jprftch_B", Tx, inner_tag="l.0")
t_unit = lp.split_iname(t_unit, "kprftch_B", Ty, inner_tag="l.1")

ctx = cl.create_some_context()
lp.auto_test_vs_ref(ref_t_unit, ctx, t_unit)

fails with

Traceback (most recent call last):
  File "/home/line/temp/loopy_mwe_for_multi_loops.py", line 49, in <module>
    lp.auto_test_vs_ref(ref_t_unit, ctx, t_unit)
  File "/home/line/projects/pytato_env/src/loopy/loopy/auto_test.py", line 602, in auto_test_vs_ref
    raise AutomaticTestFailure(error)
loopy.diagnostic.AutomaticTestFailure: results do not match -- (rel) l_2 err: 5.35702e-05, l_inf err: 1

I'm unsure whether the transformation itself is incorrect or loopy's code generator is to blame.

After looking at this for a bit, this indeed looks like a loopy bug. The issue is that the domain contains multiple loops tagged with l.0 and l.1, and that leads to incorrect loop-bound calculation. For the above example, the loop bounds for prefetching A should not depend on g.1, but the generated code disagrees.
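To make the failure mode concrete, here is a small pure-Python sketch (not loopy's generated code) of how a bound on g.1 could corrupt the prefetch of A. It assumes the spurious conditional is the bound that belongs to j, namely `Ty*g1 + l1 < 72`, being applied to the fetch as well, since both `j_inner` and `iprftch_A_inner` share the l.1 tag. The function name `fetched_rows_of_A` and the `spurious_guard` flag are made up for this illustration.

```python
# Tile sizes and problem extent taken from the report above.
Tx, Ty, Tk = 8, 23, 11
N = 72  # extent of i and j


def fetched_rows_of_A(g1, spurious_guard):
    """Rows of the Tx-row tile of A that a work-group with group id g1
    (along axis 1) would load, with l.1 acting as iprftch_A_inner."""
    rows = set()
    for l1 in range(Ty):  # local id along axis 1
        # Legitimate guard: the prefetched tile of A has only Tx rows.
        if l1 >= Tx:
            continue
        # Assumed spurious guard: the bound of j (23*g1 + l1 < 72),
        # which has nothing to do with which rows of A are needed.
        if spurious_guard and Ty * g1 + l1 >= N:
            continue
        rows.add(l1)
    return rows


num_groups_1 = -(-N // Ty)  # ceil(72 / 23)
last_group = num_groups_1 - 1

expected = fetched_rows_of_A(last_group, spurious_guard=False)
observed = fetched_rows_of_A(last_group, spurious_guard=True)
missing = sorted(expected - observed)
print("rows of the A tile never fetched in the last g.1 group:", missing)
```

Under this assumption only the last group along g.1 is affected, because 72 is not a multiple of 23. The compute loop in that group still reads every row of the A tile, so the skipped rows would hold uninitialized local memory, which would be consistent with a mismatch confined to part of the output.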

I attempted to remove this interference by introducing a transformation called decouple domains, but even that does not work, because CodeGenerationState.implemented_domains does not distinguish between non-interfering local hardware inames.

Conclusion: This transformation cannot be implemented in current loopy. The good news is that these details fall in the undefined realm of loopy, so we just need to patch those definitions while accounting for such use cases.

Thanks for finding this issue! I'm pretty surprised this hasn't bitten us sooner.