Conditionals may include inappropriate parallel inames
kaushikcfd opened this issue · 2 comments
Consider the fairly straightforward transformation:
import loopy as lp
import numpy as np
import pyopencl as cl
t_unit = lp.make_kernel(
"{[i,j,k]: 0<=i,j<72 and 0<=k<32}",
"""
C[i,j] = sum(k, A[i,k] * B[k,j])
""",
[lp.GlobalArg("A,B", dtype=np.float64, shape=lp.auto),
...],
lang_version=(2018, 2),
)
ref_t_unit = t_unit
Tx = 8
Ty = 23
Tk = 11
t_unit = lp.split_iname(t_unit, "i", Tx, inner_tag="l.0", outer_tag="g.0")
t_unit = lp.split_iname(t_unit, "j", Ty, inner_tag="l.1", outer_tag="g.1")
t_unit = lp.split_iname(t_unit, "k", Tk)
t_unit = lp.add_prefetch(
t_unit, "A",
sweep_inames=["i_inner", "k_inner"],
temporary_address_space=lp.AddressSpace.LOCAL,
fetch_outer_inames=frozenset({"i_outer", "j_outer", "k_outer"}),
dim_arg_names=["iprftch_A", "kprftch_A"],
default_tag=None,
)
t_unit = lp.add_prefetch(
t_unit, "B",
sweep_inames=["k_inner", "j_inner"],
temporary_address_space=lp.AddressSpace.LOCAL,
fetch_outer_inames=frozenset({"i_outer", "j_outer", "k_outer"}),
dim_arg_names=["kprftch_B", "jprftch_B"],
default_tag=None,
)
t_unit = lp.split_iname(t_unit, "kprftch_A", Tx, inner_tag="l.0")
t_unit = lp.split_iname(t_unit, "iprftch_A", Ty, inner_tag="l.1")
t_unit = lp.split_iname(t_unit, "jprftch_B", Tx, inner_tag="l.0")
t_unit = lp.split_iname(t_unit, "kprftch_B", Ty, inner_tag="l.1")
ctx = cl.create_some_context()
lp.auto_test_vs_ref(ref_t_unit, ctx, t_unit)
fails with
Traceback (most recent call last):
File "/home/line/temp/loopy_mwe_for_multi_loops.py", line 49, in <module>
lp.auto_test_vs_ref(ref_t_unit, ctx, t_unit)
File "/home/line/projects/pytato_env/src/loopy/loopy/auto_test.py", line 602, in auto_test_vs_ref
raise AutomaticTestFailure(error)
loopy.diagnostic.AutomaticTestFailure: results do not match -- (rel) l_2 err: 5.35702e-05, l_inf err: 1
I'm unsure whether the transformation is incorrect or loopy's code-generator is to be blamed.
After looking at this for a bit, this indeed looks like a loopy bug. The issue is that the domain contains multiple loops tagged with l.0
, l.1
and that leads to incorrect loo- bound calculation. For the above example, loop-bounds for prefetching A
should not depend on g.1
, but the generated code disagrees.
I attempted to decouple this interference by introducing a transformation called decouple domains but even that does not work as CodeGenerationState.implemented_domains
does not distinguish between non-interfering local hardware inames.
Conclusion: This transformation cannot be implemented in current-loopy, but the good thing is that these details are in the undefined realm of Loopy, so we just need to patch those definitions while accounting for such use-cases.
Thanks for finding this issue! I'm pretty surprised this hasn't bitten us sooner.