Does "local_accessor" really have common reference semantics?
gmlueck opened this issue · 1 comments
We state in section 4.5.2 that local_accessor
has common reference semantics, which means that a copy of a local_accessor
object:
must behave as-if it were the original instance and as-if any action performed on it were also performed on the original instance
If that were true, I would expect two local_accessor
objects that are copies to reference the same underlying local memory. This is not how DPC++ behaves currently, and I wonder if we really intended this to be the case. Consider the following test:
#include <iostream>
#include <sycl/sycl.hpp>
static constexpr size_t SIZE = 16;
static constexpr size_t WGSIZE = 4;
int main() {
sycl::queue q;
sycl::buffer<size_t> buf{{SIZE}};
q.submit([&](sycl::handler &cgh) {
sycl::accessor acc{buf, cgh};
sycl::local_accessor<size_t> lacc1{{WGSIZE}, cgh};
#ifdef COPY_IN_HOST
sycl::local_accessor<size_t> lacc2{lacc1};
#endif
sycl::nd_range ndr{{SIZE}, {WGSIZE}};
cgh.parallel_for(ndr, [=](sycl::nd_item<> it) {
#ifndef COPY_IN_HOST
sycl::local_accessor<size_t> lacc2{lacc1};
#endif
size_t lid = it.get_local_linear_id();
size_t gid = it.get_global_linear_id();
lacc1[lid] = gid;
// Should this work? If "lacc1" and "lacc2" have common reference
// semantics, then they should both point to the same memory, and you
// would expect this to read the same memory as "lacc1[lid]".
acc[gid] = lacc2[lid];
});
});
bool ok = true;
sycl::host_accessor ha{buf};
for (int i = 0; i < SIZE; i++) {
if (ha[i] != i) {
std::cout << "Error on element " << i << ": " << ha[i] << "\n";
ok = false;
}
}
if (ok) {
std::cout << "Calculation OK\n";
}
return (ok) ? 0 : 1;
}
When the local_accessor
object is copied in host code (command group scope), the test fails:
$ clang++ -fsycl -DCOPY_IN_HOST -o local-accessor-copy local-accessor-copy.cpp
$ ./local-accessor-copy
Error on element 1: 0
Error on element 2: 0
Error on element 3: 0
Error on element 4: 0
Error on element 5: 0
Error on element 6: 0
Error on element 7: 0
Error on element 8: 0
Error on element 9: 0
Error on element 10: 0
Error on element 11: 0
Error on element 12: 0
Error on element 13: 0
Error on element 14: 0
Error on element 15: 0
However, when the local_accessor
object is copied inside the kernel, the test passes:
$ clang++ -fsycl -UCOPY_IN_HOST -o local-accessor-copy local-accessor-copy.cpp
$ ./local-accessor-copy
Calculation OK
When you understand the implementation, this behavior makes sense. Each local accessor kernel argument creates its own unique SLM pointer inside the kernel, thus the test fails when the copy happens on the host. However, inside the kernel, the copy just copies the underlying SLM pointer, so the copies alias the same memory. Thus, the test passes when the copy happens inside the kernel.
However, the spec doesn't say this is the expected behavior. Furthermore, this behavior likely seems bizarre to someone who does not understand the implementation.
It would be tempting to sidestep the issue and just delete the copy constructor and copy assignment operator from local_accessor
. However, I fear that existing applications may pass local_accessor
by value to function calls, and this wouldn't work if we delete these.
Note that there is no similar issue with regular accessor
. When you copy an accessor
object on the host, the copy points to the same buffer, and the common reference semantics are upheld.
If that were true, I would expect two local_accessor objects that are copies to reference the same underlying local memory. This is not how DPC++ behaves currently, and I wonder if we really intended this to be the case. Consider the following test:
I did not yet run your test case, but my expectation that AdaptiveCpp honors the reference semantics in that case. So I believe that common reference semantics are implementable here. What AdaptiveCpp does is that it manages a single local memory allocation for all local accessors, and the local accessor only stores offsets into that memory. So when you copy an accessor, only the offset is copied, but the offset is still applied to the same local memory allocation.
EDIT: Yes, works with AdaptiveCpp
$ acpp -o la -DCOPY_IN_HOST la.cpp
acpp warning: No optimization flag was given, optimizations are disabled by default. Performance may be degraded. Compile with e.g. -O2/-O3 to enable optimizations.
$ ACPP_VISIBILITY_MASK="omp;cuda" ./la
[AdaptiveCpp Warning] dag_direct_scheduler: Detected a requirement that is neither of discard access mode (SYCL 1.2.1) nor no_init property (SYCL 2020) that accesses uninitialized data. Consider changing to discard/no_init. Optimizing potential data transfers away.
'+ptx86' is not a recognized feature for this target (ignoring feature)
'+ptx86' is not a recognized feature for this target (ignoring feature)
'+ptx86' is not a recognized feature for this target (ignoring feature)
[AdaptiveCpp Warning] kernel_cache: This application run has resulted in new binaries being JIT-compiled. This indicates that the runtime optimization process has not yet reached peak performance. You may want to run the application again until this warning no longer appears to achieve optimal performance.
Calculation OK
$ acpp -o la -UCOPY_IN_HOST la.cpp
acpp warning: No optimization flag was given, optimizations are disabled by default. Performance may be degraded. Compile with e.g. -O2/-O3 to enable optimizations.
$ ACPP_VISIBILITY_MASK="omp;cuda" ./la
[AdaptiveCpp Warning] dag_direct_scheduler: Detected a requirement that is neither of discard access mode (SYCL 1.2.1) nor no_init property (SYCL 2020) that accesses uninitialized data. Consider changing to discard/no_init. Optimizing potential data transfers away.
'+ptx86' is not a recognized feature for this target (ignoring feature)
'+ptx86' is not a recognized feature for this target (ignoring feature)
'+ptx86' is not a recognized feature for this target (ignoring feature)
[AdaptiveCpp Warning] kernel_cache: This application run has resulted in new binaries being JIT-compiled. This indicates that the runtime optimization process has not yet reached peak performance. You may want to run the application again until this warning no longer appears to achieve optimal performance.
Calculation OK