Kernel.dispose always fails in openjdk12
AlexanderFedyukov opened this issue · 12 comments
Every call of Kernel.dispose fails in openjdk-12 and in openjdk-11 with error:
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f9c2c4481ac, pid=21273, tid=21274
#
# JRE version: OpenJDK Runtime Environment (12.0.2+9) (build 12.0.2+9)
# Java VM: OpenJDK 64-Bit Server VM (12.0.2+9, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# V [libjvm.so+0xc201ac] OopStorage::Block::release_entries(unsigned long, OopStorage*)+0x3c
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f8aebbacf4c, pid=23028, tid=23032
#
# JRE version: OpenJDK Runtime Environment (11.0.4+11) (build 11.0.4+11)
# Java VM: OpenJDK 64-Bit Server VM (11.0.4+11, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# V [libjvm.so+0xc25f4c] OopStorage::Block::release_entries(unsigned long, OopStorage::Block* volatile*)+0x3c
In openjdk-8 it works well.
@AlexanderFedyukov Please detail your full system configuration: Linux distribution, kernel version, libdrm version, mesa version, GPU. I've been using Ubuntu 18.04 LTS, kernel 4.15.0, libdrm 2.4.97 and mesa 19.0.8 with OpenJDK 11.0.4, and kernel.dispose() causes no issue.
I can gather detailed platform info later. But I suppose the cause of the issue is in the code, so I'll prepare and publish a sample.
@CoreRasurae, can you check this sample from aparapi-examples? It fails with the same error too.
My system is Fedora 30, kernel 5.2.8-200.fc30.x86_64. Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Baffin [Radeon RX 460/560D / Pro 450/455/460/555/555X/560/560X] (rev e5). JVM: OpenJDK 64-Bit Server VM 19.3 (build 12.0.2+9, mixed mode, sharing).
@AlexanderFedyukov I tried the sample you provided and it does replicate the problem. It happens with every aparapi-jni version that I've tried: 1.2.0, 1.3.1, 1.4.0, 1.4.1. I'll give it a look...
The root cause of this bug is a leftover from the original, unfinished Aparapi sketch: the segmentation fault occurs when a multidimensional array (dimension > 1) is present and JNIContext::dispose() from aparapi-native is called.
When that happens, JNIContext::dispose() tries to do:
jenv->DeleteWeakGlobalRef((jweak) arg->aparapiBuffer->javaObject);
but there is no corresponding call to NewWeakGlobalRef(...). For 2D and 3D arrays there is no need to access data across JNI calls, so no WeakGlobalRef is allocated in the first place, and the result is a free without a matching allocate.
This mimicked what is done with 1D arrays for arg->arrayBuffer->javaArray. However, 2D and 3D arrays are handled differently: Java does not allocate contiguous memory for multidimensional arrays, so Aparapi has to handle them in a different manner.
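To illustrate why non-contiguous storage matters, here is a minimal C++ sketch. It models a Java int[][] as a vector of separately allocated rows and copies it into one contiguous row-major buffer, which is the layout a device buffer usually expects. The names flattenRowMajor and flatIndex are hypothetical, and this is only one common approach, not necessarily what aparapi-native does internally.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Models a Java int[][]: every row is its own allocation, so the rows are
// not guaranteed to sit next to each other in memory.
using Matrix = std::vector<std::vector<int>>;

// Copies a rectangular row-of-rows matrix into one contiguous,
// row-major buffer (hypothetical helper, for illustration only).
std::vector<int> flattenRowMajor(const Matrix& rows) {
    const std::size_t cols = rows.empty() ? 0 : rows.front().size();
    std::vector<int> flat;
    flat.reserve(rows.size() * cols);
    for (const auto& row : rows) {
        assert(row.size() == cols);  // this sketch assumes a rectangular array
        flat.insert(flat.end(), row.begin(), row.end());
    }
    return flat;
}

// Maps a (row, col) pair to its position in the flattened buffer.
std::size_t flatIndex(std::size_t row, std::size_t col, std::size_t cols) {
    return row * cols + col;
}
```

With this layout, element (1, 2) of a 2x3 matrix lands at index 5 of the flat buffer.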
The fix for this issue involves only aparapi-native.
Correction: there are two different ways to fix the bug:
a) Remove all references to the Java Object by making aparapi-native retrieve the current address of the buffer, which shouldn't change between execution() and result retrieval(), despite crossing more than one JNI call.
b) The one I ended up implementing: make sure NewWeakGlobalRef(...) is called when a multidimensional array is provided as a Kernel argument, so that the address of the original Java array object can be retrieved later, during a subsequent JNI call.
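The pairing rule behind option b) can be sketched without a real JVM. The C++ mock below is an illustration only: MockJvm, MockBuffer, setupArg, disposeArg and disposeSucceeds are hypothetical names, not aparapi-native code, and the mock throws an exception where a real JVM may crash as in the logs above. It shows that dispose is safe only when setup created the weak reference it later deletes.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <stdexcept>

// Hypothetical stand-in for the JVM's weak global reference bookkeeping.
class MockJvm {
public:
    using Handle = int;

    // Plays the role of NewWeakGlobalRef: registers and returns a handle.
    Handle newWeakGlobalRef() {
        Handle h = next_++;
        live_.insert(h);
        return h;
    }

    // Plays the role of DeleteWeakGlobalRef: rejects unknown handles.
    void deleteWeakGlobalRef(Handle h) {
        if (live_.erase(h) == 0) {
            throw std::logic_error("deleting a weak ref that was never created");
        }
    }

    std::size_t liveCount() const { return live_.size(); }

private:
    std::set<Handle> live_;
    Handle next_ = 1;
};

// Hypothetical stand-in for the buffer of a multidimensional argument.
struct MockBuffer {
    MockJvm::Handle javaObject = 0;
};

// createWeakRef == false models the bug: javaObject holds a value that was
// never registered as a weak ref. createWeakRef == true models option b).
MockBuffer setupArg(MockJvm& jvm, bool createWeakRef) {
    MockBuffer buf;
    buf.javaObject = createWeakRef ? jvm.newWeakGlobalRef() : 9999;
    return buf;
}

// Mirrors the dispose path: always deletes the stored reference.
void disposeArg(MockJvm& jvm, MockBuffer& buf) {
    jvm.deleteWeakGlobalRef(buf.javaObject);
    buf.javaObject = 0;
}

// Returns true when setup followed by dispose completes cleanly.
bool disposeSucceeds(bool createWeakRef) {
    MockJvm jvm;
    MockBuffer buf = setupArg(jvm, createWeakRef);
    try {
        disposeArg(jvm, buf);
    } catch (const std::logic_error&) {
        return false;
    }
    assert(jvm.liveCount() == 0);  // nothing leaked after a clean dispose
    return true;
}
```

Here disposeSucceeds(false) returns false (the unpaired delete is rejected), while disposeSucceeds(true) returns true.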
Sorry, your solution is not clear to me. Can you clarify: does a workaround exist?
@AlexanderFedyukov A new version of Aparapi JNI is on its way, which will solve the issue.
fixed
Good news! Thank you a lot!