openxla/xla

Panic when running a custom_call on GPU (CUDA)


Dear XLA team,
I integrated a CUDA custom-call operator into JAX. While using this operator, I encountered a CUDA_ERROR_ILLEGAL_ADDRESS error. I believe this error does not originate from our custom-call operator and is more likely to come from XLA. Our jaxlib version is 0.4.7.
Can I fix this problem by adjusting XLA flags or by upgrading to a newer version?

Here are my XLA logs:

jaxlib.xla_extension.XlaRuntimeError: INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.func.launch' failed: Failed to launch CUDA kernel: add with block dimensions: 128x1x1 and grid dimensions: 448x1x1: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered.
2024-08-10 13:09:02.411862: E external/xla/xla/stream_executor/cuda/cuda_driver.cc:663] failed to unload module 0x5215a9a0; leaking: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2024-08-10 13:09:02.416148: E external/xla/xla/stream_executor/cuda/cuda_driver.cc:663] failed to unload module 0x512060a0; leaking: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered

2024-08-10 13:09:04.162614: E external/xla/xla/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2024-08-10 13:09:04.162654: E external/xla/xla/stream_executor/cuda/cuda_driver.cc:1035] could not wait stream on event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2024-08-10 13:09:04.162663: E external/xla/xla/stream_executor/stream.cc:1120] Error waiting for event in stream: error recording waiting for CUDA event on stream 0xdde638e0; not marking stream as bad, as the Event object may be at fault. Monitor for further errors.
2024-08-10 13:09:04.162674: E external/xla/xla/stream_executor/cuda/cuda_gpu_executor.cc:790] failed to record completion event; therefore, failed to create inter-stream dependency
2024-08-10 13:09:04.162684: E external/xla/xla/stream_executor/cuda/cuda_driver.cc:617] unable to add host callback: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2024-08-10 13:09:04.162697: E external/xla/xla/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2024-08-10 13:09:04.162702: E external/xla/xla/stream_executor/cuda/cuda_driver.cc:1035] could not wait stream on event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2024-08-10 13:09:04.162706: E external/xla/xla/stream_executor/stream.cc:1120] Error waiting for event in stream: error recording waiting for CUDA event on stream 0xc87e6730; not marking stream as bad, as the Event object may be at fault. Monitor for further errors.
2024-08-10 13:09:04.162711: E external/xla/xla/stream_executor/cuda/cuda_gpu_executor.cc:790] failed to record completion event; therefore, failed to create inter-stream dependency
2024-08-10 13:09:04.162717: E external/xla/xla/stream_executor/cuda/cuda_driver.cc:617] unable to add host callback: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2024-08-10 13:09:04.162726: E external/xla/xla/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2024-08-10 13:09:04.162732: E external/xla/xla/stream_executor/cuda/cuda_driver.cc:1035] could not wait stream on event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2024-08-10 13:09:04.162736: E external/xla/xla/stream_executor/stream.cc:1120] Error waiting for event in stream: error recording waiting for CUDA event on stream 0x557f5d60; not marking stream as bad, as the Event object may be at fault. Monitor for further errors.
2024-08-10 13:09:04.162742: E external/xla/xla/stream_executor/cuda/cuda_gpu_executor.cc:790] failed to record completion event; therefore, failed to create inter-stream dependency
2024-08-10 13:09:04.162748: E external/xla/xla/stream_executor/cuda/cuda_driver.cc:617] unable to add host callback: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2024-08-10 13:09:04.169708: E external/xla/xla/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2024-08-10 13:09:04.169736: E external/xla/xla/stream_executor/cuda/cuda_driver.cc:698] could not allocate CUDA stream for context 0x2947f40: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2024-08-10 13:09:04.169745: E external/xla/xla/stream_executor/stream.cc:312] failed to allocate stream during initialization
2024-08-10 13:09:04.169755: E external/xla/xla/stream_executor/cuda/cuda_driver.cc:698] could not allocate CUDA stream for context 0x2947f40: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2024-08-10 13:09:04.169760: E external/xla/xla/stream_executor/stream.cc:312] failed to allocate stream during initialization
2024-08-10 13:09:04.169772: E external/xla/xla/stream_executor/cuda/cuda_driver.cc:617] unable to add host callback: CUDA_ERROR_INVALID_HANDLE: invalid resource handle
2024-08-10 13:09:04.169789: E external/xla/xla/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2024-08-10 13:09:04.169794: E external/xla/xla/stream_executor/cuda/cuda_driver.cc:698] could not allocate CUDA stream for context 0x2947f40: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2024-08-10 13:09:04.169797: E external/xla/xla/stream_executor/stream.cc:312] failed to allocate stream during initialization
2024-08-10 13:09:04.169804: E external/xla/xla/stream_executor/cuda/cuda_driver.cc:698] could not allocate CUDA stream for context 0x2947f40: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2024-08-10 13:09:04.169809: E external/xla/xla/stream_executor/stream.cc:312] failed to allocate stream during initialization
2024-08-10 13:09:04.169815: E external/xla/xla/stream_executor/cuda/cuda_driver.cc:617] unable to add host callback: CUDA_ERROR_INVALID_HANDLE: invalid resource handle
2024-08-10 13:09:04.169823: E external/xla/xla/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2024-08-10 13:09:04.169830: E external/xla/xla/stream_executor/cuda/cuda_driver.cc:698] could not allocate CUDA stream for context 0x2947f40: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2024-08-10 13:09:04.169835: E external/xla/xla/stream_executor/stream.cc:312] failed to allocate stream during initialization
2024-08-10 13:09:04.169842: E external/xla/xla/stream_executor/cuda/cuda_driver.cc:698] could not allocate CUDA stream for context 0x2947f40: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2024-08-10 13:09:04.169847: E external/xla/xla/stream_executor/stream.cc:312] failed to allocate stream during initialization
2024-08-10 13:09:04.169855: E external/xla/xla/stream_executor/cuda/cuda_driver.cc:617] unable to add host callback: CUDA_ERROR_INVALID_HANDLE: invalid resource handle
2024-08-10 13:09:04.169891: E external/xla/xla/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2024-08-10 13:09:04.169897: E external/xla/xla/stream_executor/cuda/cuda_driver.cc:698] could not allocate CUDA stream for context 0x2947f40: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2024-08-10 13:09:04.169900: E external/xla/xla/stream_executor/stream.cc:312] failed to allocate stream during initialization
2024-08-10 13:09:04.169905: E external/xla/xla/stream_executor/cuda/cuda_driver.cc:698] could not allocate CUDA stream for context 0x2947f40: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2024-08-10 13:09:04.169910: E external/xla/xla/stream_executor/stream.cc:312] failed to allocate stream during initialization
2024-08-10 13:09:04.169916: E external/xla/xla/stream_executor/cuda/cuda_driver.cc:617] unable to add host callback: CUDA_ERROR_INVALID_HANDLE: invalid resource handle
2024-08-10 13:09:04.169924: E external/xla/xla/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2024-08-10 13:09:04.169929: E external/xla/xla/stream_executor/cuda/cuda_driver.cc:698] could not allocate CUDA stream for context 0x2947f40: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2024-08-10 13:09:04.169935: E external/xla/xla/stream_executor/stream.cc:312] failed to allocate stream during initialization
2024-08-10 13:09:04.169942: E external/xla/xla/stream_executor/cuda/cuda_driver.cc:698] could not allocate CUDA stream for context 0x2947f40: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2024-08-10 13:09:04.169949: E external/xla/xla/stream_executor/stream.cc:312] failed to allocate stream during initialization
2024-08-10 13:09:04.169954: E external/xla/xla/stream_executor/cuda/cuda_driver.cc:617] unable to add host callback: CUDA_ERROR_INVALID_HANDLE: invalid resource handle
2024-08-10 13:09:04.169967: E external/xla/xla/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2024-08-10 13:09:04.169972: E external/xla/xla/stream_executor/cuda/cuda_driver.cc:698] could not allocate CUDA stream for context 0x2947f40: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2024-08-10 13:09:04.169975: E external/xla/xla/stream_executor/stream.cc:312] failed to allocate stream during initialization
2024-08-10 13:09:04.169981: E external/xla/xla/stream_executor/cuda/cuda_driver.cc:698] could not allocate CUDA stream for context 0x2947f40: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2024-08-10 13:09:04.169987: E external/xla/xla/stream_executor/stream.cc:312] failed to allocate stream during initialization
2024-08-10 13:09:04.169993: E external/xla/xla/stream_executor/cuda/cuda_driver.cc:617] unable to add host callback: CUDA_ERROR_INVALID_HANDLE: invalid resource handle

custom call 'xla.gpu.func.launch' failed: it looks like you are using a really old XLA version. This execution path was disabled in January 2024.
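
To confirm which versions are actually installed before and after an upgrade, a minimal sketch along these lines may help. It uses only the Python standard library. The helper names (`installed_version`, `version_tuple`, `is_older_than`) are made up for illustration and are not part of JAX or XLA. The thread does not say which jaxlib release first dropped this execution path, so any minimum version passed to `is_older_than` is your own assumption.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version):
    """Turn '0.4.7' into (0, 4, 7); parsing stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def is_older_than(version, minimum):
    """True if `version` sorts before `minimum` (hypothetical threshold chosen by the caller)."""
    return version_tuple(version) < version_tuple(minimum)


if __name__ == "__main__":
    # Print what is installed in the current environment.
    for name in ("jax", "jaxlib"):
        print(name, installed_version(name) or "not installed")
```

Running the script in the environment that produced the logs above should report the jaxlib version mentioned in the issue (0.4.7). After upgrading, rerun it to confirm that the new version is the one being picked up.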