NTT123/vietTTS

could not synchronize on CUDA context

lethanhson9901 opened this issue · 1 comment

Today, I ran your acoustic model on Colab and hit this issue:


training: 0% 0/1900001 [00:00<?, ?it/s]2021-12-07 03:51:13.659473: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2085] Execution of replica 0 failed: INTERNAL: CUBLAS_STATUS_EXECUTION_FAILED
training: 0% 0/1900001 [00:16<?, ?it/s]
Traceback (most recent call last):
File "/content/drive/MyDrive/vietTTS/vietTTS/nat/acoustic_trainer.py", line 139, in
train()
File "/content/drive/MyDrive/vietTTS/vietTTS/nat/acoustic_trainer.py", line 101, in train
loss, (params, aux, rng, optim_state) = update(params, aux, rng, optim_state, batch)
File "/usr/local/lib/python3.7/dist-packages/jax/_src/traceback_util.py", line 162, in reraise_with_filtered_traceback
return fun(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/jax/_src/api.py", line 419, in cache_miss
donated_invars=donated_invars, inline=inline)
File "/usr/local/lib/python3.7/dist-packages/jax/core.py", line 1632, in bind
return call_bind(self, fun, *args, **params)
File "/usr/local/lib/python3.7/dist-packages/jax/core.py", line 1623, in call_bind
outs = primitive.process(top_trace, fun, tracers, params)
File "/usr/local/lib/python3.7/dist-packages/jax/core.py", line 1635, in process
return trace.process_call(self, fun, tracers, params)
File "/usr/local/lib/python3.7/dist-packages/jax/core.py", line 627, in process_call
return primitive.impl(f, *tracers, **params)
File "/usr/local/lib/python3.7/dist-packages/jax/interpreters/xla.py", line 690, in _xla_call_impl
out = compiled_fun(*args)
File "/usr/local/lib/python3.7/dist-packages/jax/interpreters/xla.py", line 1100, in _execute_compiled
out_bufs = compiled.execute(input_bufs)
jax._src.traceback_util.UnfilteredStackTrace: RuntimeError: INTERNAL: CUBLAS_STATUS_EXECUTION_FAILED

The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/content/drive/MyDrive/vietTTS/vietTTS/nat/acoustic_trainer.py", line 139, in
train()
File "/content/drive/MyDrive/vietTTS/vietTTS/nat/acoustic_trainer.py", line 101, in train
loss, (params, aux, rng, optim_state) = update(params, aux, rng, optim_state, batch)
File "/usr/local/lib/python3.7/dist-packages/jax/interpreters/xla.py", line 1100, in _execute_compiled
out_bufs = compiled.execute(input_bufs)
RuntimeError: INTERNAL: CUBLAS_STATUS_EXECUTION_FAILED
2021-12-07 03:51:14.389335: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:1047] could not synchronize on CUDA context: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered :: *** Begin stack trace ***

_PyModule_ClearDict
PyImport_Cleanup
Py_FinalizeEx

_Py_UnixMain
__libc_start_main
_start

*** End stack trace ***

2021-12-07 03:51:14.389456: F external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_executable.cc:124] Check failed: pair.first->SynchronizeAllActivity()

I guess this issue comes from a version mismatch in the dependencies.
Could you please pin the exact versions you use, or update the requirements?
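In the meantime, a quick sanity check I can run on the Colab runtime is to print which jax/jaxlib build is actually installed and whether the GPU backend is active (a jaxlib wheel built for a different CUDA version than the runtime is a common source of CUBLAS / illegal-address failures). This is just a diagnostic sketch, not part of the vietTTS code:

    import jax
    import jaxlib

    # Installed JAX package versions.
    print("jax", jax.__version__, "jaxlib", jaxlib.__version__)

    # Confirm the GPU backend is active and which devices XLA sees.
    print("backend:", jax.default_backend())
    print("devices:", jax.devices())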

I have faced the same issue. My workaround was to move my whole workspace to another Colab account (I train my model on Google Colab), but I think that is only a temporary fix.
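For what it's worth, since the crash ends in CUDA_ERROR_ILLEGAL_ADDRESS, another thing that may be worth trying (purely a guess, I have not confirmed it is the cause here) is to stop XLA from preallocating a large fraction of the GPU memory before starting the trainer. XLA_PYTHON_CLIENT_PREALLOCATE is a standard JAX environment variable; the snippet below is only a sketch of how to set it:

    import os

    # Tell XLA to allocate GPU memory on demand instead of grabbing a large
    # block up front. This must be set before JAX initializes its GPU backend,
    # i.e. before the first JAX computation in the trainer process.
    os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"

    import jax
    print(jax.devices())  # backend now starts with on-demand allocation

On Colab it can also be set with %env XLA_PYTHON_CLIENT_PREALLOCATE=false in the cell that launches the trainer.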