athena-team/athena

There is a problem when testing my installation

huhailang9012 opened this issue · 2 comments

(venv_athena) maggie@maggie-Z490-UD:~/venv_athena/athena$ source tools/env.sh
(venv_athena) maggie@maggie-Z490-UD:~/venv_athena/athena$ python examples/translate/spa-eng-example/prepare_data.py examples/translate/spa-eng-example/data/train.csv
INFO:absl:Successfully generated csv file examples/translate/spa-eng-example/data/train.csv
(venv_athena) maggie@maggie-Z490-UD:~/venv_athena/athena$ python athena/main.py examples/translate/spa-eng-example/transformer.json
There is some problem with your horovod installation. But it wouldn't affect single-gpu training
There is some problem with your horovod installation. But it wouldn't affect single-gpu training
pydecoder is not installed, this will only affect WFST decoding
INFO:absl:hparams: [('batch_size', 16), ('ckpt', 'examples/translate/spa-eng-example/ckpts/transformer'), ('cls', 'main'), ('convert_config', None), ('dataset_builder', 'language_dataset'), ('dev_dataset_builder', None), ('devset_config', {'data_csv': 'examples/translate/spa-eng-example/data/train.csv', 'input_text_config': {'type': 'text'}, 'output_text_config': {'type': 'text'}}), ('inference_config', None), ('model', 'translate_transformer'), ('model_config', {'d_model': 512, 'num_heads': 8, 'num_encoder_layers': 12, 'num_decoder_layers': 6, 'dff': 2048, 'rate': 0.1, 'label_smoothing_rate': 0.0}), ('num_classes', None), ('num_data_threads', 1), ('num_epochs', 20), ('optimizer', 'warmup_adam'), ('optimizer_config', {'d_model': 512, 'warmup_steps': 8000, 'k': 0.5}), ('pretrained_model', None), ('solver_config', {'clip_norm': 100, 'log_interval': 10, 'enable_tf_function': True}), ('solver_gpu', [0]), ('solver_type', 'asr'), ('sorta_epoch', 1), ('summary_dir', None), ('teacher_model', None), ('testset_config', None), ('trainset_config', {'data_csv': 'examples/translate/spa-eng-example/data/train.csv', 'input_text_config': {'type': 'text'}, 'output_text_config': {'type': 'text'}})]
2021-07-27 14:03:44.335595: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-07-27 14:03:44.345751: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-27 14:03:44.346085: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: NVIDIA GeForce RTX 3070 major: 8 minor: 6 memoryClockRate(GHz): 1.815
pciBusID: 0000:01:00.0
2021-07-27 14:03:44.346211: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-07-27 14:03:44.346985: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-07-27 14:03:44.347620: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-07-27 14:03:44.347765: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-07-27 14:03:44.348604: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-07-27 14:03:44.349221: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-07-27 14:03:44.351107: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-07-27 14:03:44.351174: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-27 14:03:44.351523: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-27 14:03:44.351831: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
INFO:absl:hparams: [('batch_size', 16), ('ckpt', 'examples/translate/spa-eng-example/ckpts/transformer'), ('cls', 'main'), ('convert_config', None), ('dataset_builder', 'language_dataset'), ('dev_dataset_builder', None), ('devset_config', {'data_csv': 'examples/translate/spa-eng-example/data/train.csv', 'input_text_config': {'type': 'text'}, 'output_text_config': {'type': 'text'}}), ('inference_config', None), ('model', 'translate_transformer'), ('model_config', {'d_model': 512, 'num_heads': 8, 'num_encoder_layers': 12, 'num_decoder_layers': 6, 'dff': 2048, 'rate': 0.1, 'label_smoothing_rate': 0.0}), ('num_classes', None), ('num_data_threads', 1), ('num_epochs', 20), ('optimizer', 'warmup_adam'), ('optimizer_config', {'d_model': 512, 'warmup_steps': 8000, 'k': 0.5}), ('pretrained_model', None), ('solver_config', {'clip_norm': 100, 'log_interval': 10, 'enable_tf_function': True}), ('solver_gpu', [0]), ('solver_type', 'asr'), ('sorta_epoch', 1), ('summary_dir', None), ('teacher_model', None), ('testset_config', None), ('trainset_config', {'data_csv': 'examples/translate/spa-eng-example/data/train.csv', 'input_text_config': {'type': 'text'}, 'output_text_config': {'type': 'text'}})]
INFO:absl:hparams: [('cls', <class 'athena.data.datasets.language_set.LanguageDatasetBuilder'>), ('data_csv', 'examples/translate/spa-eng-example/data/train.csv'), ('input_length_range', [1, 1000]), ('input_text_config', {'type': 'text'}), ('output_length_range', [1, 1000]), ('output_text_config', {'type': 'text'})]
INFO:absl:Loading data from examples/translate/spa-eng-example/data/train.csv
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 118964/118964 [00:01<00:00, 73807.93it/s]
2021-07-27 14:03:47.612261: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2021-07-27 14:03:47.632301: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2899885000 Hz
2021-07-27 14:03:47.632745: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x630ec70 executing computations on platform Host. Devices:
2021-07-27 14:03:47.632756: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): Host, Default Version
2021-07-27 14:03:47.686435: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-27 14:03:47.686833: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x61692c0 executing computations on platform CUDA. Devices:
2021-07-27 14:03:47.686847: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): NVIDIA GeForce RTX 3070, Compute Capability 8.6
2021-07-27 14:03:47.686976: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-27 14:03:47.687276: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: NVIDIA GeForce RTX 3070 major: 8 minor: 6 memoryClockRate(GHz): 1.815
pciBusID: 0000:01:00.0
2021-07-27 14:03:47.687308: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-07-27 14:03:47.687321: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-07-27 14:03:47.687331: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-07-27 14:03:47.687340: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-07-27 14:03:47.687349: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-07-27 14:03:47.687359: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-07-27 14:03:47.687370: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-07-27 14:03:47.687403: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-27 14:03:47.687710: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-27 14:03:47.687999: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-07-27 14:03:47.688023: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-07-27 14:03:47.688816: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-07-27 14:03:47.688825: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2021-07-27 14:03:47.688829: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2021-07-27 14:03:47.688885: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-27 14:03:47.689202: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-27 14:03:47.689507: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7088 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3070, pci bus id: 0000:01:00.0, compute capability: 8.6)
INFO:absl:trying to restore from : examples/translate/spa-eng-example/ckpts/transformer
2021-07-27 14:05:56.211044: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-07-27 14:05:56.577310: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
File "athena/main.py", line 198, in <module>
train(jsonfile, BaseSolver, 1, 0)
File "athena/main.py", line 139, in train
p, model, optimizer, checkpointer = build_model_from_jsonfile(jsonfile)
File "athena/main.py", line 128, in build_model_from_jsonfile
solver.evaluate_step(model.prepare_samples(iter(dataset).next()))
File "/home/maggie/venv_athena/athena/athena/solver.py", line 118, in evaluate_step
outputs = self.model(samples, training=False)
File "/home/maggie/venv_athena/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 891, in __call__
outputs = self.call(cast_inputs, *args, **kwargs)
File "/home/maggie/venv_athena/athena/athena/models/translate_transformer.py", line 94, in call
training=training
File "/home/maggie/venv_athena/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 891, in __call__
outputs = self.call(cast_inputs, *args, **kwargs)
File "/home/maggie/venv_athena/athena/athena/layers/transformer.py", line 142, in call
return_attention_weights=return_attention_weights, training=training
File "/home/maggie/venv_athena/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 891, in __call__
outputs = self.call(cast_inputs, *args, **kwargs)
File "/home/maggie/venv_athena/athena/athena/layers/transformer.py", line 229, in call
training=training,
File "/home/maggie/venv_athena/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 891, in __call__
outputs = self.call(cast_inputs, *args, **kwargs)
File "/home/maggie/venv_athena/athena/athena/layers/transformer.py", line 383, in call
out = self.attn1(tgt, tgt, tgt, mask=tgt_mask)[0]
File "/home/maggie/venv_athena/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 891, in __call__
outputs = self.call(cast_inputs, *args, **kwargs)
File "/home/maggie/venv_athena/athena/athena/layers/attention.py", line 151, in call
q = self.wq(q) # (batch_size, seq_len, hiddn_dim)
File "/home/maggie/venv_athena/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 891, in __call__
outputs = self.call(cast_inputs, *args, **kwargs)
File "/home/maggie/venv_athena/lib/python3.6/site-packages/tensorflow_core/python/keras/layers/core.py", line 1045, in call
outputs = standard_ops.tensordot(inputs, self.kernel, [[rank - 1], [0]])
File "/home/maggie/venv_athena/lib/python3.6/site-packages/tensorflow_core/python/ops/math_ops.py", line 4077, in tensordot
ab_matmul = matmul(a_reshape, b_reshape)
File "/home/maggie/venv_athena/lib/python3.6/site-packages/tensorflow_core/python/util/dispatch.py", line 180, in wrapper
return target(*args, **kwargs)
File "/home/maggie/venv_athena/lib/python3.6/site-packages/tensorflow_core/python/ops/math_ops.py", line 2765, in matmul
a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
File "/home/maggie/venv_athena/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_math_ops.py", line 6126, in mat_mul
_six.raise_from(_core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(48, 512), b.shape=(512, 512), m=48, n=512, k=512 [Op:MatMul] name: neural_translate_transformer/transformer/transformer_decoder/transformer_decoder_layer/multi_head_attention_12/dense_72/Tensordot/MatMul/

Training runs normally when I use the CPU instead of the GPU.
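
A possible lead, not confirmed anywhere in this thread: the log reports an RTX 3070 with compute capability 8.6, while TensorFlow loads `libcudart.so.10.0`. NVIDIA added support for compute capability 8.6 in CUDA 11.1, so a CUDA 10.0 stack may be unable to run kernels on this card, which could explain why the first matrix multiplication fails on the GPU but the CPU works. The sketch below reads both values out of a saved log and flags the mismatch. The names `MIN_CUDA_FOR_CAPABILITY`, `parse_log`, and `check_compatibility` are hypothetical helpers written for this illustration, not part of Athena or TensorFlow, and the table covers only a few architectures.

```python
import re

# Minimum CUDA toolkit version that can target a given compute capability.
# Assumption: only a few entries are listed here; extend the table as needed.
MIN_CUDA_FOR_CAPABILITY = {
    (7, 5): (10, 0),  # Turing
    (8, 0): (11, 0),  # Ampere (A100)
    (8, 6): (11, 1),  # Ampere (RTX 30 series)
}


def parse_log(log_text):
    """Return (compute capability, CUDA runtime version) found in a TensorFlow log."""
    cap = re.search(r"compute capability:? (\d+)\.(\d+)", log_text, re.IGNORECASE)
    cudart = re.search(r"libcudart\.so\.(\d+)\.(\d+)", log_text)
    capability = (int(cap.group(1)), int(cap.group(2))) if cap else None
    runtime = (int(cudart.group(1)), int(cudart.group(2))) if cudart else None
    return capability, runtime


def check_compatibility(log_text):
    """Compare the GPU's compute capability with the CUDA runtime TensorFlow loaded."""
    capability, runtime = parse_log(log_text)
    if capability is None or runtime is None:
        return "unknown: could not find both values in the log"
    required = MIN_CUDA_FOR_CAPABILITY.get(capability)
    if required is None:
        return "unknown: compute capability is not in the table"
    if runtime < required:
        return (
            "mismatch: compute capability {}.{} needs CUDA >= {}.{}, "
            "but the log shows CUDA {}.{}".format(*capability, *required, *runtime)
        )
    return "ok"


if __name__ == "__main__":
    # Two lines copied from the log in this issue.
    sample = (
        "Successfully opened dynamic library libcudart.so.10.0\n"
        "StreamExecutor device (0): NVIDIA GeForce RTX 3070, Compute Capability 8.6\n"
    )
    print(check_compatibility(sample))
```

If the check reports a mismatch, one thing to try would be a TensorFlow build that targets CUDA 11.1 or later; whether Athena runs on such a build is not something this thread establishes.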

stale commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale commented

This issue is closed. You can also re-open it if needed.