athena-team/athena

do you really need tensorflow addons?

zcy618 opened this issue · 31 comments

(venv_athena) (base) [aa@aa athena]$ ./examples/asr/aishell/run.sh
Creating csv
/data/nfs_rt16/aa/asr/venv_athena/lib/python3.7/site-packages/tensorflow_addons/utils/ensure_tf_install.py:68: UserWarning: Tensorflow Addons supports using Python ops for all Tensorflow versions above or equal to 2.2.0 and strictly below 2.4.0 (nightly versions are not supported).
The versions of TensorFlow you are currently using is 2.0.0 and is not supported.
Some things might work, some things might not.
If you were to encounter a bug, do not file an issue.
If you want to make sure you're using a tested and supported configuration, either change the TensorFlow version or the TensorFlow Addons's version.
You can find the compatibility matrix in TensorFlow Addon's readme:
https://github.com/tensorflow/addons
UserWarning,
Traceback (most recent call last):
  File "examples/asr/aishell/local/prepare_data.py", line 29, in <module>
    from athena import get_wave_file_length
  File "/data/nfs_rt16/chenyu/asr/athena/athena/__init__.py", line 35, in <module>
    from .layers.commons import PositionalEncoding
  File "/data/nfs_rt16/chenyu/asr/athena/athena/layers/commons.py", line 22, in <module>
    import tensorflow_addons as tfa
  File "/data/nfs_rt16/chenyu/asr/venv_athena/lib/python3.7/site-packages/tensorflow_addons/__init__.py", line 21, in <module>
    from tensorflow_addons import activations
  File "/data/nfs_rt16/chenyu/asr/venv_athena/lib/python3.7/site-packages/tensorflow_addons/activations/__init__.py", line 17, in <module>
    from tensorflow_addons.activations.gelu import gelu
  File "/data/nfs_rt16/aa/asr/venv_athena/lib/python3.7/site-packages/tensorflow_addons/activations/gelu.py", line 27, in <module>
    @tf.keras.utils.register_keras_serializable(package="Addons")
AttributeError: module 'tensorflow_core.keras.utils' has no attribute 'register_keras_serializable'
(venv_athena) (base) [aa@aa athena]$

Your latest version reports the errors above; if I remove tensorflow addons, it reports errors like the ones below:
Creating csv
Traceback (most recent call last):
  File "examples/asr/aishell/local/prepare_data.py", line 29, in <module>
    from athena import get_wave_file_length
  File "/data/nfs_rt16/chenyu/asr/athena/athena/__init__.py", line 35, in <module>
    from .layers.commons import PositionalEncoding
  File "/data/nfs_rt16/chenyu/asr/athena/athena/layers/commons.py", line 22, in <module>
    import tensorflow_addons as tfa
ModuleNotFoundError: No module named 'tensorflow_addons'

Could you please check whether something is missing?
Thanks.

tf 2.0.0 is no longer used because of security issues. The current version of tf for Athena is 2.0.1 (bumped in this pr). Please try upgrading tf to 2.0.1 first.

If the issue still persists, I guess you could always comment out the code that uses tfa along with the import itself (basically, comment out all code related to InstanceNormalization in commons.py)...
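To make that suggestion concrete, here is a minimal sketch (my own illustrative code, not Athena's actual commons.py) of how the tfa import could be guarded so the package still loads when tensorflow_addons is missing or incompatible:

```python
# Guarded import: tolerate both ModuleNotFoundError (tfa not installed) and
# the AttributeError raised above when tfa is incompatible with the TF version.
try:
    import tensorflow_addons as tfa  # optional dependency
    HAS_TFA = True
except (ImportError, AttributeError):
    tfa = None
    HAS_TFA = False

def instance_norm_or_fail():
    # InstanceNormalization only exists in tensorflow_addons, so without tfa
    # the options are to fail loudly here or swap in a different norm layer.
    if not HAS_TFA:
        raise RuntimeError("InstanceNormalization requires tensorflow_addons")
    return tfa.layers.InstanceNormalization()
```

With a guard like this, scripts such as prepare_data.py can still import athena without tfa; only the code paths that actually need InstanceNormalization would fail.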

hi Some-random, I will try 2.0.1, but I still have one question and one suggestion:
1. Actually, I do not understand what you mean by "If that issue still persists, I guess you could always comment out codes using tfa and the import itself (basically comment out all code related to InstanceNormalization in commons.py)...".

2. I have been trying Athena for many days, and from my first attempt until now I have run into many problems caused by changes between versions. My suggestion is: could you publish a stable release? You could tag a stable release that has passed the necessary tests, and mark versions still under development, without enough testing, as alpha or beta.
Thanks

Thank you so much for your suggestions! Athena is obviously still in development and there are lots of changes happening. We will release a stable version of Athena with thorough tests once we finish the modules we're currently developing, and we will conduct more tests whenever we add something new.

As for your specific question, we will update the tf version requirements in README.md. What I meant by "commenting out tfa..." is that you can simply comment out the lines in commons.py that use tfa (that will be lines 22, 122, 123, 127, 129, 143, 144, 14 and 150).

stale commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

hi dear friends:
I have upgraded TensorFlow from 2.0.0 to 2.0.1, but I still get errors like before:

Fine-tuning
[1,1]:/data/nfs_rt16/chenyu/asr/venv_athena/lib/python3.7/site-packages/tensorflow_addons/utils/ensure_tf_install.py:68: UserWarning: Tensorflow Addons supports using Python ops for all Tensorflow versions above or equal to 2.2.0 and strictly below 2.4.0 (nightly versions are not supported).
[1,1]: The versions of TensorFlow you are currently using is 2.0.1 and is not supported.
[1,1]:Some things might work, some things might not.
[1,1]:If you were to encounter a bug, do not file an issue.
[1,1]:If you want to make sure you're using a tested and supported configuration, either change the TensorFlow version or the TensorFlow Addons's version.
[1,1]:You can find the compatibility matrix in TensorFlow Addon's readme:
[1,1]:https://github.com/tensorflow/addons
[1,1]: UserWarning,
[1,1]:Traceback (most recent call last):
[1,1]:  File "athena/horovod_main.py", line 25, in <module>
[1,1]:    from athena import HorovodSolver
[1,1]:  File "/data/nfs_rt16/chenyu/asr/athena/athena/__init__.py", line 35, in <module>
[1,1]:    from .layers.commons import PositionalEncoding
[1,1]:  File "/data/nfs_rt16/chenyu/asr/athena/athena/layers/commons.py", line 22, in <module>
[1,1]:    import tensorflow_addons as tfa
[1,1]:  File "/data/nfs_rt16/chenyu/asr/venv_athena/lib/python3.7/site-packages/tensorflow_addons/__init__.py", line 21, in <module>
[1,1]:    from tensorflow_addons import activations
[1,1]:  File "/data/nfs_rt16/chenyu/asr/venv_athena/lib/python3.7/site-packages/tensorflow_addons/activations/__init__.py", line 17, in <module>
[1,1]:    from tensorflow_addons.activations.gelu import gelu
[1,1]:  File "/data/nfs_rt16/chenyu/asr/venv_athena/lib/python3.7/site-packages/tensorflow_addons/activations/gelu.py", line 27, in <module>
[1,1]:    @tf.keras.utils.register_keras_serializable(package="Addons")
[1,1]:AttributeError: module 'tensorflow_core.keras.utils' has no attribute 'register_keras_serializable'
(ranks [1,0], [1,2] and [1,3] print the identical warning and traceback)

So how should I fix this, please?
Thanks.

We've decided that the compatibility mechanism of tensorflow addons is too user-unfriendly, and we're removing the dependency in this PR. Feel free to pull from it!

hi Some-random:
Thanks very much for the quick response!

Now I have pulled the latest code, but I get a new issue:
[1,0]:2020-08-31 22:46:45.037824: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
[1,2]:2020-08-31 22:46:45.039328: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
[1,1]:2020-08-31 22:46:45.043928: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
[1,3]:2020-08-31 22:46:45.242072: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1, 2, 3, 4, 5, 6, 7
[1,0]:2020-08-31 22:46:45.273337: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1, 2, 3, 4, 5, 6, 7
[1,2]:2020-08-31 22:46:45.274437: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1, 2, 3, 4, 5, 6, 7
[1,1]:2020-08-31 22:46:45.277094: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1, 2, 3, 4, 5, 6, 7
[1,2]:Traceback (most recent call last):
[1,2]:  File "athena/horovod_main.py", line 41, in <module>
[1,2]:    HorovodSolver.initialize_devices(p.solver_gpu)
[1,2]:  File "/data/nfs_rt16/chenyu/asr/athena/athena/solver.py", line 155, in initialize_devices
[1,2]:    raise ValueError("If the list of solver gpus is not empty, its size should " +
[1,2]:ValueError: If the list of solver gpus is not empty, its size should not be smaller than that of horovod configuration
(ranks [1,0], [1,1] and [1,3] raise the identical ValueError)

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[56126,1],0]
Exit code: 1

I just started running from stage 3.
Thanks

If you want to use multiple GPUs, horovod is needed and the command line should look like this:
horovodrun -np 4 -H localhost:4 xxxxxxx
"4" is the number of GPUs you want to use.
The number of visible GPUs you specify should not be smaller than this number. Alternatively, you can set solver_gpu to empty, and it will automatically use the first 4 GPUs.
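For illustration, the consistency check behind this error can be sketched like so (the function name is mine; the real check lives in HorovodSolver.initialize_devices in athena/solver.py):

```python
def check_solver_gpus(solver_gpus, horovod_size):
    # An empty list means "use the first horovod_size visible GPUs".
    # A non-empty list must provide at least one GPU per horovod process.
    if solver_gpus and len(solver_gpus) < horovod_size:
        raise ValueError(
            "If the list of solver gpus is not empty, its size should "
            "not be smaller than that of horovod configuration")

check_solver_gpus([], 4)             # OK: GPUs assigned automatically
check_solver_gpus([0, 1, 2, 3], 4)   # OK: one GPU per horovod process
# check_solver_gpus([0], 4) would raise the ValueError shown above
```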

hi cookingbear:
I am using examples/asr/aishell/run.sh; it ran correctly earlier, but after I pulled your latest code it reports the errors above. Do you think I missed something?
Thanks.

Can you set solver_gpu in the corresponding config to empty and try again? That may be caused by a missing update.

I have changed "solver_gpu": [0] to "solver_gpu": [] in examples/asr/aishell/configs/mtl_transformer_sp.json, but it still reports the same errors. Could you tell me where else you suggest changing, please?
Thanks.

Can you please show the command that reported this error?

./example/asr/aishell/run.sh
Thanks

Which command line in run.sh?

I just changed stage=3.
Thanks

any update please?
Thanks.

Sorry for the late reply. I tested run.sh again and your problem did not occur. Can you please pull the newest code and test it again?

hi cookingbear:
Following your suggestion, I pulled the latest master code and reran ./example/asr/aishell/run.sh, but I got a new error:
Fine-tuning
[1,2]:Traceback (most recent call last):
[1,2]:  File "athena/horovod_main.py", line 25, in <module>
[1,2]:    from athena import HorovodSolver
[1,2]:  File "/data/nfs_rt16/chenyu/asr/athena/athena/__init__.py", line 18, in <module>
[1,2]:    from .data import SpeechRecognitionDatasetBuilder
[1,2]:  File "/data/nfs_rt16/chenyu/asr/athena/athena/data/__init__.py", line 21, in <module>
[1,2]:    from .datasets.speech_set import SpeechDatasetBuilder
[1,2]:  File "/data/nfs_rt16/chenyu/asr/athena/athena/data/datasets/speech_set.py", line 24, in <module>
[1,2]:    class SpeechDatasetBuilder(BaseDatasetBuilder):
[1,2]:NameError: name 'BaseDatasetBuilder' is not defined
(ranks [1,0], [1,1] and [1,3] raise the identical NameError)

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[24516,1],2]
Exit code: 1

Thanks.

BTW, judging from the recent state of the code, the project does not seem very stable; I keep running into various errors. Could I ask when a stable version might be released?
Thanks.

Sorry for the late reply. There have been lots of updates recently. We will release a stable version after these updates. For now, you can run the aishell recipe successfully after pulling the newest code.

Dear friends:
Following your suggestion, I pulled the latest code and completed the aishell data processing with it, but I do not know whether it finished correctly. The last log line is "computing score with sclite ...", and then there is no more output. I checked decode.log in the athena folder, attached below:
decode.log

I do not see much useful information in it. Also, in the folder athena/score_save, the decode.list file is empty.

Thanks

Hi zcy618,

This happens because the decode results were not successfully written to decode.log by ">" redirection. I updated our code to write the inference log directly to a file via a Python script in this pr. Try pulling the code, and feel free to reply if you have further questions.

Thanks,
Ne

hi neneluo:
I pulled the latest code and changed stage=6 in example/asr/aishell/run.sh; the output is as below:

computing score with sclite ...
| SPKR | # Snt # Wrd | Corr Sub Del Ins Err S.Err |
| Sum/Avg| 0 0 | 0.0 0.0 0.0 0.0 0.0 0.0 |

Do you think this is correct, please? BTW, I have uploaded the score folder's files here:

score.zip

Hi zcy618,

No, I am afraid not, and your inference.log might be empty. Have you finished the training and inference stages (i.e. stage 3 and stage 5 in run.sh)? Could you upload the training log and inference log too, please?

Thanks,
Ne

Yes, I completed them earlier, so after you updated the code I re-ran from stage 6. Do you think I need to re-run from stage 3?
Thanks.

It is not necessary to re-run from stage 3. Please check your dev accuracy in the training log and make sure decode results appear in inference.log first.

The training log should contain lines like this:

INFO:absl:global_steps: 11753   learning_rate: 5.7651e-04       loss: 7.5574    Accuracy: 0.9115        CTCAccuracy: 0.8670     sec/iter: 0.4043
INFO:absl:global_steps: 11763   learning_rate: 5.7626e-04       loss: 23.4307   Accuracy: 0.9013        CTCAccuracy: 0.8503     sec/iter: 0.4290
INFO:absl:>>>>> start evaluate in epoch 16
INFO:absl:please be patient, enable tf.function, it takes time ...
WARNING:absl:the length of logits is shorter than that of labels
INFO:absl:loss: 19.1721 Accuracy: 0.8155        CTCAccuracy: 0.7443
INFO:absl:loss: 30.9493 Accuracy: 0.8029        CTCAccuracy: 0.7711
INFO:absl:loss: 30.2459 Accuracy: 0.7997        CTCAccuracy: 0.7715
INFO:absl:epoch: 16     loss: 31.2108   Accuracy: 0.7918        CTCAccuracy: 0.7697
INFO:absl:saving model in :examples/asr/timit/ckpts/mtl_transformer_ctc_sp/ckpt

(this is a log for another dataset, so the accuracy is low)
You can check whether your training process is correct by grepping for epoch: to see the accuracies (these lines in the log file report accuracy over the dev set).
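For instance, the grep above can be sketched like this (sample_train.log is a made-up file with two lines in the style of the log excerpt; substitute your real log path):

```shell
# Two sample lines in the format of the training log above.
cat > sample_train.log <<'EOF'
INFO:absl:global_steps: 11763   learning_rate: 5.7626e-04       loss: 23.4307   Accuracy: 0.9013
INFO:absl:epoch: 16     loss: 31.2108   Accuracy: 0.7918        CTCAccuracy: 0.7697
EOF

# Only the per-epoch dev-set summaries match "epoch:" (the
# ">>>>> start evaluate in epoch" banner has no colon after "epoch",
# so it is filtered out automatically).
grep 'epoch:' sample_train.log
```

If the printed accuracies trend upwards across epochs, training is progressing as expected.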

The inference.log should contain lines like this:

INFO:absl:predictions: tf.Tensor([[ 957  120  139 1471 1742 3665 1258 2553 3737 1172 4232]], shape=(1, 11), dtype=int64)        labels: [[ 110  120  139 1471 1742 3665 1258 2553 3737 1172]]   errs: 1 avg_acc: 0.9233 sec/iter: 2.4665
INFO:absl:predictions: tf.Tensor([[1011   71 4156 1149 4228 3609 2993 1778 3490 4232]], shape=(1, 10), dtype=int64)     labels: [[1011   71 4156 1149 4228 3609 2993 1778 3490]]        errs: 0 avg_acc: 0.9235 sec/iter: 2.2437
INFO:absl:predictions: tf.Tensor([[2896  463 3696  807  843  139 2553 2019 2130 4232]], shape=(1, 10), dtype=int64)     labels: [[2896  463 3696  807  843  139 2553 2019 2130]]        errs: 0 avg_acc: 0.9237 sec/iter: 2.1542
INFO:absl:predictions: tf.Tensor([[1966  815 1254 2661  426    7   28 1020 1674 4232]], shape=(1, 10), dtype=int64)     labels: [[1966  815 1254 2661  426    7   28 1020 1674]]        errs: 0 avg_acc: 0.9239 sec/iter: 2.2038

Otherwise, sclite cannot generate meaningful scoring results.
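A quick pre-sclite sanity check can be sketched as a shell function (check_decodes is a hypothetical helper name, and the default inference.log path is an assumption; the sed pattern matches the sample lines above):

```shell
# Fail if the log has no decode lines; otherwise print the running
# avg_acc from the last decoded utterance.
check_decodes() {
    log=${1:-inference.log}
    if ! grep -q 'predictions:' "$log"; then
        echo "no decode results in $log - re-run the decoding stage" >&2
        return 1
    fi
    # avg_acc is the number right after the literal "avg_acc:" token.
    grep 'predictions:' "$log" | tail -n 1 |
        sed 's/.*avg_acc: \([0-9.]*\).*/\1/'
}
```

An empty or decode-free inference.log is exactly the situation that produces the all-zero sclite table shown earlier in this thread.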

stale commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.


hi neneluo:
I re-ran the whole run.sh, and this time I think I got the right result. Could you help check it, please?
inference.log.result.txt

Thanks

I think it is correct.

stale commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale commented

This issue is closed. You can also re-open it if needed.