# TensorFlowASR ⚡

Almost State-of-the-art Automatic Speech Recognition in TensorFlow 2

TensorFlowASR implements some automatic speech recognition architectures such as DeepSpeech2, Conformer, etc. These models can be converted to TFLite to reduce memory and computation for deployment 😄
## What's New?
- (11/14/2020) Supported Gradient Accumulation for Training in Larger Batch Size
- (11/3/2020) Reduced differences between `librosa.stft` and `tf.signal.stft`
- (10/31/2020) Updated DeepSpeech2 and supported Jasper (https://arxiv.org/abs/1904.03288)
- (10/18/2020) Supported Streaming Transducer (https://arxiv.org/abs/1811.06621)
- (10/15/2020) Added gradient accumulation and refactored to TensorFlowASR
- (10/10/2020) Updated documents and uploaded package to PyPI
- (10/6/2020) Changed `nlpaug` version to `>=1.0.1`
- (9/18/2020) Supported `word-pieces` (aka `subwords`) using `tensorflow-datasets`
- Supported `transducer` tflite greedy decoding (conversion and invocation)
- Distributed training using `tf.distribute.MirroredStrategy`
## Table of Contents

- What's New?
- Table of Contents
- 😋 Supported Models
- Installation
- Setup training and testing
- TFLite Conversion
- Features Extraction
- Augmentations
- Training & Testing
- Corpus Sources and Pretrained Models
- References & Credits
## 😋 Supported Models

- CTCModel (End2end models using CTC Loss for training)
  - Deep Speech 2 (Reference: https://arxiv.org/abs/1512.02595) See `examples/deepspeech2`
  - Jasper (Reference: https://arxiv.org/abs/1904.03288) See `examples/jasper`
- Transducer Models (End2end models using RNNT Loss for training)
  - Conformer Transducer (Reference: https://arxiv.org/abs/2005.08100) See `examples/conformer`
  - Streaming Transducer (Reference: https://arxiv.org/abs/1811.06621) See `examples/streaming_transducer`
## Installation

Install `tensorflow>=2.3.0` or `tf-nightly`.

For training and testing, you should use `git clone` to install the necessary packages from other authors (`ctc_decoders`, `rnnt_loss`, etc.)
### Installing via PyPi

Run `pip3 install -U TensorFlowASR`
### Installing from source

```bash
git clone https://github.com/TensorSpeech/TensorFlowASR.git
cd TensorFlowASR
python3 setup.py install
```
For anaconda3:

```bash
conda create -y -n tfasr tensorflow-gpu python=3.7 # tensorflow if using CPU
conda activate tfasr
pip install -U tensorflow-gpu # upgrade to latest version of tensorflow
git clone https://github.com/TensorSpeech/TensorFlowASR.git
cd TensorFlowASR
python setup.py install
```
## Setup training and testing

- For datasets, see datasets
- For training, testing and using CTC Models, run `./scripts/install_ctc_decoders.sh`
- For training Transducer Models, run `export CUDA_HOME=/usr/local/cuda && ./scripts/install_rnnt_loss.sh` (Note: only `export CUDA_HOME` when you have CUDA)
- For mixed precision training, use the flag `--mxp` when running python scripts from `examples`
- For enabling XLA, run `TF_XLA_FLAGS=--tf_xla_auto_jit=2 python3 $path_to_py_script`
## TFLite Conversion

After conversion, the TFLite model acts as a function that maps an audio signal directly to Unicode code points, which can then be converted to a string.
- Install `tf-nightly` using `pip install tf-nightly`
- Build a model with the same architecture as the trained model (if the model has a `tflite` argument, you must set it to `True`), then load the weights from the trained model into the built model
- Load `TFSpeechFeaturizer` and `TextFeaturizer` into the model using the `add_featurizers` function
- Convert the model's function to tflite as follows:
```python
func = model.make_tflite_function(greedy=True)  # or False
concrete_func = func.get_concrete_function()
converter = tf.lite.TFLiteConverter.from_concrete_functions([concrete_func])
converter.experimental_new_converter = True
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS,
                                       tf.lite.OpsSet.SELECT_TF_OPS]
tflite_model = converter.convert()
```
- Save the converted tflite model as follows:
```python
import os

if not os.path.exists(os.path.dirname(tflite_path)):
    os.makedirs(os.path.dirname(tflite_path))
with open(tflite_path, "wb") as tflite_out:
    tflite_out.write(tflite_model)
```
- Then the `.tflite` model is ready to be deployed
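Once saved, the model can be run with the TFLite interpreter. Below is a minimal invocation sketch, assuming a model converted with `greedy=True` that takes a single 1-D float32 audio signal and returns Unicode code points; the file name, sample rate and single-input layout are illustrative assumptions, not fixed by the library.

```python
import numpy as np
import tensorflow as tf

tflite_path = "conformer_greedy.tflite"  # hypothetical path from the step above
interpreter = tf.lite.Interpreter(model_path=tflite_path)

signal = np.zeros(16000, dtype=np.float32)  # 1 second of silence at an assumed 16 kHz

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Audio length varies, so resize the input tensor before allocating.
interpreter.resize_tensor_input(input_details[0]["index"], signal.shape)
interpreter.allocate_tensors()
interpreter.set_tensor(input_details[0]["index"], signal)
interpreter.invoke()

# The output is a sequence of Unicode code points; join them into text.
hyp = interpreter.get_tensor(output_details[0]["index"])
print("".join(chr(u) for u in hyp))
```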
## Features Extraction
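Speech features are extracted by `TFSpeechFeaturizer` (also used in the TFLite section above) and configured through the `speech_config` block shown under Training & Testing. As a rough illustration of what such a front end computes, here is a generic log-mel spectrogram sketch built on `tf.signal`; the window, hop and mel settings below are common defaults, not the library's exact configuration.

```python
import tensorflow as tf

def log_mel_spectrogram(signal, sample_rate=16000, num_mel_bins=80):
    """Illustrative log-mel features for a 1-D float32 waveform."""
    # 25 ms windows with a 10 ms hop at 16 kHz.
    stft = tf.signal.stft(signal, frame_length=400, frame_step=160, fft_length=512)
    spectrogram = tf.abs(stft)
    mel_weights = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=num_mel_bins,
        num_spectrogram_bins=stft.shape[-1],  # fft_length // 2 + 1
        sample_rate=sample_rate,
    )
    mel = tf.tensordot(spectrogram, mel_weights, 1)  # [num_frames, num_mel_bins]
    return tf.math.log(mel + 1e-6)

features = log_mel_spectrogram(tf.zeros([16000]))  # dummy 1-second input
```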
## Augmentations

See augmentations
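Augmentations are configured through the `augmentations` block of the YAML config (see Training & Testing below). As a standalone illustration of the kind of transform involved, here is a generic SpecAugment-style time-masking sketch; it is not the library's own implementation.

```python
import tensorflow as tf

def time_mask(features, max_mask_size=40):
    """Zero out one random span of time frames in a [T, F] feature matrix."""
    time_len = tf.shape(features)[0]
    mask_size = tf.minimum(
        tf.random.uniform([], 0, max_mask_size, dtype=tf.int32), time_len)
    start = tf.random.uniform([], 0, time_len - mask_size + 1, dtype=tf.int32)
    # Build a [T, 1] mask that is 0 inside the span and 1 elsewhere,
    # then broadcast it across the feature dimension.
    mask = tf.concat([
        tf.ones([start, 1]),
        tf.zeros([mask_size, 1]),
        tf.ones([time_len - start - mask_size, 1]),
    ], axis=0)
    return features * mask
```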
## Training & Testing

Example YAML config structure:

```yaml
speech_config: ...
model_config: ...
decoder_config: ...
learning_config:
  augmentations: ...
  dataset_config:
    train_paths: ...
    eval_paths: ...
    test_paths: ...
    tfrecords_dir: ...
  optimizer_config: ...
  running_config:
    batch_size: 8
    num_epochs: 20
    outdir: ...
    log_interval_steps: 500
```
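A filled-in config can be inspected with PyYAML before handing it to the example scripts; the file name below is hypothetical, and the scripts themselves load the config through the library's own helpers.

```python
import yaml

with open("config.yml") as f:  # hypothetical config path
    config = yaml.safe_load(f)

running = config["learning_config"]["running_config"]
print(running["batch_size"], running["num_epochs"])  # e.g. 8 20
```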
See `examples` for some predefined ASR models and results.
## Corpus Sources and Pretrained Models

For pretrained models, go to drive
### English

| Name | Source | Hours |
| --- | --- | --- |
| LibriSpeech | LibriSpeech | 970h |
| Common Voice | https://commonvoice.mozilla.org | 1932h |
### Vietnamese

| Name | Source | Hours |
| --- | --- | --- |
| Vivos | https://ailab.hcmus.edu.vn/vivos | 15h |
| InfoRe Technology 1 | InfoRe1 (passwd: BroughtToYouByInfoRe) | 25h |
| InfoRe Technology 2 (used in VLSP2019) | InfoRe2 (passwd: BroughtToYouByInfoRe) | 415h |
### German

| Name | Source | Hours |
| --- | --- | --- |
| Common Voice | https://commonvoice.mozilla.org/ | 750h |