**ESPnet without PyTorch!**

Utility library to easily export ESPnet models to the ONNX format. There is no need to install PyTorch or ESPnet on your machine if you already have the exported files!

**Note:** Currently, TTS is not supported.
`espnet_onnx` can be installed with pip:

```shell
pip install espnet_onnx
```

If you want to export a pretrained model, you also need to install `torch>=1.11.0`, `espnet`, `espnet_model_zoo`, and `onnx`.
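For example, the export dependencies can be installed in one step (exact version pins beyond `torch>=1.11.0` are up to you):

```shell
pip install "torch>=1.11.0" espnet espnet_model_zoo onnx
```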
`espnet_onnx` can export pretrained models published on `espnet_model_zoo`. By default, exported files are stored in `${HOME}/.cache/espnet_onnx/<tag_name>`.

```python
from espnet2.bin.asr_inference import Speech2Text
from espnet_onnx.export import ModelExport
m = ModelExport()
# download with espnet_model_zoo and export from pretrained model
m.export_from_pretrained('<tag name>', quantize=True)
# export from trained model
speech2text = Speech2Text(args)
m.export(speech2text, '<tag name>', quantize=True)
```

For inference, `tag_name` or `model_dir` is used to load the onnx files. The `tag_name` has to be defined in `tag_config.yaml`.

```python
import librosa
from espnet_onnx import Speech2Text
speech2text = Speech2Text(tag_name='<tag name>')
# speech2text = Speech2Text(model_dir='path to the onnx directory')
y, sr = librosa.load('sample.wav', sr=16000)
nbest = speech2text(y)
```
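`nbest` holds the decoded hypotheses; as the streaming example below also demonstrates, the text of the best hypothesis can be read with `nbest[0][0]`:

```python
# nbest[0] is the best hypothesis; its first element is the decoded text
print(nbest[0][0])
```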
For streaming ASR, you can use the `StreamingSpeech2Text` class. The speech length should be the same as `StreamingSpeech2Text.hop_size`.

```python
from espnet_onnx import StreamingSpeech2Text
stream_asr = StreamingSpeech2Text(tag_name)
# start streaming asr
stream_asr.start()
while streaming:
    wav = <some code to get wav>
    assert len(wav) == stream_asr.hop_size
    stream_text = stream_asr(wav)[0][0]
# You can get non-streaming asr result with end function
nbest = stream_asr.end()
```

You can also simulate the streaming model on your wav file with the `simulate` function. Passing `True` as the second argument prints the streaming text, as in the following code.

```python
import librosa
from espnet_onnx import StreamingSpeech2Text
stream_asr = StreamingSpeech2Text(tag_name)
y, sr = librosa.load('path/to/wav', sr=16000)
nbest = stream_asr.simulate(y, True)
# Processing audio with 6 processes.
# Result at position 0 :
# Result at position 1 :
# Result at position 2 : this
# Result at position 3 : this is
# Result at position 4 : this is a
# Result at position 5 : this is a
print(nbest[0][0])
# 'this is a pen'
```
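If you prefer to drive the streaming API yourself rather than using `simulate`, a minimal sketch might look like this. Only `start()`, the call itself, `end()`, and `hop_size` come from the API shown above; the zero-padding of the last chunk is an assumption.

```python
import numpy as np
import librosa
from espnet_onnx import StreamingSpeech2Text

stream_asr = StreamingSpeech2Text(tag_name='<tag name>')
y, sr = librosa.load('path/to/wav', sr=16000)
hop_size = stream_asr.hop_size

stream_asr.start()
for start in range(0, len(y), hop_size):
    chunk = y[start:start + hop_size]
    if len(chunk) < hop_size:
        # zero-pad the final chunk so every call gets exactly hop_size samples
        chunk = np.pad(chunk, (0, hop_size - len(chunk)))
    stream_text = stream_asr(chunk)[0][0]

# get the final, non-streaming result
nbest = stream_asr.end()
print(nbest[0][0])
```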
You can export a pretrained model from a zipped file. The zipped file should contain `meta.yaml`.

```python
from espnet_onnx.export import ModelExport
m = ModelExport()
m.export_from_zip(
    'path/to/the/zipfile',
    tag_name='tag_name_for_zipped_model',
    quantize=True
)
```

You can use a GPU for inference. Please see "How to use GPU on espnet_onnx" below for details.

### How to use GPU on espnet_onnx
#### Install dependency
First, we need the onnxruntime-gpu library instead of onnxruntime. Please follow this article to select and install the correct version of onnxruntime-gpu for your CUDA version.
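For example, a typical switch from the CPU package looks like the following (the exact onnxruntime-gpu version has to match your CUDA installation, as noted above):

```shell
pip uninstall onnxruntime
pip install onnxruntime-gpu
```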
#### Inference on GPU
Now you can speed up inference with the GPU. All you need to do is select the correct providers and pass them to the `Speech2Text` or `StreamingSpeech2Text` instance. See this article for more information about providers.

```python
import librosa
from espnet_onnx import Speech2Text
PROVIDERS = ['CUDAExecutionProvider']
tag_name = 'some_tag_name'
speech2text = Speech2Text(
    tag_name,
    providers=PROVIDERS
)
y, sr = librosa.load('path/to/wav', sr=16000)
nbest = speech2text(y)  # runs on GPU
```

Note that some quantized models are not supported for GPU computation. If you get an error with a quantized model, please try the non-quantized model.
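To verify that your onnxruntime build actually exposes the CUDA provider, you can list the available providers (this uses the standard onnxruntime API, not espnet_onnx):

```python
import onnxruntime

# 'CUDAExecutionProvider' should appear in this list if
# onnxruntime-gpu is installed and matches your CUDA version
print(onnxruntime.get_available_providers())
```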
### `espnet_onnx.Speech2Text`

**args**

- `tag_name`: The `tag_name` defined in `table.csv` in `espnet_model_zoo`. If you set a custom `model_name` when exporting a model with `export()`, then `tag_name` should be that `model_name`. The `tag_name` has to be defined in `tag_config.yaml`, which is created when exporting the model.
- `model_dir`: Path to the model directory. The configuration file should be located at `<model_dir>/config.yaml`.
- `use_quantized`: Flag to use the quantized model.
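For example, loading an exported model from an explicit directory with its quantized weights might look like this (the path is a placeholder):

```python
from espnet_onnx import Speech2Text

# load from a model directory; config.yaml must be inside this directory
speech2text = Speech2Text(
    model_dir='/path/to/exported/model',
    use_quantized=True
)
```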
### `espnet_onnx.export.ModelExport`

**function**

- `export`
  - `model`: Instance of `espnet2.bin.asr_inference.Speech2Text`.
  - `tag_name`: Tag name to identify the onnx model.
  - `quantize`: Flag to create a quantized model.
- `export_from_pretrained`
  - `tag_name`: Tag name to identify the onnx model.
  - `quantize`: Flag to create a quantized model.
To avoid the cache problem, I modified some scripts from the original ESPnet implementation:

- Add `<blank>` before `<sos>`.
- Give some `torch.zeros()` arrays to the model.
- Remove the first token in post-processing (remove `blank`).

I also removed `extend_pe()` from the positional encoding module. The length of `pe` is 512 by default.
ASR: Supported architecture for ASR
Copyright (c) 2022 Masao Someki
Released under the MIT licence
Masao Someki
contact: masao.someki@gmail.com