Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0).

AILabs ASR Python software development kit

Development Environment

  • Python 3.9
# install portaudio first if you develop on macOS
brew install portaudio

pip install --global-option='build_ext' --global-option='-I/usr/local/include' --global-option='-L/usr/local/lib' -r requirements_dev.txt

# if you encounter issues while installing PyAudio,
# check the PyAudio site: https://people.csail.mit.edu/hubert/pyaudio/

Installation

pip install ailabs-asr

Samples

# init the streaming client
asr_client = StreamingClient('api-key-applied-from-devconsole')

# start streaming with wav file
asr_client.start_streaming_wav(
  pipeline='asr-zh-en-std',
  file='voice.wav',
  verbose=False, # enable verbose to show detailed recognition result
  on_processing_sentence=on_processing_sentence,
  on_final_sentence=on_final_sentence)

# omit the file argument to stream from the computer's microphone
asr_client.start_streaming_wav(
  pipeline='asr-zh-en-std',
  on_processing_sentence=on_processing_sentence,
  on_final_sentence=on_final_sentence)

💡 The start_streaming_wav() method accepts callback functions that handle the recognition results; see the Message Format section for the result format.

💡 Look up the available pipelines in the next section.

💡 See more samples in the sample repository.
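
The samples above pass `on_processing_sentence` and `on_final_sentence` without defining them. A minimal sketch of two such callbacks follows. It assumes that each callback is called with a single message shaped like the dicts in the Message Format section; the exact argument type is an assumption, and `final_sentences` is a hypothetical helper list, not part of the SDK.

```python
# Hypothetical callbacks for start_streaming_wav(). Assumption: each one is
# called with a single dict shaped like the messages in "Message Format".
final_sentences = []  # illustrative store for completed sentences


def on_processing_sentence(message):
    # Partial result: it may still change, so only display it.
    print(f"partial: {message['asr_sentence']}")


def on_final_sentence(message):
    # Completed sentence: keep the text for later use.
    final_sentences.append(message["asr_sentence"])
    print(f"final:   {message['asr_sentence']}")
```

Keeping the partial results display-only and storing only final sentences avoids recording text that the recognizer may later revise.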

Supported Languages (pipeline)

| pipeline | Info | language |
| --- | --- | --- |
| asr-zh-en-std | Use it when speakers speak Chinese more than English | Mandarin and English |
| asr-zh-tw-std | Use it when speakers speak Chinese and Taiwanese | Mandarin and Taiwanese |
| asr-en-std | English | English |
| asr-jp-std | Japanese | Japanese |
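
As an illustration of choosing a pipeline, the sketch below maps the set of spoken languages to one of the pipeline names listed above. `PIPELINES` and `pick_pipeline` are hypothetical helpers written for this example and are not part of the SDK; only the pipeline names come from this document.

```python
# Pipeline names copied from the list above, keyed by the spoken languages.
PIPELINES = {
    frozenset({"mandarin", "english"}): "asr-zh-en-std",
    frozenset({"mandarin", "taiwanese"}): "asr-zh-tw-std",
    frozenset({"english"}): "asr-en-std",
    frozenset({"japanese"}): "asr-jp-std",
}


def pick_pipeline(*languages: str) -> str:
    """Return the pipeline name for a set of languages (hypothetical helper)."""
    key = frozenset(lang.lower() for lang in languages)
    try:
        return PIPELINES[key]
    except KeyError:
        raise ValueError(f"no pipeline listed for: {sorted(key)}") from None
```

For example, `pick_pipeline("Mandarin", "English")` yields the name to pass as the `pipeline` argument in the samples.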

Message Format

There are two kinds of recognition results:

The Processing Sentence (Segment)

{
  "asr_sentence": "範例句子"
}

The Final Sentence (Complete Sentence)

{
  "asr_final": true,
  "asr_begin_time": 9.314,
  "asr_end_time": 11.314,
  "asr_sentence": "完整的範例句子",
  "asr_confidence": 0.5263263653207881,
  "asr_word_time_stamp": [
    {
      "word": "完整的",
      "begin_time": 9.74021875,
      "end_time": 10.100875
    },
    {
      "word": "範例句子",
      "begin_time": 10.100875,
      "end_time": 10.1664375
    }
  ],
  "text_segmented": "完整的 範例句子"
}
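
The two message kinds can be told apart by the `asr_final` field, which appears only in the final-sentence example. The sketch below parses both examples with the standard `json` module and derives per-word durations from `asr_word_time_stamp`. The helper names are hypothetical, and treating a missing `asr_final` as "not final" is an assumption based on the examples above.

```python
import json


def is_final(message: dict) -> bool:
    """Return True for a final sentence, False for a processing sentence."""
    # Assumption: processing sentences carry no "asr_final" key.
    return bool(message.get("asr_final", False))


def word_durations(message: dict) -> list[tuple[str, float]]:
    """Return (word, duration in seconds) for each time-stamped word."""
    return [
        (w["word"], w["end_time"] - w["begin_time"])
        for w in message.get("asr_word_time_stamp", [])
    ]


partial = json.loads('{"asr_sentence": "範例句子"}')
final = json.loads("""
{
  "asr_final": true,
  "asr_begin_time": 9.314,
  "asr_end_time": 11.314,
  "asr_sentence": "完整的範例句子",
  "asr_confidence": 0.5263263653207881,
  "asr_word_time_stamp": [
    {"word": "完整的", "begin_time": 9.74021875, "end_time": 10.100875},
    {"word": "範例句子", "begin_time": 10.100875, "end_time": 10.1664375}
  ],
  "text_segmented": "完整的 範例句子"
}
""")

for message in (partial, final):
    kind = "final" if is_final(message) else "processing"
    print(kind, message["asr_sentence"], word_durations(message))
```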

Limitations

Audio Data

⚠️ Send audio data as binary frames with the following spec:

  • Audio data format
    • 16kHz, mono
    • 16 bits per sample
    • PCM
  • Sample rate: 16,000 samples per second (16 kHz)
  • Data rate: 16000 (samples) x 1 (sec) x 16/8 (2 bytes per sample) = 32000 bytes ~= 32 KB per second
  • Chunk size: 2000 bytes each, i.e. 1/16 sec (62.5 ms) of audio
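
The arithmetic above can be checked with a short sketch that uses only the standard `wave` and `io` modules. `matches_spec` and `iter_chunks` are hypothetical helpers written for this example, not SDK functions. They show one way to verify that a WAV file is 16 kHz, mono, 16-bit and to cut raw PCM into 2000-byte chunks.

```python
import io
import wave

SAMPLE_RATE = 16000   # samples per second
SAMPLE_WIDTH = 2      # bytes per sample (16 bits)
CHANNELS = 1          # mono
CHUNK_BYTES = 2000    # chunk size from the spec above

BYTES_PER_SECOND = SAMPLE_RATE * SAMPLE_WIDTH * CHANNELS  # 32000 bytes
CHUNK_SECONDS = CHUNK_BYTES / BYTES_PER_SECOND            # 1/16 second


def matches_spec(wav_file) -> bool:
    """Check that a WAV file (path or file object) is 16 kHz, mono, 16-bit."""
    with wave.open(wav_file, "rb") as wav:
        return (wav.getframerate() == SAMPLE_RATE
                and wav.getnchannels() == CHANNELS
                and wav.getsampwidth() == SAMPLE_WIDTH)


def iter_chunks(pcm: bytes, size: int = CHUNK_BYTES):
    """Yield consecutive chunks of raw PCM; the last one may be shorter."""
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


# Build one second of silence in memory to exercise the helpers.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(CHANNELS)
    wav.setsampwidth(SAMPLE_WIDTH)
    wav.setframerate(SAMPLE_RATE)
    wav.writeframes(b"\x00\x00" * SAMPLE_RATE)
buffer.seek(0)

print(matches_spec(buffer), BYTES_PER_SECOND, CHUNK_SECONDS)
```

One second of audio at this format splits into 32000 / 2000 = 16 chunks, which matches the 1/16 sec chunk duration stated in the list.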