Creates an audio component that can be used to upload/record audio (as an input) or display audio (as an output).


As input component:

Passes audio in one of these formats (depending on type):

🚀🚀🚀a str filepath, or a tuple of (sample rate in Hz, audio data as numpy array).


If the latter, the audio data is a 16-bit int array (dtype=int16) whose values range from -32768 to 32767, and the shape of the array is (samples,) for mono audio or (samples, channels) for multi-channel audio.
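To make the two array shapes concrete, here is a small sketch that builds one second of a test tone as int16 data. The sample rate, the 440 Hz frequency, and the variable names are arbitrary choices for illustration, not anything Gradio requires.

```python
import numpy as np

sample_rate = 16_000  # Hz; an arbitrary choice for this illustration
t = np.arange(sample_rate) / sample_rate  # one second of timestamps

# A 440 Hz tone at half of full scale, stored as 16-bit integers.
mono = (0.5 * np.sin(2 * np.pi * 440 * t) * 32767).astype(np.int16)

# Duplicating the channel gives a (samples, channels) array.
stereo = np.stack([mono, mono], axis=1)

# This is the tuple layout described above: (sample rate, audio data).
audio_value = (sample_rate, mono)

print(mono.dtype, mono.shape, stereo.shape)
```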

Your function should accept one of these types:

def predict(
	value: str | tuple[int, np.ndarray] | None
):
	...

As output component:

Expects audio data in any of these formats:

a str or pathlib.Path filepath or URL to an audio file, or a bytes object (recommended for streaming), or a tuple of (sample rate in Hz, audio data as numpy array).


Note: if audio is supplied as a numpy array, the audio will be normalized by its peak value to avoid distortion or clipping in the resulting audio.
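For intuition, peak normalization can be sketched as below. This is only an illustration of the idea under my own assumptions (scaling so the loudest sample reaches full int16 scale); it is not Gradio's actual implementation, and peak_normalize_to_int16 is a made-up helper name.

```python
import numpy as np

def peak_normalize_to_int16(data: np.ndarray) -> np.ndarray:
    """Scale audio so its loudest sample maps to full int16 scale."""
    data = data.astype(np.float64)
    peak = np.max(np.abs(data)) if data.size else 0.0
    if peak == 0:
        # Pure silence: avoid dividing by zero.
        return np.zeros(data.shape, dtype=np.int16)
    return (data / peak * 32767).astype(np.int16)

quiet = np.array([0.0, 0.1, -0.2, 0.05])
normalized = peak_normalize_to_int16(quiet)
print(normalized)
```

Because every sample is divided by the same peak, the relative shape of the waveform is preserved and no sample can exceed the int16 range.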


Your function should return one of these types:

def predict(···) -> str | Path | bytes | tuple[int, np.ndarray] | None:
	return value

Real Time Speech Recognition:


Automatic speech recognition (ASR), the conversion of spoken speech to text, is a very important and thriving area of machine learning.


ASR algorithms run on practically every smartphone, and are becoming increasingly embedded in professional workflows, such as digital assistants for nurses and doctors.


Because ASR algorithms are designed to be used directly by customers and end users, it is important to validate that they behave as expected when confronted with a wide variety of speech patterns (different accents, pitches, and background audio conditions).


Using Gradio, you can easily build a demo of your ASR model and share it with a testing team, or test it yourself by speaking through the microphone on your device.

This tutorial will show how to take a pretrained speech-to-text model and deploy it with a Gradio interface.


🚨🚨🚨We will start with a full-context model, in which the user speaks the entire audio before the prediction runs.


🔥🔥🔥Then we will adapt the demo to make it streaming, meaning that the audio model will convert speech as you speak.




Make sure you have the gradio Python package already installed. You will also need a pretrained speech recognition model. In this tutorial, we will build demos from 2 ASR libraries:

  • Transformers (for this, pip install transformers and pip install torch)

Make sure you have at least one of these installed so that you can follow along with the tutorial.


You will also need ffmpeg installed on your system, if you do not already have it, to process files from the microphone.
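One quick way to check from Python whether ffmpeg is reachable is to look it up on the PATH with the standard library. The helper name ffmpeg_available is made up for this sketch, and it only confirms that an executable called ffmpeg can be found, not that it works.

```python
import shutil

def ffmpeg_available() -> bool:
    """Return True if an executable named ffmpeg is found on the PATH."""
    return shutil.which("ffmpeg") is not None

print("ffmpeg found" if ffmpeg_available() else "ffmpeg not found on PATH")
```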





Here’s how to build a real time speech recognition (ASR) app:


1. Set up the Transformers ASR Model:

First, you need an ASR model: either one you have trained yourself or a pretrained model that you download.


In this tutorial, we will start with a pretrained ASR model, Whisper.

Here is the code to load Whisper from Hugging Face transformers:

from transformers import pipeline

p = pipeline("automatic-speech-recognition", model="openai/whisper-base.en")

That’s it!


2. Create a Full-Context ASR Demo with Transformers:

We will start by creating a full-context ASR demo, in which the user speaks the full audio before the ASR model runs inference.


This is very easy with Gradio: we simply create a function around the pipeline object above.


We will use Gradio's built-in Audio component, configured to take input from the user's microphone and pass the recording to our function as a (sample rate, numpy array) tuple.


The output component will be a plain Textbox.


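import gradio as gr
from transformers import pipeline
import numpy as np

transcriber = pipeline("automatic-speech-recognition", model="openai/whisper-base.en")

def transcribe(audio):
    # audio is a (sample rate, numpy array) tuple from the Audio component
    sr, y = audio
    # The pipeline expects float32 samples; scale by the peak into [-1, 1]
    y = y.astype(np.float32)
    y /= np.max(np.abs(y))

    return transcriber({"sampling_rate": sr, "raw": y})["text"]

demo = gr.Interface(
    transcribe,
    # Microphone input and plain text output, as described above
    gr.Audio(sources=["microphone"]),
    "text",
)

demo.launch()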


The transcribe function takes a single parameter, audio, which is a tuple of the sample rate and a numpy array of the audio the user recorded.

The pipeline object expects the audio in float32 format, so we first convert it to float32 (scaling it by its peak value), and then extract the transcribed text.
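The conversion step can be pulled into a small helper that also covers two cases the snippet above does not: multi-channel recordings and completely silent audio, where dividing by the peak would divide by zero. This is a sketch under my own assumptions; the name to_float32_mono is hypothetical, and averaging the channels is just one simple way to downmix.

```python
import numpy as np

def to_float32_mono(y: np.ndarray) -> np.ndarray:
    """Convert int16 audio to float32 mono, scaled by its peak into [-1, 1]."""
    y = y.astype(np.float32)
    if y.ndim > 1:
        # (samples, channels) -> (samples,) by averaging the channels.
        y = y.mean(axis=1)
    peak = np.max(np.abs(y)) if y.size else 0.0
    if peak > 0:
        # Skip the division for silent audio to avoid NaN values.
        y = y / peak
    return y

stereo = np.array([[1000, 3000], [-2000, -2000]], dtype=np.int16)
print(to_float32_mono(stereo))
```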

3. Create a Streaming ASR Demo with Transformers:

To make this a streaming demo, we need to make these changes:


  1. Set streaming=True in the Audio component

  2. Set live=True in the Interface

  3. Add a state to the interface to store the recorded audio of a user

Take a look below.


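import gradio as gr
from transformers import pipeline  # library for calling pretrained models
import numpy as np

# Load a pretrained automatic speech recognition model
transcriber = pipeline("automatic-speech-recognition", model="openai/whisper-base.en")

# Handle streaming audio data and transcribe it
def transcribe(stream, new_chunk):
    # new_chunk contains the sample rate (sr) and the audio data (y)
    sr, y = new_chunk
    # Convert the audio data to np.float32
    y = y.astype(np.float32)
    # Normalize the audio data so that its amplitude lies in [-1, 1]
    y /= np.max(np.abs(y))

    # If stream is not empty, append the new audio data to the existing stream
    if stream is not None:
        # Concatenate y onto the end of stream, so that stream holds all the audio collected so far
        stream = np.concatenate([stream, y])
    else:
        # If stream is empty, the new audio data becomes the initial stream
        stream = y
    # Return the updated stream, along with the text transcribed by the model
    return stream, transcriber({"sampling_rate": sr, "raw": stream})["text"]

# Create a Gradio interface
demo = gr.Interface(
    # Use transcribe as the processing function
    transcribe,
    # Inputs: a list of the state and the audio, captured from the microphone and streamed
    ["state", gr.Audio(sources=["microphone"], streaming=True)],
    # Outputs: a list of the state and the text
    ["state", "text"],
    # Enable live mode
    live=True,
)

# Launch the app
demo.launch()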

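🟡🟡🟡Note: every call to the code above ends up processing the entire audio recorded so far, so it becomes very slow. Alternatively, you can process only a short segment on each call. If you do that, you have to take context into account, and you may run into overlap problems.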
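To illustrate what processing short segments with overlap could look like, the sketch below splits an array into fixed-length chunks that share some samples with their neighbors. The function name split_with_overlap and the parameter values are hypothetical, and merging the duplicated text that overlapping chunks produce is a separate problem that this sketch does not address.

```python
import numpy as np

def split_with_overlap(samples: np.ndarray, chunk_len: int, overlap: int) -> list:
    """Split samples into chunks of chunk_len that overlap by `overlap` samples."""
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_len")
    step = chunk_len - overlap
    # Stop once the remaining samples are already covered by the previous chunk.
    last_start = max(len(samples) - overlap, 1)
    return [samples[start:start + chunk_len] for start in range(0, last_start, step)]

chunks = split_with_overlap(np.arange(10), chunk_len=4, overlap=2)
for chunk in chunks:
    print(chunk)
```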

Notice that we now have a state variable, because we need to track all the audio history.


🔥🔥🔥transcribe gets called whenever there is a new small chunk of audio, but we also need to keep track of all the audio that has been spoken so far in state.

As the interface runs, the transcribe function gets called with a record of all the previously spoken audio in stream, as well as the new chunk of audio as new_chunk.

❓❓❓We return the new full audio so that it can be stored back in state, and we also return the transcription.
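The round trip through state can be simulated without Gradio or a real model. In the sketch below, fake_transcriber is a made-up stand-in that only reports how much audio it received, and the loop plays the role of the interface: it passes the previous state in and stores the returned state for the next call.

```python
import numpy as np

def fake_transcriber(inputs: dict) -> dict:
    # Stand-in for the real pipeline: report the duration of the audio it was given.
    seconds = len(inputs["raw"]) / inputs["sampling_rate"]
    return {"text": f"{seconds:.1f}s of audio"}

def transcribe(stream, new_chunk):
    sr, y = new_chunk
    y = y.astype(np.float32)
    # The first call has no history; later calls append to it.
    stream = y if stream is None else np.concatenate([stream, y])
    return stream, fake_transcriber({"sampling_rate": sr, "raw": stream})["text"]

state = None
texts = []
for _ in range(3):
    chunk = (16000, np.zeros(8000, dtype=np.int16))  # half a second per chunk
    state, text = transcribe(state, chunk)
    texts.append(text)

print(texts)
```

Each call sees a longer stream than the one before it, which is why the cost of transcribing grows as the recording continues.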



Here we naively append the audio together and simply call the transcriber object on the entire audio.

⚠️⚠️⚠️You can imagine more efficient ways of handling this, such as re-processing only the last 5 seconds of audio whenever a new chunk of audio is received.
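One possible reading of the "last 5 seconds" idea is to trim the accumulated stream to a fixed-length tail before handing it to the model. The helper name tail_window is hypothetical, and this sketch covers only the trimming; how to combine the text from successive windows is left open.

```python
import numpy as np

def tail_window(stream: np.ndarray, sr: int, seconds: float = 5.0) -> np.ndarray:
    """Return at most the last `seconds` of audio from stream."""
    max_samples = int(sr * seconds)
    if max_samples <= 0:
        # A zero-length window: return an empty slice rather than the whole stream.
        return stream[:0]
    return stream[-max_samples:]

stream = np.arange(1000)          # pretend this is 10 seconds of audio at 100 Hz
tail = tail_window(stream, sr=100)
print(len(tail), tail[0])
```

Inside the streaming transcribe function, the transcriber would then receive tail_window(stream, sr) instead of the full stream, so the work per call stays bounded however long the user speaks.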


Now the ASR model will run inference as you speak!





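import gradio as gr
from transformers import pipeline
import numpy as np

# Local path to the model
model_path = "./large-v3"   # This is the open-source "openai/whisper-large-v3" from HF
                            # On the author's NVIDIA A100-PCIE-40GB, "openai/whisper-large-v3" uses 7269MiB / 40960MiB of GPU memory at run time.
transcriber = pipeline("automatic-speech-recognition", model=model_path, device="cuda")

# A list that holds the transcribed text of each chunk
transcribed_texts = []

def transcribe(new_chunk):
    sr, y = new_chunk
    y = y.astype(np.float32)
    y /= np.max(np.abs(y))
    # Transcribe only the current audio chunk instead of the accumulated audio
    transcribed_text = transcriber({"sampling_rate": sr, "raw": y})["text"]
    # Append the transcribed text to the list
    transcribed_texts.append(transcribed_text)
    # Return the transcription so far, joined into a single string
    return " ".join(transcribed_texts)

demo = gr.Interface(
    transcribe,
    gr.Audio(sources=["microphone"], streaming=True),
    # Assumed: plain text output and live mode, as in the streaming demo above
    "text",
    live=True,
)

if __name__ == "__main__":
    # server_name is blank in these notes; set it to the address you want to serve on
    demo.launch(server_name="", server_port=11147)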