modelscope/FunASR

Problems using vad_model, punc_model, and spk_model with streaming audio. How can these three models be loaded correctly in a streaming pipeline?

Opened this issue · 2 comments

Following README.md, I tried loading multiple models for streaming processing and ran into problems.

keyword: fsmn-vad, ct-punc, cam++, is_final

What have you tried?

step1: Copied the code from the "Speech Recognition (Streaming)" section of README.md, where model = AutoModel(model="paraformer-zh-streaming", ...); it runs correctly.
(The full code is identical to README.md and is attached at the end for easy reference.)

step2: Changed the model in the example code to:

model = AutoModel(model="paraformer-zh-streaming",
                  vad_model="fsmn-vad",  
                  punc_model="ct-punc", 
                  spk_model="cam++",
                  )
...
model.generate(input=speech_chunk, cache=cache, is_final=is_final) # keep only these three arguments

Result:

All results are empty (e.g., [{'key': 'rand_key_2yW4Acq9GFz6Y', 'text': '', 'timestamp': []}]).

step3: Changed the model in the example code to:

model = AutoModel(model="paraformer-zh",
                  vad_model="fsmn-vad",  
                  punc_model="ct-punc", 
                  spk_model="cam++",
                  )
...
model.generate(input=speech_chunk, cache=cache, is_final=is_final) # keep only these three arguments

Result:
Compared with step 2, all earlier chunks still return empty results, but the last chunk (where is_final=True) returns the text for that chunk, e.g. [{'key': 'rand_key_2yW4Acq9GFz6Y', 'text': '模型', 'timestamp': ...}]

step4: Building on step 3, changed the call to model.generate(input=speech_chunk, cache=cache, is_final=True)

Result:
Every chunk is recognized, but since is_final=True is always set, this cannot meet the streaming requirements of stitching the conversation together and distinguishing speakers.

What's your environment?

  • OS: Windows 11, not using Docker
  • PyTorch Version (e.g., 2.0.0): 2.5.1
  • How you installed funasr (pip, source): pip
  • Python version: 3.12.3

My Question:

Why does recognition fail when is_final=False? Could someone provide a code example that loads vad_model, punc_model, and spk_model and still supports streaming? Many thanks!
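Not an authoritative answer, but one commonly suggested workaround when the combined AutoModel pipeline stays silent until is_final=True is to decouple streaming VAD from recognition: feed raw chunks to fsmn-vad in streaming mode, buffer the audio, and whenever the VAD closes a segment, pass that segment to an offline model loaded with punc_model and spk_model. A minimal sketch, assuming the streaming fsmn-vad behavior described in the FunASR README (generate() with cache/is_final/chunk_size, returning [beg_ms, end_ms] pairs where -1 marks a still-open boundary); ms_to_samples and run_pipeline are my own helper names, not FunASR API:

```python
# Hypothetical sketch: decouple streaming VAD from offline recognition.
# Assumes fsmn-vad's streaming output format: segments as [beg_ms, end_ms],
# with -1 marking an open boundary ([beg, -1] = started, [-1, end] = ended).

SAMPLE_RATE = 16000

def ms_to_samples(ms: int, sample_rate: int = SAMPLE_RATE) -> int:
    """Convert a VAD timestamp in milliseconds to a sample index."""
    return ms * sample_rate // 1000

def run_pipeline(wav_file: str) -> None:
    # Imports live inside the function so the helper above stays usable
    # without funasr installed.
    import soundfile
    from funasr import AutoModel

    # Streaming VAD sees raw chunks; the offline model (with punc/spk)
    # only sees complete VAD segments.
    vad = AutoModel(model="fsmn-vad")
    asr = AutoModel(model="paraformer-zh",
                    punc_model="ct-punc",
                    spk_model="cam++")

    speech, sr = soundfile.read(wav_file)
    chunk_ms = 200                       # VAD chunk size in ms
    stride = ms_to_samples(chunk_ms, sr)

    cache = {}
    seg_beg = -1                         # sample index where the open segment began
    for i in range(0, len(speech), stride):
        chunk = speech[i:i + stride]
        is_final = i + stride >= len(speech)
        res = vad.generate(input=chunk, cache=cache,
                           is_final=is_final, chunk_size=chunk_ms)
        for beg_ms, end_ms in res[0]["value"]:
            if beg_ms != -1:
                seg_beg = ms_to_samples(beg_ms, sr)
            if end_ms != -1 and seg_beg != -1:
                # Segment closed: recognize it offline with punc + speaker.
                segment = speech[seg_beg:ms_to_samples(end_ms, sr)]
                print(asr.generate(input=segment)[0]["text"])
                seg_beg = -1
```

This trades per-chunk partial hypotheses for sentence-level results with punctuation and speaker labels, which may or may not fit your latency budget.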


The original full code, attached for reference:

from funasr import AutoModel

chunk_size = [0, 10, 5]  # [0, 10, 5] = 600 ms, [0, 8, 4] = 480 ms
encoder_chunk_look_back = 4  # number of chunks to look back for encoder self-attention
decoder_chunk_look_back = 1  # number of encoder chunks to look back for decoder cross-attention

model = AutoModel(model="paraformer-zh-streaming",
                  vad_model="fsmn-vad",  
                  punc_model="ct-punc", 
                  spk_model="cam++",
                  )

import soundfile
import os

wav_file = os.path.join(model.model_path, "example/asr_example.wav")
speech, sample_rate = soundfile.read(wav_file)
chunk_stride = chunk_size[1] * 960  # 600 ms at 16 kHz (960 samples = 60 ms)

cache = {}
total_chunk_num = int((len(speech) - 1) / chunk_stride + 1)
for i in range(total_chunk_num):
    speech_chunk = speech[i*chunk_stride:(i+1)*chunk_stride]
    is_final = i == total_chunk_num - 1
    res = model.generate(input=speech_chunk, cache=cache, is_final=is_final)
    print(res)
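As a side note on the chunking arithmetic in the script above: the constant 960 is 60 ms of 16 kHz audio, so chunk_size[1] = 10 units gives a 600 ms stride, and the total_chunk_num formula is a ceiling division in disguise. A standalone sanity check (no funasr needed; the 5-second clip length is just an illustrative assumption):

```python
# Sanity-check the chunking arithmetic used above (16 kHz audio assumed).
SAMPLE_RATE = 16000
SAMPLES_PER_UNIT = SAMPLE_RATE * 60 // 1000   # one stream unit = 60 ms -> 960 samples

chunk_size = [0, 10, 5]                       # 10 new units per generate() call
chunk_stride = chunk_size[1] * SAMPLES_PER_UNIT

assert SAMPLES_PER_UNIT == 960
assert chunk_stride == 9600                   # 600 ms of audio per chunk

# The script's total_chunk_num formula equals ceil(len / chunk_stride):
n_samples = 5 * SAMPLE_RATE                   # e.g. a 5-second clip
script_formula = int((n_samples - 1) / chunk_stride + 1)
ceil_div = -(-n_samples // chunk_stride)      # ceiling division via negation
assert script_formula == ceil_div == 9        # 80000 samples / 9600 -> 9 chunks
```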
wqzh commented

I have a similar need. If the OP finds a solution, please share it in the comments.