Problems using vad_model, punc_model and spk_model with streaming audio: how can these three models be loaded correctly in a streaming pipeline?
Opened this issue · 2 comments
1113200320 commented
Following readme.md, I tried to load multiple models for streaming processing and ran into problems.
Keywords: fsmn-vad, ct-punc, cam++, is_final
What have you tried?
Step 1: I copied the code from the "Speech Recognition (Streaming)" section of readme.md, which uses model = AutoModel(model="paraformer-zh-streaming"). It runs fine.
(The full code is identical to readme.md; it is attached at the end for easy reading.)
Step 2: I changed the model in the example code to:
model = AutoModel(model="paraformer-zh-streaming",
                  vad_model="fsmn-vad",
                  punc_model="ct-punc",
                  spk_model="cam++",
                  )
...
model.generate(input=speech_chunk, cache=cache, is_final=is_final) # keep only these 3 arguments
Result:
All returns are empty, e.g. [{'key': 'rand_key_2yW4Acq9GFz6Y', 'text': '', 'timestamp': []}].
Step 3: I changed the model in the example code to:
model = AutoModel(model="paraformer-zh",
                  vad_model="fsmn-vad",
                  punc_model="ct-punc",
                  spk_model="cam++",
                  )
...
model.generate(input=speech_chunk, cache=cache, is_final=is_final) # keep only these 3 arguments
Result:
Like step 2, the earlier chunks all return empty, but the last chunk, where is_final=True, returns the text for that chunk only, e.g. [{'key': 'rand_key_2yW4Acq9GFz6Y', 'text': '模型', 'timestamp': ...}].
Step 4: On top of step 3, I changed the call to model.generate(input=speech_chunk, cache=cache, is_final=True).
Result:
Each chunk is now recognized, but because is_final is always True, this cannot satisfy the streaming requirements of stitching the dialogue together and distinguishing speakers.
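One workaround for the punctuation part, instead of attaching punc_model to the streaming recognizer, is to punctuate the accumulated partial text with a standalone ct-punc model. This is a sketch under assumptions: the standalone ct-punc call follows the punctuation-restoration example in the FunASR README, while the accumulate() helper is hypothetical, not a FunASR API.

```python
# Sketch: keep the streaming recognizer unmodified and punctuate the
# accumulated partial text with a standalone ct-punc model.
# accumulate() is an illustrative assumption, not part of FunASR.

def accumulate(partials):
    """Join the non-empty 'text' fields of streaming generate() results."""
    return "".join(p[0]["text"] for p in partials if p and p[0]["text"])

def main():
    # Requires funasr and a model download, so it is not called here.
    from funasr import AutoModel
    punc = AutoModel(model="ct-punc")  # standalone punctuation model, as in the README
    partials = []  # would be filled with generate() results from the streaming loop
    raw_text = accumulate(partials)
    if raw_text:
        print(punc.generate(input=raw_text))
```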
What's your environment?
- OS: Windows 11, not using docker
- PyTorch version (e.g., 2.0.0): 2.5.1
- How you installed funasr (pip, source): pip
- Python version: 3.12.3
My question:
Why does recognition fail when is_final=False? Could someone share a code example that includes vad_model, punc_model and spk_model and supports streaming? Many thanks!
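For the VAD part of the question, one pattern worth trying is to run fsmn-vad as its own streaming model next to the recognizer, rather than attaching it via vad_model. The generate() call below follows the streaming-VAD example in the FunASR README; the vad_events() wrapper around it is a hypothetical sketch.

```python
# Sketch: stream fsmn-vad separately and surface its segment events.
# The generate() signature mirrors the FunASR README's streaming-VAD
# example; vad_events() itself is an illustrative helper (assumption).

def vad_events(vad, speech, sample_rate, chunk_ms=200):
    """Feed `speech` to a streaming VAD model chunk by chunk and yield
    each non-empty result value (segment start/end times in ms)."""
    chunk_stride = int(chunk_ms * sample_rate / 1000)
    total_chunk_num = int((len(speech) - 1) / chunk_stride + 1)
    cache = {}
    for i in range(total_chunk_num):
        chunk = speech[i * chunk_stride:(i + 1) * chunk_stride]
        res = vad.generate(input=chunk, cache=cache,
                           is_final=(i == total_chunk_num - 1),
                           chunk_size=chunk_ms)
        if res and res[0]["value"]:
            yield res[0]["value"]

def main():
    # Requires funasr and a model download, so it is not called here.
    import os
    import soundfile
    from funasr import AutoModel
    vad = AutoModel(model="fsmn-vad")
    wav_file = os.path.join(vad.model_path, "example/vad_example.wav")
    speech, sample_rate = soundfile.read(wav_file)
    for event in vad_events(vad, speech, sample_rate):
        print(event)
```

The yielded values could then drive when the recognizer is finalized.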
Appendix: original full code
from funasr import AutoModel
chunk_size = [0, 10, 5] #[0, 10, 5] 600ms, [0, 8, 4] 480ms
encoder_chunk_look_back = 4 #number of chunks to lookback for encoder self-attention
decoder_chunk_look_back = 1 #number of encoder chunks to lookback for decoder cross-attention
model = AutoModel(model="paraformer-zh-streaming",
                  vad_model="fsmn-vad",
                  punc_model="ct-punc",
                  spk_model="cam++",
                  )
import soundfile
import os
wav_file = os.path.join(model.model_path, "example/asr_example.wav")
speech, sample_rate = soundfile.read(wav_file)
chunk_stride = chunk_size[1] * 960 # 600ms
cache = {}
total_chunk_num = int((len(speech) - 1) / chunk_stride + 1)
for i in range(total_chunk_num):
    speech_chunk = speech[i*chunk_stride:(i+1)*chunk_stride]
    is_final = i == total_chunk_num - 1
    res = model.generate(input=speech_chunk, cache=cache, is_final=is_final)
    print(res)
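A possible two-pass workaround for the overall requirement is sketched below. This is an assumption, not a confirmed FunASR recipe: the two AutoModel configurations come from the README, while split_chunks() and the transcribe() flow are hypothetical. The idea is to keep paraformer-zh-streaming free of extra models for low-latency partials, then re-decode the complete buffered utterance with an offline paraformer-zh pipeline that carries vad_model/punc_model/spk_model, so punctuation and speaker labeling see a whole segment.

```python
# Sketch of a two-pass workaround (assumption, not an official recipe):
# stream partial results with the plain streaming model, then run the
# complete buffered utterance through an offline pipeline that carries
# vad_model/punc_model/spk_model.

def split_chunks(speech, chunk_stride):
    """Split a 1-D waveform into (chunk, is_final) pairs, mirroring the
    chunking loop of the README streaming example."""
    total_chunk_num = int((len(speech) - 1) / chunk_stride + 1)
    for i in range(total_chunk_num):
        chunk = speech[i * chunk_stride:(i + 1) * chunk_stride]
        yield chunk, i == total_chunk_num - 1

def transcribe(speech, streaming_model, offline_model, chunk_stride):
    """First pass: low-latency partial results. Second pass: one offline
    call over the whole utterance for punctuation/speaker labels."""
    cache = {}
    for chunk, is_final in split_chunks(speech, chunk_stride):
        print("partial:", streaming_model.generate(input=chunk, cache=cache,
                                                   is_final=is_final))
    return offline_model.generate(input=speech)

def main():
    # Requires funasr and model downloads, so it is not called here.
    import os
    import soundfile
    from funasr import AutoModel
    streaming_model = AutoModel(model="paraformer-zh-streaming")
    offline_model = AutoModel(model="paraformer-zh", vad_model="fsmn-vad",
                              punc_model="ct-punc", spk_model="cam++")
    wav_file = os.path.join(streaming_model.model_path, "example/asr_example.wav")
    speech, sample_rate = soundfile.read(wav_file)
    print(transcribe(speech, streaming_model, offline_model, chunk_stride=9600))
```

The cost is decoding each utterance twice, but is_final stays False during streaming and the speaker/punctuation requirements are handled by the offline pass.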
wqzh commented
I have a similar need. If the original poster finds a solution, please share it in the comments.
AliceShen122 commented
+1