Real-time speech enhancement is poor
SongJinXue opened this issue · 14 comments
With a block length of 32 ms and a block shift of 8 ms, real-time speech enhancement is poor, but enhancing a single complete audio file works well.
What causes this?
How can I improve it?
Noisy:
Real-time enhancement (block length 32 ms, block shift 8 ms):
Single full-utterance enhancement:
Hi Jinxue, thanks for your attention and feedback.
I guess the main reason for this difference is the missing LSTM hidden states and cell states. If you want frame-wise processing, there are two things to do:
- Besides feeding the feature in frame by frame, changing the `torch.nn.LSTM` class to the `torch.nn.LSTMCell` class is an essential step. Using a for-loop, feed the hidden states and cell states of the previous step into the current step (a minimal sketch follows this list).
- In addition, you need to modify the normalization method to support frame-wise mode. Specifically, first calculate a mean value for each frame. Then, referring to `cumulative_laplace_norm`, use the previous mean values to smooth the current mean value and normalize the current frame feature.
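A minimal sketch of the frame-wise loop from the first point, assuming arbitrary feature sizes (the real FullSubNet sequence model has its own dimensions and stacked layers):

```python
import torch
import torch.nn as nn

# Illustrative sizes only; they do not match the actual FullSubNet configuration.
batch_size, num_frames, input_size, hidden_size = 1, 100, 257, 384

cell = nn.LSTMCell(input_size, hidden_size)
features = torch.randn(batch_size, num_frames, input_size)

# Hidden and cell states are carried from one frame to the next.
h = torch.zeros(batch_size, hidden_size)
c = torch.zeros(batch_size, hidden_size)

outputs = []
for t in range(num_frames):
    # Feed the previous step's (h, c) into the current step.
    h, c = cell(features[:, t, :], (h, c))
    outputs.append(h)

outputs = torch.stack(outputs, dim=1)  # (batch_size, num_frames, hidden_size)
```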
Note that changing from the `torch.nn.LSTM` class to `torch.nn.LSTMCell` does not cause a performance reduction, as the former is just an encapsulation of the latter. In addition, I've tested the different normalization methods; at least, the performance of `cumulative_laplace_norm` and `offline_laplace_norm` (currently used) is nearly equal. There is another normalization method, named `forgetting_norm`, which updates the mean value of the current frame using only the feature context in a window of fixed size. So it may be more suitable for real scenarios, but the performance will be slightly worse.
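To illustrate the difference between the two kinds of statistics, here is a rough sketch of a cumulative per-frame mean (in the spirit of `cumulative_laplace_norm`) versus an exponentially smoothed mean that approximates a fixed-size window. This is not the repository code, and the smoothing constant is an arbitrary assumption:

```python
import torch

def cumulative_mean_per_frame(mag: torch.Tensor) -> torch.Tensor:
    """mag: (batch, num_freqs, num_frames). Returns, for each frame, the mean
    over all frequency bins of all frames seen so far."""
    batch_size, num_freqs, num_frames = mag.shape
    step_sum = mag.sum(dim=1)                                   # (batch, num_frames)
    entry_count = torch.arange(1, num_frames + 1) * num_freqs   # bins seen up to each frame
    return torch.cumsum(step_sum, dim=-1) / entry_count

def forgetting_mean_per_frame(mag: torch.Tensor, alpha: float = 0.992) -> torch.Tensor:
    """Exponentially smoothed per-frame mean, a simple stand-in for a fixed-size
    window: old frames gradually stop influencing the statistics."""
    batch_size, num_freqs, num_frames = mag.shape
    mu = torch.zeros(batch_size)
    means = []
    for t in range(num_frames):
        mu = alpha * mu + (1 - alpha) * mag[:, :, t].mean(dim=1)
        means.append(mu)
    return torch.stack(means, dim=-1)                           # (batch, num_frames)
```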
After these changes, should we retrain the model? I'm using the pretrained model. Thanks for the help.
Hi, this week I will release a cumulative pre-trained model.
Thanks. Can you push a code snippet for real-time frame-wise processing?
Hi, @Spelchure.
Q: After these changes, should we retrain the model? I'm using the pretrained model.
A: Here is a pre-trained FullSubNet using cumulative normalization. Its performance is very close to that of the FullSubNet using offline normalization.
Q: Can you push a code snippet for real-time frame-wise processing?
A: Sorry, I don't have enough time at the moment, but I will release the frame-wise processing code next month. Before that, you could try writing it yourself. After downloading the FullSubNet checkpoint that uses cumulative normalization, the two things you need to do are changing `torch.nn.LSTM` to `torch.nn.LSTMCell` and adding a for-loop.
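For reference, loading the released checkpoint into an already constructed model could look roughly like the sketch below. The checkpoint file name comes from the release page, but the layout of the saved dictionary (a possible "model" key) is an assumption; inspect the loaded object if it differs:

```python
import torch

def load_cumulative_checkpoint(model: torch.nn.Module, path: str) -> torch.nn.Module:
    """Load the cumulative-normalization checkpoint (e.g.
    cum_fullsubnet_best_model_218epochs.tar) into a model built with the same
    hyperparameters as training. The "model" key is an assumption."""
    checkpoint = torch.load(path, map_location="cpu")
    if isinstance(checkpoint, dict) and "model" in checkpoint:
        state_dict = checkpoint["model"]
    else:
        state_dict = checkpoint
    model.load_state_dict(state_dict)
    return model.eval()
```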
Thanks for the model and the advice.
I can't use the pretrained cumulative model after changing LSTM to LSTMCell for frame-wise processing. It raises an error about missing arguments and unexpected arguments in the model. Is it possible to use the cumulative model for inference only, without training? If so, where am I going wrong? (I'm changing LSTM to LSTMCell in sequence_model.py.)
I tested LSTM and LSTMCell; it did not help. Then I tried feeding the hidden states and cell states of the previous step into the current step, which works well.
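For anyone hitting the state_dict mismatch above: one way to keep the pretrained weights untouched is to keep `torch.nn.LSTM` and simply call it one frame at a time, carrying its hidden and cell states yourself. A minimal sketch with arbitrary sizes:

```python
import torch
import torch.nn as nn

# Illustrative sizes only; the real model's dimensions differ.
input_size, hidden_size, num_layers = 257, 384, 2
batch_size, num_frames = 1, 100

lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
features = torch.randn(batch_size, num_frames, input_size)

h = torch.zeros(num_layers, batch_size, hidden_size)
c = torch.zeros(num_layers, batch_size, hidden_size)

outputs = []
for t in range(num_frames):
    frame = features[:, t:t + 1, :]        # keep the time dimension: (batch, 1, input_size)
    out, (h, c) = lstm(frame, (h, c))      # reuse the previous step's states
    outputs.append(out)

outputs = torch.cat(outputs, dim=1)        # matches lstm(features) up to numerical noise
```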
Thanks for your advice.
I fed the hidden states and cell states of the previous step into the current step and modified the normalization method following `cumulative_laplace_norm`. The real-time speech enhancement now works as expected, but the performance is slightly worse. Looking forward to your frame-wise processing code.
Hi Jinxue,
Generally speaking, changing from `torch.nn.LSTM` to `torch.nn.LSTMCell` should not cause any performance degradation. Some small things to pay attention to:
- Make sure that you are using the new pre-trained cumulative version of FullSubNet, i.e., `cum_fullsubnet_best_model_218epochs.tar` on the release page.
- As you can see, for performance reasons the cumulative norm that I released is written in a compact style, i.e., it computes the statistical mean values of all frames of an utterance in advance. You should separate this function into a frame-wise style; the point is basically to normalize the current frame with the statistical mean value of all previous frames (see the sketch after this comment).
You could confirm these small things first, and if you have any further questions, please contact me. Of course, if the problem still exists, you are very welcome to contribute your frame-wise code directly to this project on GitHub.
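A possible frame-wise separation of the cumulative normalization, written as a small stateful helper so the statistics persist across incoming frames (shapes and the epsilon value are assumptions, not the repository's exact values):

```python
import torch

class StreamingCumulativeNorm:
    """Normalize each incoming frame with the mean of all frames seen so far,
    i.e. a frame-wise rewrite of the compact cumulative normalization."""

    def __init__(self, num_freqs: int, eps: float = 1e-10):
        self.num_freqs = num_freqs
        self.eps = eps
        self.running_sum = None   # sum over all bins of all frames seen so far
        self.frames_seen = 0

    def __call__(self, frame: torch.Tensor) -> torch.Tensor:
        # frame: (batch, num_freqs) magnitude features of the current frame
        if self.running_sum is None:
            self.running_sum = torch.zeros(frame.shape[0], 1, dtype=frame.dtype)
        self.running_sum = self.running_sum + frame.sum(dim=-1, keepdim=True)
        self.frames_seen += 1
        cumulative_mean = self.running_sum / (self.frames_seen * self.num_freqs)
        return frame / (cumulative_mean + self.eps)

# Usage: create one instance per utterance/stream and call it frame by frame.
norm = StreamingCumulativeNorm(num_freqs=257)
for _ in range(5):
    normalized = norm(torch.rand(1, 257))
```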
Hi SongJinXue, can you share the real-time code? I would really appreciate it. Many thanks for considering my request.
I have also tried this part, and there is no difference between LSTM and LSTMCell here, but the result of frame-by-frame processing is still unsatisfactory. Could you please provide the implementation of this part? Thank you.
Hello author, could you please share the revised code for this streaming-inference part? Thank you very much.