sp-uhh/storm

Stereo audio

adeelabbas opened this issue · 1 comments

Hi,
Does the tool work with stereo audio input? Do you know what changes would be needed to support it?
Adeel

Hi,
You can provide multichannel inputs to the diffusion model as a batch. This treats each channel independently: the inter-channel magnitude/phase difference information will therefore not be exploited, and there is no guarantee that these differences will be preserved by the processing.
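As a rough sketch of the batching idea above (not the repository's actual API), the channel axis of a stereo signal can be passed as the batch axis of a single-channel enhancer. Here `enhance_batch` is a hypothetical placeholder for the model call, and `interchannel_level_difference_db` is an illustrative helper for checking how much the left/right level difference changes after processing:

```python
import numpy as np


def enhance_batch(batch: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a single-channel enhancement model.

    Takes an array of shape (batch, samples) and returns the same shape.
    This placeholder returns a copy; a real model call would go here.
    """
    return batch.copy()


def enhance_multichannel(audio: np.ndarray, enhance_fn=enhance_batch) -> np.ndarray:
    """Enhance each channel independently by using channels as the batch axis.

    audio: array of shape (channels, samples), or (samples,) for mono.
    """
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim == 1:
        audio = audio[np.newaxis, :]  # promote mono to (1, samples)
    enhanced = enhance_fn(audio)  # each channel is one batch item
    if enhanced.shape != audio.shape:
        raise ValueError("enhancer must preserve the (channels, samples) shape")
    return enhanced


def interchannel_level_difference_db(stereo: np.ndarray, eps: float = 1e-12) -> float:
    """RMS level difference between channel 0 and channel 1, in dB."""
    rms_left = np.sqrt(np.mean(stereo[0] ** 2) + eps)
    rms_right = np.sqrt(np.mean(stereo[1] ** 2) + eps)
    return float(20.0 * np.log10(rms_left / rms_right))


# Toy stereo signal: the right channel is an attenuated copy of the left.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
left = np.sin(2 * np.pi * 440.0 * t)
stereo_in = np.stack([left, 0.5 * left]).astype(np.float32)

stereo_out = enhance_multichannel(stereo_in)
ild_in = interchannel_level_difference_db(stereo_in)
ild_out = interchannel_level_difference_db(stereo_out)
```

Comparing `ild_in` and `ild_out` on real recordings gives a quick indication of whether independent per-channel processing has altered the stereo image; a similar check could be done for phase differences.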
For your information, however, we investigated diffusion models that leverage multi-channel information and can share a few insights, some of which confirm the results of Tesch et al., "Insights Into Deep Non-linear Filters for Improved Multi-channel Speech Enhancement", TASL 2022, and Tesch et al., "Nonlinear Spatial Filtering in Multichannel Speech Enhancement", TASL 2021:

  • For denoising, the performance depends entirely on the noise signals. If there is enough spectro-temporal independence between the target speech and the noise, as with e.g. non-stationary background car noise, then leveraging the spectral information is sufficient: using multi-channel signals will not help significantly, or the spatial processing can be performed separately from the single-channel spectral processing. If, however, the noise shares a significant amount of information with the target, as with e.g. babble speech, then multi-channel information helps separate the two signals, and we saw that multi-channel diffusion models were useful in that case. However, they did not perform better than the discriminative models used in Tesch et al., so we did not report further on that.

  • For dereverberation, it seems that the information the current framework leverages to reconstruct the data is mostly contained in the time-frequency spectrogram and not so much in the spatial distribution. As a result, single-channel dereverberation with diffusion models performed slightly better than the multi-channel models.

Of course, these findings should be taken with caution: the extension we proposed for multi-channel processing was very naive and could probably be improved. Given the preliminary results, we did not pursue that direction further.