yluo42/TAC

Use FasNet as Frontend

grid-searcher opened this issue · 2 comments

Hi,

I'm trying to use FasNet as a frontend to denoise and dereverberate the audio at the same time. I noticed that in create_dataset.py you save spk1_echoic_sig and spk2_echoic_sig as the labels, which I think refers to "Reverberant clean" in your paper (please correct me if I'm wrong). What should I do if I want to do dereverberation (like "Clean source" or "Mel-spectrogram" in your paper)?

To be more specific,

  1. Do I need to normalize or rescale the original audio? (I find that the energy of the original audio is much larger than that of the echoic one; the SI-SDR would even be positive when the original audio is given as input.)
  2. What should I do to follow the shift-invariant training?
  3. If I want to learn the Mel-spectrogram, what is the input: audio or mel-spectrogram? Do I need to flatten the last dimension to be compatible with the current loss function? (Since there will be one more dimension storing the mel-spectrogram features.)

Hi,

  1. If your target for dereverberation is the direct path signal, you can generate it by truncating the RIR filters to only contain the ±5ms (or other ranges) of the first peak.
  2. The energy of the original signal is larger than that of the echoic one because the RIR filters have pretty small energy. If you don't care about the output energy (and use SI-SDR as the training objective), you can do whatever you want to the input and the target. If you want to preserve the energy of the original signal, it would be good to keep them unchanged and use energy-sensitive objectives (like MSE or SNR).
  3. For shift-invariant training, you can simply shift your target by a certain number of samples (e.g., ±50) with proper zero padding, calculate the SI-SDR loss on all the shifted targets, and select the one with the largest SI-SDR (smallest negative SI-SDR loss) for backpropagation.
  4. I don't know what the oracle input feature would be when you set the Mel-spectrogram as the target, but one simple thing you can try is to replace the 1-D convolutional encoder with the Mel-spectrograms of all the channels. You might also need to revisit the NCC feature, as I don't know whether it will give you good performance with Mel-spectrogram features. Regarding the loss function, you can simply use MSE as many other works do, or use its scale-invariant version by normalizing the power of the entire utterance for both the estimate and the target.
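For point 1, a minimal sketch of the RIR truncation could look like the following. It assumes the RIR is a 1-D numpy array and that the first peak is simply the sample with the largest absolute amplitude; `direct_path_rir` and `window_ms` are illustrative names, not part of this repo:

```python
import numpy as np

def direct_path_rir(rir, sr, window_ms=5.0):
    """Keep only +/-window_ms around the main peak of a 1-D RIR.

    Convolving the dry source with the truncated filter gives a
    direct-path target for dereverberation.
    """
    peak = int(np.argmax(np.abs(rir)))      # assumed location of the first peak
    half = int(sr * window_ms / 1000.0)     # window half-width in samples
    start = max(0, peak - half)
    end = min(len(rir), peak + half + 1)
    out = np.zeros_like(rir)
    out[start:end] = rir[start:end]         # zero out all later reflections
    return out
```

The ±5 ms window is the value mentioned above; other ranges can be substituted via `window_ms`.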
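The shift-invariant selection in point 3 can be sketched as below. This is an illustrative numpy version of the math only (a training loss would use the same logic on autograd tensors); the function names are my own, and the SI-SDR definition here zero-means both signals before projection:

```python
import numpy as np

def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant SDR in dB between two 1-D signals."""
    est = est - est.mean()
    ref = ref - ref.mean()
    proj = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref  # target component
    noise = est - proj                                          # residual
    return 10 * np.log10((np.sum(proj ** 2) + eps) / (np.sum(noise ** 2) + eps))

def best_shifted_si_sdr(est, target, max_shift=50):
    """Zero-pad the target, slide it by -max_shift..+max_shift samples,
    and return the largest SI-SDR over all shifts."""
    n = len(target)
    padded = np.pad(target, max_shift)
    return max(si_sdr(est, padded[s:s + n]) for s in range(2 * max_shift + 1))
```

The training loss would then be the negative of `best_shifted_si_sdr`, so that backpropagation goes through the best-aligned target.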
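For the scale-invariant MSE in point 4, a minimal sketch (my own naming, assuming the estimate and target are arrays of the same shape, e.g. frames x mel bins) could be:

```python
import numpy as np

def scale_invariant_mse(est, target, eps=1e-8):
    """MSE after normalizing both tensors to unit power over the utterance."""
    est = est / (np.sqrt(np.mean(est ** 2)) + eps)
    target = target / (np.sqrt(np.mean(target ** 2)) + eps)
    return np.mean((est - target) ** 2)
```

Since MSE is computed element-wise, the extra mel-feature dimension from question 3 should not require flattening for this loss.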

Closing this issue. Feel free to reopen it if you have more questions.