Applying the model for bandwidth extension (not super-resolution)
eloimoliner opened this issue · 12 comments
I am trying to train your model for bandwidth extension, where the input observations are lowpassed with a given filter at a given cutoff frequency. This differs from the super-resolution or upsampling scenario, where the lowpass filter is an antialiasing filter whose cutoff frequency is around the Nyquist frequency of the low-resolution signal.
I understand that the spectral upsampling you propose serves as a strong regularization strategy to prevent filter overfitting. I am wondering whether your method could still be considered as a baseline in my case, given that spectral upsampling is not applicable. You mention in the paper that you noticed some artifacts when this trick was not used, but does the model still work despite that? Or does it somehow affect the GAN training stability?
I'm now training a couple of models to bandwidth-extend some piano music (MAESTRO) from versions lowpassed at 1 kHz and 3 kHz. I'm training with a single lowpass filter, expecting the model to overfit to it. So far it kind of works, but the quality is still not great. I have only trained for half a day, so I'd better wait.
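For readers who want to reproduce this kind of input, here is a minimal sketch of lowpassing a signal at a fixed cutoff. It is not the filter used in this thread: the windowed-sinc design, the Hamming window, the 255 taps and the helper names `lowpass_fir` and `lowpass` are all assumptions made for illustration.

```python
import numpy as np

def lowpass_fir(cutoff_hz, sr, num_taps=255):
    """Windowed-sinc lowpass FIR (Hamming window), normalized to unity gain at DC."""
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = np.sinc(2 * cutoff_hz / sr * n) * np.hamming(num_taps)
    return h / h.sum()

def lowpass(x, cutoff_hz, sr, num_taps=255):
    """Filter x and keep the output aligned with (and as long as) the input."""
    return np.convolve(x, lowpass_fir(cutoff_hz, sr, num_taps), mode="same")

# Toy signal: one tone below the cutoff (440 Hz) and one above it (5 kHz).
sr = 22050
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t) + np.sin(2 * np.pi * 5000 * t)
y = lowpass(x, cutoff_hz=1000, sr=sr)  # sample rate is unchanged, only the band is cut
```

Unlike the super-resolution setup, the output keeps the original sample rate; only the content above the cutoff is removed.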
Hi,
Thanks for trying out the model, please keep me updated on the progress!
My model will probably still work without the upsampling method; you can disable it by changing 1-2 lines in the configuration file. In some experiments I've tried, the model succeeded in removing the artifact at the verge of the Nyquist rate, though it may have taken more training epochs. I'm not sure, but I think that spectral upsampling may still prevent some artifacts at the verge of the Nyquist rate, even if some frequencies under this threshold are missing.
I got it working, here are some results extending some inputs lowpassed at 1kHz. I think this table was produced after 125k iterations.
I'm not noticing this artifact at the Nyquist rate, but I am testing with the same lowpass filter as in training. Is it possible that this artifact is a product of a mismatch between the antialiasing filters used during training and testing?
On a larger test set, this model gets good results in terms of LSD, but not so good in terms of Frechet Distance (compared to other models I've trained). I wonder why this may happen.
Not sure I understand, are you using the same lowpass filter or a different one?
In my experience, the artifact is sensitive to cross-domain adaptation, meaning that it is more present when inferring on samples that are significantly different from the ones the model was trained on.
I'm familiar with Frechet Audio Distance, but have not really delved into its details. Note that LSD focuses on the quality of the magnitude spectrogram, which means it may not represent the true quality of the samples. For example, it does not take into account the phase information of the sample.
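To make the point about phase concrete, here is a rough sketch of a log-spectral distance. LSD definitions vary between papers (log base, power versus magnitude, framing), so the STFT settings and the helper names `stft_mag` and `lsd` below are assumptions, not the evaluation code used in this thread.

```python
import numpy as np

def stft_mag(x, n_fft=2048, hop=512):
    """Magnitude STFT using a Hann window and no padding (frames x bins)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def lsd(reference, estimate, eps=1e-10):
    """Per-frame RMS difference of log10 power spectra, averaged over frames."""
    ref = np.log10(stft_mag(reference) ** 2 + eps)
    est = np.log10(stft_mag(estimate) ** 2 + eps)
    return float(np.mean(np.sqrt(np.mean((ref - est) ** 2, axis=1))))

rng = np.random.default_rng(0)
x = rng.standard_normal(22050)
other = rng.standard_normal(22050)

print(lsd(x, -x))     # polarity flip: every phase changes, magnitudes do not
print(lsd(x, other))  # different content: clearly non-zero
```

Because only magnitudes enter the computation, any change that leaves the magnitude spectrogram intact is invisible to this metric.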
The results you show sound really great! Thank you for sharing.
I'm using the same filter, so the model only performs that well for this particular filter. That is not a realistic case, but it is a nice baseline anyway.
I haven't tried testing on anything outside MAESTRO, but it is interesting that you mainly noticed this problem in out-of-domain samples.
@eloimoliner, is it possible that your model isn't able to generalise because you only used one type of low pass filter for training?
Consider adding the same songs that you previously used in your dataset but with different filters (lr), to then train against the ground truths (hr).
Indeed, the model can't generalize, so it is useless in practice. Nevertheless, it is an interesting baseline.
> I got it working, here are some results extending some inputs lowpassed at 1kHz. I think this table was produced after 125k iterations.
> I'm not noticing this artifact at the Nyquist rate. But I am testing using the same lowpass filter than training. Is it possible that this artifact is a product of a mismatch between the antialiasing filter used during training and testing?
> In a larger test set, using this model I get good results in terms of LSD, but not so good in terms of Frechet Distance (compared to other models I've trained), I wonder why this may happen.
I am not entirely sure what low-pass filter you are referring to. If you want to programmatically apply effects to your audio dataset, I recommend using a tool such as SoX.
> indeed, the model can't generalize. So it is useless in practice.but, nevertheless, it is an interesting baseline.
This project is not "useless". The pretrained models are great for enhancing audio that was originally recorded at a low sample rate or was accidentally downsampled. The "In the spectral domain" part of "Audio Super Resolution in the Spectral Domain" implies that the super resolution is performed in the frequency domain, not the time domain. That is why Aero is considered a bandwidth extension model.
You'll notice in the examples of the project page that the lr and hr both have the same sample rate.
The model can generalize if it is trained properly. To capture frequencies above 11025 Hz, consider training at a higher sampling rate, such as 44100 Hz instead of 22050 Hz. Resample your existing hr, lr, and pr audio tracks from 22050 Hz to 44100 Hz (using resampy or Audacity). Then, use them as your new lr dataset and use the original MAESTRO uncompressed audio as your new hr dataset.
Although the highest fundamental frequency on an 88-key piano is 4186 Hz, it is important to consider that piano strings have harmonic and inharmonic overtones which create frequencies above the fundamental:
As you can see, the original MAESTRO dataset contains overtones above 11025 Hz.
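The arithmetic behind the overtone argument can be sketched as follows. The equal-temperament formula with A4 = 440 Hz and the stiff-string partial formula f_n = n * f0 * sqrt(1 + B * n^2) are standard, but the helper names and the inharmonicity value `B = 0.0004` are hypothetical and chosen only for illustration.

```python
def key_frequency(key, a4_hz=440.0):
    """Fundamental of an 88-key piano key in equal temperament (key 49 = A4)."""
    return a4_hz * 2 ** ((key - 49) / 12)

def partials(f0, count, inharmonicity=0.0):
    """First `count` partials of a stiff string: n * f0 * sqrt(1 + B * n^2)."""
    return [n * f0 * (1 + inharmonicity * n ** 2) ** 0.5 for n in range(1, count + 1)]

nyquist = 22050 / 2  # 11025 Hz

c8 = key_frequency(88)                    # highest key, about 4186 Hz
c8_partials = partials(c8, 5)             # ideal harmonic series (B = 0)
c8_above = [f for f in c8_partials if f > nyquist]

a4_harmonics = partials(key_frequency(49), 30)
a4_first_above = next(n for n, f in enumerate(a4_harmonics, start=1) if f > nyquist)

# With a hypothetical inharmonicity coefficient, partials sit slightly sharp.
c8_stiff = partials(c8, 5, inharmonicity=0.0004)

print(round(c8, 2), [round(f) for f in c8_above], a4_first_above)
```

Even with an ideal harmonic series, the third partial of the top key already lies above 11025 Hz, and for A4 the 26th harmonic is the first to cross it. How much energy those partials actually carry in a recording is a separate question that this sketch does not address.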
Not to mention the added complexity that pianos present, since all but the lowest notes of a piano have multiple strings tuned to the same frequency.
These overtones can cause unwanted artifacts, such as aliasing, if frequencies above the Nyquist frequency are still present when the signal is sampled or downsampled.
This is why the Nyquist frequency is also referred to as the "folding frequency": components above it fold back to lower frequencies.
The Nyquist frequency is half the sampling rate. A sinusoid above it produces the same samples as a lower-frequency sinusoid, as the panels of the following animation illustrate:
- Upper left: Animation depicts a sequence of sinusoids, each with a higher frequency than the previous ones. These "true" signals are also being sampled (blue dots) at a constant frequency/rate.
- Upper right: The continuous Fourier transform of the sinusoid (not the samples). The single non-zero component, depicting the actual frequency, means there is no ambiguity.
- Lower right: The discrete Fourier transform of just the available samples. The presence of two components means the samples can fit at least two different sinusoids, one of which is the true frequency (upper-right).
- Lower left: Using the same samples (now in red), the default reconstruction algorithm (i.e. in the absence of collateral information) produces the lower-frequency sinusoid.
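The ambiguity described in the panels above can be checked numerically. This is a small sketch with arbitrarily chosen frequencies (`f_true` and the sample count are assumptions): a cosine above the Nyquist frequency and its folded counterpart at `fs - f_true` yield the same samples.

```python
import numpy as np

fs = 22050.0            # sampling rate, so the Nyquist frequency is 11025 Hz
f_true = 15000.0        # "true" sinusoid above the Nyquist frequency
f_alias = fs - f_true   # folded frequency (7050 Hz), below the Nyquist frequency

n = np.arange(64)       # sample indices
samples_true = np.cos(2 * np.pi * f_true * n / fs)
samples_alias = np.cos(2 * np.pi * f_alias * n / fs)

# The two sets of samples agree to within floating-point error, so the samples
# alone cannot tell the two sinusoids apart.
same = np.allclose(samples_true, samples_alias)
print(same)
```

This is why content above the Nyquist frequency has to be removed before sampling or downsampling: once the samples are taken, the folded component cannot be separated from genuine content at that lower frequency.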
You might also want to explore the MSG post-processor and read their paper. They mention that sampling at 44100 Hz not only enhances/cleans the audio, but would also perform bandwidth expansion:
> The input audio was peak normalized before passing it through the network. Since Wavenet operates at 16 kHz we use this sample rate. We downsampled all systems to 16 kHz so that there was a uniform sample rate across all separation models. Here, we focus solely on enhancement and leave the task of bandwidth extension for future work.
Both AERO and MSG use Demucs at their core, but AERO uses a newer/better version of Demucs (hdemucs: Hybrid Demucs v3) than MSG (Demucs v2).
@m-mandel, feel free to correct me if anything I wrote is inaccurate.
@eloimoliner, could you share your wandb training log? I would also like to train on the MAESTRO dataset.
Don't misunderstand me: what I meant is that the models I trained are useless, not the models trained by the authors.
Since I trained them with only one filter, the model just overfits to it and learns to invert the shape of that single filter. If you then want to upsample a signal that was processed with a different antialiasing filter, it would not work nearly as well. A solution would be to randomize the lowpass filters used during training, as a data augmentation strategy. I didn't do that in order not to bias the results.
The reason I used fs = 22050 Hz is just to save some computation, as there is not much going on above 11025 Hz in MAESTRO.
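A rough sketch of what randomizing the lowpass filter per training example could look like is given below. Nothing here comes from the AERO code base or from the experiments in this thread: the function name `random_lowpass`, the cutoff range, the tap counts and the Kaiser window are all assumptions chosen for illustration.

```python
import numpy as np

def random_lowpass(x, sr, rng, cutoff_range=(1000.0, 3000.0),
                   taps_choices=(63, 127, 255), beta_range=(4.0, 10.0)):
    """Lowpass x with a randomly drawn windowed-sinc FIR filter.

    Cutoff, filter length and Kaiser window shape are all sampled, so the
    model never sees one fixed filter response during training.
    """
    cutoff = rng.uniform(*cutoff_range)
    num_taps = int(rng.choice(taps_choices))
    beta = rng.uniform(*beta_range)
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = np.sinc(2 * cutoff / sr * n) * np.kaiser(num_taps, beta)
    h /= h.sum()  # unity gain at DC
    return np.convolve(x, h, mode="same"), cutoff

rng = np.random.default_rng(0)
hr = rng.standard_normal(22050)               # stand-in for a high-resolution clip
lr, cutoff = random_lowpass(hr, 22050, rng)   # degraded input for this training step
```

Drawing a fresh filter for every example is the augmentation idea described above; the trade-off mentioned in the thread still applies, since results are then no longer tied to one known filter.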
I see, thank you for clarifying.
You are correct, it would be wise to augment your dataset by applying different filters, introducing additional noise or adding instruments/singers that are playing in harmony with the piano (for the LR). You could also try time-stretching & pitch-shifting.
I might try using the midi in my Digital Audio Workstation and using various Piano Instruments in NI Kontakt to also account for timbre differences between pianos. I once saw/heard a vocal enhancement model that was only trained on a specific accent from the VCTK dataset introduce that language's accent to any vocals that it enhanced.
Depending on how training goes, I might try training a general "Keys" model to include electronic pianos instead of only acoustic pianos.
From what I have noticed, polyphonic instruments are their own beast; models for monophonic instruments tend to perform better.
I looked at the spectrograms of some of the MAESTRO tracks and there are overtones above 11025 Hz. Please consider sampling at the original 44100 Hz so that your model can be used on regular audio and other datasets, like musdb18-hq. Like you said, you want bandwidth expansion, not super resolution.
Are you using cross-validation?
Could you share your training log?
Hi,
Sorry for the late response. Here you can see all the runs; I hope this is what you mean by training log. Be aware that I changed some details in the code, especially in the data pipeline.
https://wandb.ai/eloimoliner/Spectral%20Bandwidth%20Extension/runs/f2wkai18/overview?workspace=user-eloimoliner
I did not use cross-validation, as objective metrics in bandwidth extension are not very informative anyway. I just trained on the entire MAESTRO train split.
Training on synthetic data like using VST synthesizers is certainly a very interesting idea. Good luck with it!
Thank you for the enhancing (no pun intended) discussion!