AURAL (Advanced Understanding and Recognition of Audio Logic)

🎧 From Noise to Harmony: An AI Journey in Audio Enhancement 🚀

Abstract 📜

In the rapidly evolving landscape of digital audio, one hurdle consistently surfaces: enhancing audio recordings captured with low-quality devices such as phone microphones. This paper addresses that challenge by presenting a novel deep learning model that transforms these low-quality recordings to simulate the high-quality acoustics of a professional recording studio 🎤.

Our model is trained on an extensive dataset of parallel recordings, each comprising the same performance captured with both a professional studio microphone and a low-quality phone microphone. Our objective is to democratize high-quality sound recording, putting it within reach of amateur musicians, podcasters, and everyone in between.

The results from our study indicate significant improvements in audio quality, effectively bridging the gap between professional studio recordings and recordings made using everyday devices. This stride in audio quality enhancement paves the way for future research in this area.

Keywords: Audio Quality Enhancement, Deep Learning, Audio Recordings, U-Net Architecture, Generative Adversarial Network

Table of Contents


  1. Introduction
  2. Related Work
  3. Methodology
  4. Experiments and Results
  5. Discussion
  6. Conclusion and Future Work
  7. References

Introduction 🚪

Acquiring high-quality audio recordings is conventionally a daunting task, necessitating professional equipment and acoustically treated environments. This creates a significant barrier for aspiring musicians and podcasters. To shatter this barrier, we introduce a method to upgrade low-quality audio to studio-like quality using deep learning. Our technique bridges the gap between phone-recorded audio and studio-recorded sound, making high-quality audio more accessible.


Related Work 📚

Audio quality enhancement is a vast and mature field of research. Traditional audio processing techniques have focused on noise reduction strategies, such as spectral subtraction and Wiener filtering [Boll, S. F. (1979), Lim, J. S., & Oppenheim, A. V. (1979)]. The incorporation of psychoacoustic models helped prioritize the preservation of certain sounds during denoising, reflecting the nuanced human perception of sound [Zwicker, E., & Fastl, H. (2013)].

However, the dawn of deep learning kindled a fresh direction in audio processing. Researchers have experimented with CNNs, RNNs, and more recently, Transformer-based models for a variety of audio tasks, including sound classification, source separation, and denoising [Pascual, S., Bonafonte, A., & Serrà, J. (2017)]. Notably, U-Net architectures have shown promise in audio source separation tasks, excelling in the preservation of detailed features [Jansson, A., Humphrey, E., Montecchio, N., Bittner, R., Kumar, A., & Weyde, T. (2017)].

The application of Generative Adversarial Networks (GANs) in the audio domain, specifically WaveGAN, is a testament to the potential of GANs for generating raw audio waveforms [Donahue, C., McAuley, J., & Puckette, M. (2019)]. Similarly, Transformer-based models have been effectively employed in automatic music generation and instrument recognition tasks [Huang, C. Z. A., Vaswani, A., Uszkoreit, J., Simon, I., Hawthorne, C., Shazeer, N., ... & Chen, D. (2018)].

Our research builds on these deep learning techniques by offering an approach that transforms low-quality audio into high-quality audio. Our methodology blends a U-Net based autoencoder with a GAN framework, ensuring the generated high-quality audio is convincingly realistic. To our knowledge, this combination of techniques has not previously been explored for audio quality enhancement.


🧪 Methodology

Our innovative approach to transforming low-quality audio recordings into high-quality ones involves several key strategies. We construct a dual-stage autoencoder with a U-Net architecture. This model is trained on an extensive dataset of paired low-quality and high-quality audio recordings. The first stage of the autoencoder acts as a denoising autoencoder, targeting the removal of noise and distortion from the low-quality input. The second stage is a generative autoencoder designed to imbue the denoised audio with the characteristics of high-quality studio-recorded audio.
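
As a concrete illustration, the sketch below shows one way such a dual-stage model could be wired together in Keras, with the first U-Net acting as the denoiser and the second as the generative stage. The layer depths, filter counts, and the (128, 256, 1) mel-spectrogram shape are illustrative assumptions, not the actual AURAL configuration.

```python
# Minimal sketch of the dual-stage U-Net autoencoder described above.
# Shapes, depths, and filter counts are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_unet(input_shape=(128, 256, 1), base_filters=32, name="unet"):
    inputs = layers.Input(shape=input_shape)

    # Encoder: two downsampling blocks whose outputs feed skip connections.
    c1 = layers.Conv2D(base_filters, 3, padding="same", activation="relu")(inputs)
    p1 = layers.MaxPooling2D(2)(c1)
    c2 = layers.Conv2D(base_filters * 2, 3, padding="same", activation="relu")(p1)
    p2 = layers.MaxPooling2D(2)(c2)

    # Bottleneck.
    b = layers.Conv2D(base_filters * 4, 3, padding="same", activation="relu")(p2)

    # Decoder: upsample and concatenate the matching encoder features.
    u2 = layers.Conv2DTranspose(base_filters * 2, 3, strides=2, padding="same")(b)
    u2 = layers.concatenate([u2, c2])
    u2 = layers.Conv2D(base_filters * 2, 3, padding="same", activation="relu")(u2)
    u1 = layers.Conv2DTranspose(base_filters, 3, strides=2, padding="same")(u2)
    u1 = layers.concatenate([u1, c1])
    u1 = layers.Conv2D(base_filters, 3, padding="same", activation="relu")(u1)

    outputs = layers.Conv2D(1, 1, padding="same")(u1)  # output spectrogram
    return Model(inputs, outputs, name=name)

# Stage 1 denoises the low-quality spectrogram; stage 2 maps the denoised
# spectrogram toward studio-quality characteristics.
denoiser = build_unet(name="denoising_unet")
generator = build_unet(name="generative_unet")

spec_in = layers.Input(shape=(128, 256, 1))
enhanced = generator(denoiser(spec_in))
aural_model = Model(spec_in, enhanced, name="aural_dual_stage")
aural_model.compile(optimizer="adam", loss="mae")  # spectrogram reconstruction loss
```

Keeping the two stages as separate sub-models makes it possible to pre-train the denoiser on its own before training the full pipeline end to end.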

The input to the model is the mel-spectrogram of the low-quality audio, and the target output is the mel-spectrogram of the high-quality audio. The model is trained to minimize the difference between its output and the actual high-quality spectrogram.

In addition to the U-Net autoencoder, we adopt the Generative Adversarial Network (GAN) framework to encourage more realistic high-quality output. The generator, which transforms low-quality audio into high-quality audio, is paired with a discriminator that distinguishes authentic high-quality audio from the generator's output.

To enhance the performance of the U-Net autoencoder, we condition the model on additional features extracted from the audio, such as pitch or rhythm, which helps it capture more nuanced characteristics of the sound source. The model is also trained on auxiliary tasks, including source separation and pitch detection, alongside the primary task of transforming low-quality audio into high-quality audio. This multi-task learning approach enables the model to learn more robust representations of the audio data.
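
The sketch below shows one way the adversarial objective described above could be implemented, with a small spectrogram discriminator and a combined adversarial-plus-L1 generator loss. The discriminator layout and loss weighting are assumptions made for illustration; the generator would be, for example, the dual-stage U-Net sketched earlier.

```python
# Sketch of one adversarial training step for the enhancement GAN described
# above. The discriminator judges whether a spectrogram comes from a real
# studio recording; loss weights and layer sizes are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_discriminator(input_shape=(128, 256, 1)):
    inputs = layers.Input(shape=input_shape)
    x = layers.Conv2D(32, 4, strides=2, padding="same", activation="relu")(inputs)
    x = layers.Conv2D(64, 4, strides=2, padding="same", activation="relu")(x)
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Dense(1)(x)  # real/fake logit
    return Model(inputs, outputs, name="discriminator")

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
g_opt = tf.keras.optimizers.Adam(1e-4)
d_opt = tf.keras.optimizers.Adam(1e-4)

def train_step(generator, discriminator, low_q, high_q, l1_weight=100.0):
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake = generator(low_q, training=True)
        real_logits = discriminator(high_q, training=True)
        fake_logits = discriminator(fake, training=True)

        # Discriminator: label studio spectrograms real, generator outputs fake.
        d_loss = (bce(tf.ones_like(real_logits), real_logits)
                  + bce(tf.zeros_like(fake_logits), fake_logits))
        # Generator: fool the discriminator while staying close to the target.
        g_loss = (bce(tf.ones_like(fake_logits), fake_logits)
                  + l1_weight * tf.reduce_mean(tf.abs(high_q - fake)))

    g_grads = g_tape.gradient(g_loss, generator.trainable_variables)
    d_grads = d_tape.gradient(d_loss, discriminator.trainable_variables)
    g_opt.apply_gradients(zip(g_grads, generator.trainable_variables))
    d_opt.apply_gradients(zip(d_grads, discriminator.trainable_variables))
    return g_loss, d_loss
```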

To diversify our training data and improve the model's generalization, we employ data augmentation techniques such as time stretching, pitch shifting, and adding background noise; a brief sketch of this step follows below. We leverage transfer learning by initializing our model with weights from a model pre-trained on a related task, which both accelerates learning and gives the model a foundation of useful audio features. In situations where paired high-quality and low-quality recordings are scarce, we pre-train the model in a self-supervised manner on unpaired data, using tasks such as predicting the next frame of a spectrogram or masking and then reconstructing parts of it.
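
A minimal sketch of the waveform-level augmentation step, using librosa; the parameter ranges are illustrative assumptions rather than the values used in training.

```python
# Illustrative augmentation along the lines described above: random time
# stretching, pitch shifting, and additive background noise.
import numpy as np
import librosa

def augment(y, sr, rng=np.random.default_rng()):
    # Time stretch by up to +/-10%.
    y = librosa.effects.time_stretch(y, rate=rng.uniform(0.9, 1.1))

    # Pitch shift by up to +/-2 semitones.
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=rng.uniform(-2.0, 2.0))

    # Add low-level Gaussian noise at roughly 20-40 dB SNR.
    snr_db = rng.uniform(20.0, 40.0)
    noise_power = np.mean(y ** 2) / (10 ** (snr_db / 10))
    return y + rng.normal(0.0, np.sqrt(noise_power), size=y.shape)

y, sr = librosa.load("data/raw/audio1.wav", sr=22050)
y_aug = augment(y, sr)
```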

We also incorporate psychoacoustic features, such as loudness or sharpness, into the model. These features capture how the audio is perceived by humans and thereby improve the subjective quality of the output. Through constant evaluation with objective metrics and subjective listening tests, we iterate on and refine our approach so that the model's performance is optimized. We remain mindful of the limitations and assumptions inherent in our approach and take these into account during development.
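
For illustration, the snippet below computes two simple stand-ins for such perceptual features with librosa: frame-wise RMS energy as a rough loudness proxy and the spectral centroid as a rough sharpness proxy. A full psychoacoustic model (e.g. Zwicker loudness) would be a more faithful choice; these proxies are assumptions for the sketch.

```python
# Simple stand-ins for the perceptual features mentioned above. A production
# system would more likely use a proper psychoacoustic model; these proxies
# are only illustrative.
import numpy as np
import librosa

def perceptual_proxies(y, sr, n_fft=2048, hop_length=512):
    # Frame-wise RMS energy as a rough loudness proxy.
    rms = librosa.feature.rms(y=y, frame_length=n_fft, hop_length=hop_length)[0]
    # Spectral centroid as a rough sharpness/brightness proxy.
    centroid = librosa.feature.spectral_centroid(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length)[0]
    # Stack as extra conditioning channels aligned with the spectrogram frames.
    return np.stack([rms, centroid], axis=0)
```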


🔬 Experiments and Results

We evaluate our model on a separate test set of low-quality and high-quality recording pairs. Objective measures such as Signal-to-Noise Ratio (SNR) and Perceptual Evaluation of Speech Quality (PESQ) show significant improvement in audio quality. Subjective listening tests also indicate an enhancement in sound quality, with listeners often unable to distinguish our model's output from the actual high-quality recording.
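
A sketch of how the objective metrics could be computed is shown below. The SNR is computed directly with NumPy, while PESQ uses the open-source `pesq` package, which is an assumption about tooling (any ITU-T P.862 implementation would do); the enhanced-audio path is hypothetical.

```python
# Sketch of the objective evaluation described above. PESQ requires 8 kHz or
# 16 kHz signals; the enhanced-audio path below is a hypothetical example.
import numpy as np
import librosa
from pesq import pesq

def snr_db(reference, estimate):
    # Signal-to-noise ratio of the estimate against the studio reference.
    noise = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / (np.sum(noise ** 2) + 1e-12))

ref, _ = librosa.load("data/raw/audio1.wav", sr=16000)            # studio reference
deg, _ = librosa.load("outputs/audio1_enhanced.wav", sr=16000)    # model output (hypothetical path)
n = min(len(ref), len(deg))

print("SNR (dB):", snr_db(ref[:n], deg[:n]))
print("PESQ (wb):", pesq(16000, ref[:n], deg[:n], "wb"))
```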


💡 Discussion

Our results demonstrate the potential of deep learning for audio quality enhancement. However, the model's performance varies depending on the type of sound source and the quality of the input audio. Further improvements might be achieved with a larger and more diverse training dataset or modifications to the model architecture.


🎯 Conclusion and Future Work

This paper presents a novel approach to enhancing audio quality using deep learning. While our results are promising, there is still room for improvement. Future work could explore different model architectures, training techniques, or feature representations. Our hope is that this work will inspire further research in this area, with the goal of making high-quality sound recordings accessible to all.


📌 References

  1. Audio quality enhancement, noise reduction, and traditional audio processing techniques:
    • Boll, S. F. (1979). Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech, and Signal Processing, 27(2), 113-120.
    • Lim, J. S., & Oppenheim, A. V. (1979). Enhancement and bandwidth compression of noisy speech. Proceedings of the IEEE, 67(12), 1586-1604.
  2. Psychoacoustic models for sound preservation during denoising:
    • Zwicker, E., & Fastl, H. (2013). Psychoacoustics: Facts and models. Springer Science & Business Media.
  3. Deep learning for audio tasks, including sound classification, source separation, and denoising:
    • Pascual, S., Bonafonte, A., & Serrà, J. (2017). SEGAN: Speech enhancement generative adversarial network. arXiv preprint arXiv:1703.09452.
  4. U-Net architectures for audio source separation:
    • Jansson, A., Humphrey, E., Montecchio, N., Bittner, R., Kumar, A., & Weyde, T. (2017). Singing voice separation with deep U-Net convolutional networks. ISMIR, 323-332.
  5. Generative Adversarial Networks (GANs) in the audio domain, WaveGAN:
    • Donahue, C., McAuley, J., & Puckette, M. (2019). Adversarial audio synthesis. In International Conference on Learning Representations (ICLR).
  6. Deep learning in the music domain, including automatic music generation and Transformer-based symbolic music generation:
    • Huang, C. Z. A., Vaswani, A., Uszkoreit, J., Simon, I., Hawthorne, C., Shazeer, N., ... & Chen, D. (2018). Music transformer: Generating music with long-term structure. arXiv preprint arXiv:1809.04281.

📁 Project Structure

AURAL_sound_regeneration/
│
├── data/
│   ├── raw/
│   │   ├── audio1.wav
│   │   ├── audio2.wav
│   │   └── ...
│   │
│   ├── preprocessed/
│   │   ├── audio1.npy
│   │   ├── audio2.npy
│   │   └── ...
│   │
│   └── uncompressed/
│       ├── audio1.wav
│       ├── audio2.wav
│       └── ...
│
├── models/
│   ├── noise_reduction/
│   │   ├── model1.h5
│   │   └── ...
│   ├── dynamic_compression/
│   │   ├── model2.h5
│   │   └── ...
│   ├── frequency_expansion/
│   │   ├── model3.h5
│   │   └── ...
│   ├── source_separation/
│   │   ├── model4.h5
│   │   └── ...
│   └── frequency_generation/
│       ├── model5.h5
│       └── ...
│
├── scripts/
│   ├── data_preprocessing/
│   │   ├── load_audio.py
│   │   ├── normalize_audio.py
│   │   ├── convert_to_mono.py
│   │   └── resample_audio.py
│   │
│   ├── model_training/
│   │   ├── train_noise_reduction_model.py
│   │   ├── train_dynamic_compression_model.py
│   │   ├── train_frequency_expansion_model.py
│   │   ├── train_source_separation_model.py
│   │   └── train_frequency_generation_model.py
│   │
│   ├── model_evaluation/
│   │   ├── evaluate_noise_reduction_model.py
│   │   ├── evaluate_dynamic_compression_model.py
│   │   ├── evaluate_frequency_expansion_model.py
│   │   ├── evaluate_source_separation_model.py
│   │   └── evaluate_frequency_generation_model.py
│   │
│   ├── audio_regeneration/
│   │   ├── apply_noise_reduction.py
│   │   ├── apply_dynamic_compression.py
│   │   ├── apply_frequency_expansion.py
│   │   ├── apply_source_separation.py
│   │   ├── apply_frequency_generation.py
│   │   └── mix_and_master.py
│   │
│   └── main.py
│
└── README.md
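
The layout above suggests that main.py chains the per-stage scripts into a single regeneration pipeline. The sketch below is a hypothetical reading of that flow; the imported function names mirror the script names but are assumptions about their entry points, not documented APIs.

```python
# Hypothetical sketch of how main.py could chain the per-stage scripts.
# The function names mirror the script names but are assumptions; the actual
# entry points and return types may differ.
from audio_regeneration.apply_noise_reduction import apply_noise_reduction
from audio_regeneration.apply_dynamic_compression import apply_dynamic_compression
from audio_regeneration.apply_frequency_expansion import apply_frequency_expansion
from audio_regeneration.apply_source_separation import apply_source_separation
from audio_regeneration.apply_frequency_generation import apply_frequency_generation
from audio_regeneration.mix_and_master import mix_and_master

def regenerate(path_in, path_out):
    # Each stage consumes the previous stage's output.
    audio = apply_noise_reduction(path_in)
    audio = apply_dynamic_compression(audio)
    audio = apply_frequency_expansion(audio)
    stems = apply_source_separation(audio)
    stems = [apply_frequency_generation(stem) for stem in stems]
    mix_and_master(stems, path_out)

if __name__ == "__main__":
    regenerate("data/raw/audio1.wav", "data/uncompressed/audio1.wav")
```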

πŸ§‘β€πŸ’Ό Authorship

This work is conducted by: