The aim is to create an artificial neural network that changes the style of a piece of music, e.g. Beatles - Hey Jude -> Hey Jude (jazz version).
https://hyunlee103.tistory.com/80
Because the quality of our own audio source separation was poor, we needed a separate sound source for each instrument. We used MUSDB18 (https://github.com/sigsep/sigsep-mus-db), which provides isolated stems for drums, bass, vocals, and other accompaniment, to satisfy this.
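For reference, the stems can be read with the `musdb` Python package from the same sigsep project; a minimal sketch (the dataset root path below is an assumption):

```python
# Minimal sketch: reading per-instrument stems from MUSDB18 with the musdb package.
# Assumes the dataset has been downloaded to ./musdb18 (the path is an assumption).
import musdb

mus = musdb.DB(root="./musdb18", subsets="train")

for track in mus:
    mixture = track.audio                     # (n_samples, 2) stereo mixture
    drums = track.targets["drums"].audio      # isolated drum stem
    vocals = track.targets["vocals"].audio    # isolated vocal stem
    print(track.name, track.rate, mixture.shape, drums.shape, vocals.shape)
    break  # just inspect the first track
```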
- CUDA 10.0
- python 3.6.10
- pytorch 1.7.1
- numpy 1.19.2
- opencv-python 4.5.1
You can choose to train in either the time domain (waveform) or the frequency domain (spectrogram).

```
python main.py --data_dir 'your datapath'
```
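For illustration, a hedged sketch of the two input representations, using librosa for the frequency-domain path (the file name, sample rate, and FFT parameters are placeholder choices, not the project's actual configuration):

```python
# Sketch of the two input representations (file name and parameters are placeholders).
import librosa
import numpy as np

y, sr = librosa.load("song.wav", sr=22050, mono=True)   # time domain: raw waveform

# Frequency domain: log-scaled mel-spectrogram
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=128)
log_mel = librosa.power_to_db(mel, ref=np.max)
print(y.shape, log_mel.shape)                            # (n_samples,), (128, n_frames)
```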
We tried three models, one in the time domain and two in the frequency domain.
We first applied CycleGAN, which shows excellent performance for style transfer in the image domain, to mel-spectrograms in a naive manner. In the process of restoring the spectrogram to a waveform, the audio quality was severely degraded. Moreover, the audio converted through CycleGAN barely changed from the original. We attribute this to the cycle-consistency loss pulling the output back toward the input and to the L1 loss considering only pixel-wise differences, whereas the spectrogram must undergo structural changes before the style can change. We therefore moved from spectrograms to waveforms and from CycleGAN to MelGAN.
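For context, a rough sketch of a standard CycleGAN generator objective applied to spectrograms (module names, the LSGAN adversarial form, and the loss weight are illustrative, not our exact training code); the cycle term is plain pixel-wise L1, which is why it discourages the structural changes a style change would require:

```python
# Sketch of a one-direction CycleGAN generator loss on mel-spectrograms.
import torch
import torch.nn as nn
import torch.nn.functional as F

def cyclegan_generator_loss(G_AB, G_BA, D_B, real_A, lambda_cyc=10.0):
    """real_A: source-style mel-spectrograms, shape (N, 1, n_mels, n_frames)."""
    fake_B = G_AB(real_A)                              # A -> B style transfer
    rec_A = G_BA(fake_B)                               # B -> A reconstruction
    d_out = D_B(fake_B)
    adv = F.mse_loss(d_out, torch.ones_like(d_out))    # LSGAN adversarial term
    cyc = F.l1_loss(rec_A, real_A)                     # pixel-wise cycle-consistency term
    return adv + lambda_cyc * cyc

# Toy stand-ins just to show the function runs; real generators/discriminators are deeper CNNs.
G_AB = G_BA = nn.Conv2d(1, 1, 3, padding=1)
D_B = nn.Conv2d(1, 1, 4, stride=2, padding=1)
spec = torch.randn(2, 1, 128, 256)
print(cyclegan_generator_loss(G_AB, G_BA, D_B, spec))
```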
MelGAN (more precisely, MelGAN-VC) is a model that captures the structural loss between the generator's input space and its output space through a siamese network. However, since it is designed for spectrograms, we stacked the one-dimensional waveform along a second axis to create a two-dimensional wave. This not only lets MelGAN be applied to the waveform, but also gives a dilation-like effect. We were not satisfied with the results of this model either, so we decided to try an autoencoder rather than a generative model.
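A minimal sketch of the stacking step (the frame length is an illustrative choice, not the value we actually used): consecutive frames of the waveform become rows of a 2D array, so a convolution moving across rows effectively jumps frame-length samples, which is the dilation-like effect mentioned above.

```python
# Sketch of stacking a 1D waveform into a 2D "wave image" (frame length is illustrative).
import torch

def waveform_to_2d(wave: torch.Tensor, frame_len: int = 256) -> torch.Tensor:
    """wave: 1D tensor of samples -> (n_frames, frame_len) 2D tensor."""
    n_frames = wave.numel() // frame_len
    wave = wave[: n_frames * frame_len]           # drop the ragged tail
    return wave.reshape(n_frames, frame_len)      # consecutive rows are adjacent in time

wave = torch.randn(22050)                         # one second of dummy audio at 22.05 kHz
img = waveform_to_2d(wave)                        # (86, 256), usable as a 1-channel "image"
print(img.shape)
```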
We tried the Universal Music Translation network (https://github.com/facebookresearch/music-translation) for style transfer from rock to jazz piano. While that paper translates between individual instruments such as violin, cello, and piano, we tried to translate whole rock tracks into jazz piano. Because this model is based on WaveNet, training and inference are very expensive.
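To make the cost point concrete, a rough sketch of the dilated causal convolution stack at the heart of WaveNet-style models (channel count and depth are assumptions, not the paper's configuration); autoregressive, sample-by-sample generation through such a stack is what makes inference so expensive.

```python
# Rough sketch of a WaveNet-style stack of dilated causal convolutions
# (channel count and number of layers are assumptions, not the paper's configuration).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalStack(nn.Module):
    def __init__(self, channels: int = 64, n_layers: int = 8):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Conv1d(channels, channels, kernel_size=2, dilation=2 ** i)
             for i in range(n_layers)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); left-pad each layer so the convolution stays causal
        for conv in self.layers:
            pad = conv.dilation[0] * (conv.kernel_size[0] - 1)
            x = torch.relu(conv(F.pad(x, (pad, 0))))
        return x

x = torch.randn(1, 64, 16000)         # one second of features at 16 kHz (dummy input)
print(DilatedCausalStack()(x).shape)  # (1, 64, 16000); receptive field grows to 256 samples
```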
There is a limit to how directly prior computer vision research can be applied, because of the differences between image and audio data. Due to the high cost of WaveNet, it is difficult to raise the resolution of the results, and a real-time service still seems a long way off. A future direction is to identify the data characteristics that determine musical style and to build models that take those characteristics into account. We also need lower-cost models capable of high-resolution, real-time output.
- MUSDB18 Dataset (Rafii et al., 2017)
- Music Source Separation Using Stacked Hourglass Networks (Park et al., ISMIR 2018)
- Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks (Zhu et al., 2017)
- WaveNet: A Generative Model for Raw Audio (DeepMind, 2016)
- Meta-Learning Extractors for Music Source Separation (Samuel et al., 2020)
- MelGAN-VC: Voice Conversion and Audio Style Transfer on arbitrarily long samples using Spectrograms (Marco Pasini, 2020)
- A Universal Music Translation Network (Noam Mor et al., 2018)
- Kyojung Koo (https://github.com/koo616)
- Sanghyung Jung (https://github.com/SangHyung-Jung)
- Hyun Lee
@misc{musdb18,
author = {Rafii, Zafar and
Liutkus, Antoine and
Fabian-Robert St{\"o}ter and
Mimilakis, Stylianos Ioannis and
Bittner, Rachel},
title = {The {MUSDB18} corpus for music separation},
month = dec,
year = 2017,
doi = {10.5281/zenodo.1117372},
url = {https://doi.org/10.5281/zenodo.1117372}
}