Author:
Anton Lechuga
Project Type: Beat the classics | Bring your own data
Domain: Audio Processing | Music Generation
The goal is to build a model that generates Electronic Dance Music (EDM) indistinguishable from music by human creators. To limit the scope of the project, I decided to concentrate on the sub-genres of Industrial and Acid Techno, which are very similar in structure, melody and rhythm.
This project is part of the course 'Applied Deep Learning' at the Technical University Vienna, in which I want to investigate to what degree it is possible to downscale current state-of-the-art architectures.
Note
The chosen procedure greatly differs from my initial plans. Some of the major changes are:
- Using raw audio data as input (good MIDI data was difficult to obtain)
- Using Vector Quantized Variational Autoencoder for generating music embeddings (enormous input sizes with raw audio)
Install the required dependencies (Python 3.9 or later) with
pip install -r requirements.txt
The codebase is designed to train a model end-to-end once the directory containing the audio files has been specified.
To run the training loop, run:
python3 training/train.py --config config/<config_name>.yml
To start the app, run:
python3 app/app.py
My approach consists of two main stages, since I did not start from an existing dataset. Hence, I first implemented a pipeline that processes audio files into training data, which is explained in the last part of this section.
Warning
File sizes can grow very quickly when working with high-quality audio data!
While I trained my model on freely available techno tracks downloaded from SoundCloud, the pipeline also works on any other audio data, which is why the following describes the procedure more generally.
├── 📂 dataset
│ ├── 📂 data
│ │ ├── 📜 techo_<sample_rate>.h5
│ │ │ ...
│ ├── 📜 data_generator.py
│ ├── 📜 data_handler.py
│ └── 📜 file_processor.py
The pipeline is designed to work on any audio data; the currently supported file types are .wav
and .mp3
. All audio files found in a specified directory are sliced and stored automatically in the /data
folder, which the data handlers operate on (see section 3 for more details). The classes in data_handler
bridge between the stored data and the PyTorch data operators defined in data_processor
, which handle all functionality needed to prepare the input for training a model.
One major difficulty was the compromise between file size and data quality, as good-quality audio data in .wav
form quickly grows to gigabytes when training on many songs. So far, I chose quality over quantity and only used good-quality audio at a high sample rate of 44.1 kHz, with each training sample 8 seconds long. However, these preprocessing steps are fully automated and can quickly be adjusted during training: the file handler first checks whether a data file already exists for the given specifications and generates a new one if not.
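The slicing step can be sketched as follows. This is a minimal illustration with made-up names, not the actual `file_processor` API; the real pipeline additionally handles resampling and writes the result to an HDF5 file:

```python
import numpy as np

def slice_audio(samples: np.ndarray, sample_rate: int, clip_seconds: float) -> np.ndarray:
    """Cut a mono waveform into non-overlapping clips of fixed length.

    Trailing samples that do not fill a whole clip are discarded.
    Returns an array of shape (n_clips, clip_len).
    """
    clip_len = int(sample_rate * clip_seconds)
    n_clips = len(samples) // clip_len
    return samples[: n_clips * clip_len].reshape(n_clips, clip_len)

# Example: a 30-second track at 44.1 kHz yields three 8-second clips.
track = np.random.randn(30 * 44_100).astype(np.float32)
clips = slice_audio(track, sample_rate=44_100, clip_seconds=8.0)
print(clips.shape)  # (3, 352800)
```

At 44.1 kHz an 8-second clip is already 352,800 samples, which illustrates why file size becomes a concern so quickly.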
I implemented a Vector Quantized Variational Autoencoder (VQ-VAE) in PyTorch, inspired by [1] and [2], in order to reduce the dimensionality of the input. The implementation follows the architecture illustrated in [3], with some added residual blocks in both the encoder and the decoder to ensure information flow. The embedding dictionary (bottleneck) is implemented in vectorQuantizer
, through which no gradients pass. During training, model parameters are stored in the parameters
directory. The losses are also implemented directly in the model.
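The core of the bottleneck is a nearest-neighbour lookup into the embedding dictionary; since the argmin is not differentiable, the PyTorch model routes gradients around it (a straight-through estimator, as in [1]). A framework-agnostic NumPy sketch of the lookup alone, with names of my choosing rather than the repository's `vectorQuantizer` API:

```python
import numpy as np

def quantize(z_e: np.ndarray, codebook: np.ndarray):
    """Map each encoder output vector to its nearest codebook entry.

    z_e:      (n, d) encoder outputs
    codebook: (k, d) embedding dictionary
    Returns the chosen indices and the quantized vectors z_q.
    In training, the backward pass copies gradients from z_q
    straight to z_e, so the argmin itself is never differentiated.
    """
    # Squared Euclidean distance between every z_e row and every code.
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)   # (n,)
    z_q = codebook[indices]          # (n, d)
    return indices, z_q

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 64))   # k=512 codes of dimension 64
z_e = rng.normal(size=(16, 64))         # a batch of encoder outputs
indices, z_q = quantize(z_e, codebook)
```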
├── 📂 model
│ ├── 📂 vqvae
│ │ ├── 📜 vqvae.py
│ │ ├── 📜 resid.py
│ │ ├── 📜 encoder.py
│ │ ├── 📜 decoder.py
│ │ ├── 📜 vectorQuantizer.py
│ │ ├── 📂 parameter
│ │ │ ├── 📜 <model_name>.pth
Note
- At the time of submission, I still wanted to test some other configurations, which is why the setup is still focused on training models with various configurations of data splits and model/training parameters. Once these are fixed, the constant parts of the configuration will be moved to their corresponding folders in data/training/model.
- In order to run the training, I used wandb
for logging. When first running the training, you will be asked to sign in to wandb and create a (free) account.
As already illustrated in the first section, all settings needed for training can be accessed directly during training. I therefore organized all required information about data, training and model parameters in a hierarchical configuration file until a final model is found. Initially, I aimed to generate music directly from the model, which I have not yet been able to do successfully.
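A hierarchical YAML config of the kind passed to `train.py` might look like the following; the keys and values are illustrative, not the repository's actual schema:

```python
import yaml

# Illustrative config text; the real files live under config/.
CONFIG = """
data:
  sample_rate: 44100
  clip_seconds: 8
model:
  codebook_size: 512
  embedding_dim: 64
training:
  batch_size: 16
  learning_rate: 0.0003
"""

cfg = yaml.safe_load(CONFIG)
print(cfg["model"]["codebook_size"])  # 512
```

Keeping data, model and training parameters in separate sub-trees makes it easy to log the whole dictionary to wandb and to vary one group while holding the others fixed.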
Hence, for this milestone I use the defined loss and its components to assess the quality of my model, which can alternatively also be used to generate music embeddings as input for auto-regressive models such as transformer-based architectures. For this model, the loss is defined as a weighted sum of three components, where the weights themselves are hyperparameters:
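Written out, with the weights $\beta_i$ as hyperparameters, a plausible form of this sum is the standard VQ-VAE objective (the symbols here are my notation, not necessarily the code's):

```latex
\mathcal{L} \;=\; \beta_{\text{rec}}\,\mathcal{L}_{\text{rec}}
\;+\; \beta_{\text{cb}}\,\mathcal{L}_{\text{codebook}}
\;+\; \beta_{\text{commit}}\,\mathcal{L}_{\text{commit}},
\qquad
\mathcal{L}_{\text{rec}} \;=\; \mathcal{L}_{\text{time}} + \mathcal{L}_{\text{spec}}
```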
The reconstruction loss is further divided into two terms: a one-dimensional difference between the input and the reconstruction on the one hand, and a spectral difference between their spectrograms on the other. Since the codebook loss measures the distance between the encoded sound and the codebook vectors, it is also an important metric to track, as a small average distance between the two may indicate a better ability to generate sounds as well. Hence, I tracked all of these metrics during training.
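The two reconstruction terms can be sketched in NumPy as follows. This is illustrative only: the model computes them in PyTorch, and the spectrogram here is a bare magnitude STFT rather than the exact transform used in training:

```python
import numpy as np

def time_loss(x, x_hat):
    """Mean squared error between waveform and reconstruction."""
    return float(np.mean((x - x_hat) ** 2))

def spectral_loss(x, x_hat, n_fft=1024, hop=256):
    """MSE between magnitude spectrograms of input and reconstruction."""
    def mag_spec(sig):
        # Frame the signal and take the magnitude of each frame's rFFT.
        frames = [sig[i : i + n_fft] for i in range(0, len(sig) - n_fft + 1, hop)]
        return np.abs(np.fft.rfft(np.stack(frames), axis=-1))
    return float(np.mean((mag_spec(x) - mag_spec(x_hat)) ** 2))

# One second of a sine tone plus a slightly noisy "reconstruction".
x = np.sin(np.linspace(0, 100, 44_100))
x_hat = x + 0.01 * np.random.default_rng(1).normal(size=x.shape)
rec_loss = time_loss(x, x_hat) + spectral_loss(x, x_hat)
```

The spectral term penalizes errors that are small sample-by-sample but audible as a change in timbre, which the time-domain term alone would underweight.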
For the task at hand, my simple baseline had many problems with vanishing or exploding gradients and struggled to push the validation set loss down. My targets are:

- Quantitative Assessment: $L<0.25$
- Qualitative Assessment: The reconstruction is subjectively close to the input audio
- Bonus: Generating something that can be considered sound from the codebook vectors
As of now, I trained, validated and tested the models on disjoint subsets of songs and their corresponding splits. I trained three main models with different sizes and settings, of which the smallest, baseline
, could also be trained on my local M1 Mac even at the full sample rate. big_compression
only extended the baseline by compressing the audio into the latent space at a higher compression factor.
(Plots: reconstruction quality and codebook vector closeness over training.)
Plotting two of the main loss components over time, the small baseline still performs very well compared to the much bigger models while using only a fraction of the codebook vectors. However, one needs to be cautious, as I only trained on a total of 20 songs.
| Model | Loss |
|---|---|
| Baseline | 0.105 |
| Compression | 0.4345 |
| Compression + Codebook | 0.07521 |
Finally, for the qualitative assessment I placed a working example in docs/wav/
, where a reconstructed sound is compared to its original. I did not include a file of a generated sound since, up to this point, it still consists mainly of random-sounding noise. The reconstruction process, however, works a lot better.
"I expect the project to be a lot more time consuming as indicated by the ECTS, which is however a circumstance I am willing to take." - In line with these words from the first milestone, the project turned out to be very time-consuming with many difficult choices to make. I will however remain working on the problem, especially since the pipeline and model building was so intensive, but insightful and major progresses only started over the last week. Some of which are:
- Optimizing data processing without storing intermediate data files
- Increasing the amount of training data by sliding a window over the audio instead of creating independent splits
- Implementing an attention layer, as techno music has many recurring themes
- Running training in a more stable environment than Google Colab
- Spectrogram -> Mel-Spectrogram
- Optional: Depending on the success of VQ-VAE generation, implement a simple transformer that generates sound based on the codebook vectors
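The planned sliding-window slicing (second bullet above) can be sketched as follows: instead of cutting a track into disjoint clips, a stride smaller than the clip length yields overlapping, and therefore more, training samples. Names are again my own:

```python
import numpy as np

def sliding_clips(samples: np.ndarray, clip_len: int, stride: int) -> np.ndarray:
    """Extract overlapping clips with the given stride.

    stride < clip_len means neighbouring clips share audio,
    multiplying the number of training samples per track.
    """
    starts = range(0, len(samples) - clip_len + 1, stride)
    return np.stack([samples[s : s + clip_len] for s in starts])

track = np.arange(10)
# Disjoint slicing would give 10 // 4 = 2 clips; stride=2 gives 4.
clips = sliding_clips(track, clip_len=4, stride=2)
print(clips.shape)  # (4, 4)
```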
| Task | Hours | Deadline |
|---|---|---|
| Dataset Collection | 5 | 31.10.2023 |
| Model Design & Building | ~30 | 25.11.2023 |
| Training & Fine-Tuning | >30 | 15.12.2023 |
| Application Building | 10 | 10.01.2024 |
| Final Report | 5 | 15.01.2024 |
| Presentation Preparation | 5 | 15.01.2024 |
And of course, I am open to and very grateful for any feedback and hints on how to improve the pipeline and the overall approach of the project!
[1] Dhariwal, P., Jun, H., Payne, C., Kim, J. W., Radford, A., & Sutskever, I. (2020). Jukebox: A generative model for music. arXiv preprint arXiv:2005.00341.
[2] Défossez, A., Copet, J., Synnaeve, G., & Adi, Y. (2022). High fidelity neural audio compression. arXiv preprint arXiv:2210.13438.
[3] Ding, S., & Gutierrez-Osuna, R. (2019, September). Group Latent Embedding for Vector Quantized Variational Autoencoder in Non-Parallel Voice Conversion. In Interspeech (pp. 724-728).