chickensong


Project 2: Generative Audio

Owen Jow, owen@eng.ucsd.edu

Abstract

In this project, I execute a musical "style transfer" from arbitrary audio to the same audio as performed by chickens. I follow the approach of A Universal Music Translation Network by Mor et al., which trains a single WaveNet encoder in tandem with multiple WaveNet decoders, one for each "style" of audio. The encoder embeds raw audio in a latent space, and each decoder takes the latent embedding and recreates the audio in the timbre of its training domain. The authors provide code for their paper here, which I have of course hijacked.
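
For intuition, here is a minimal sketch of this encode/decode pipeline in PyTorch. The Encoder and Decoder classes are illustrative stand-ins, not the actual WaveNet architectures from the music-translation repo, and all layer sizes are assumptions.

import torch
import torch.nn as nn

# Illustrative stand-ins for the shared WaveNet encoder and per-domain decoders.
class Encoder(nn.Module):              # shared across all audio domains
    def __init__(self, latent_dim=64):
        super().__init__()
        self.net = nn.Conv1d(1, latent_dim, kernel_size=9, stride=800, padding=4)

    def forward(self, wav):            # wav: (batch, 1, samples)
        return self.net(wav)           # coarse latent: (batch, latent_dim, frames)

class Decoder(nn.Module):              # one per output "style" (e.g. chicken)
    def __init__(self, latent_dim=64):
        super().__init__()
        self.net = nn.ConvTranspose1d(latent_dim, 1, kernel_size=800, stride=800)

    def forward(self, z):
        return self.net(z)             # waveform rendered in this decoder's timbre

encoder = Encoder()
decoders = {"chicken": Decoder(), "cello": Decoder()}

wav = torch.randn(1, 1, 16000)                # one second of 16 kHz audio
z = encoder(wav)                              # domain-agnostic embedding
chicken_wav = decoders["chicken"](z)          # same content, chicken timbre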

I freeze the published pre-trained encoder (and domain confusion network) and train only a decoder on chicken sounds, on the basis that I am not aiming to translate from the chicken domain, only into it. The chicken sounds are obtained from random YouTube videos; as a preprocessing step, I split the audio and remove silences. Originally I had planned to use Google's AudioSet dataset, which labels YouTube clips containing sounds of different categories (one of which is "chickens/roosters"), but I found these clips to be too full of other, non-chicken utterances (e.g. humans talking).
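
The split-and-remove-silences step could look roughly like the following. This is a sketch only, assuming librosa and soundfile; the actual remove_silences.py may differ, and the top_db threshold is an arbitrary choice.

import librosa
import numpy as np
import soundfile as sf

# Sketch of the silence-removal preprocessing step; top_db is an assumed threshold.
def remove_silences(in_path, out_path, top_db=30):
    y, sr = librosa.load(in_path, sr=None)                # keep the native sample rate
    intervals = librosa.effects.split(y, top_db=top_db)   # non-silent (start, end) sample indices
    voiced = np.concatenate([y[start:end] for start, end in intervals])
    sf.write(out_path, voiced, sr)

remove_silences("raw_clucks.wav", "clucks_no_silence.wav")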

In short, the goal is to emulate the sounds of chickens, composing them in a way that is at least somewhat melodic to the human ear.

Future/Alternative Directions

  • Based on approaches from NSynth, GANSynth, and SynthNet, I could train a synthesizer to produce chicken timbre conditioned on MIDI sequences. In the former cases (NSynth, GANSynth), I would condition the decoder/generator on a pitch vector in addition to the standard latent vector. For inference, I would encode chicken audio, fix the latent conditioning vector, and then feed in pitch vectors according to a MIDI sequence. In the latter case (SynthNet), I would directly train a WaveNet-like architecture to map (MIDI, style label) to waveform. However, if I wanted to train any of these models on chicken audio (perhaps not necessary for NSynth/GANSynth), I would need pitch- or MIDI-annotated squawk waveforms. I might be able to create those annotations using a library such as aubio (see the pitch-extraction sketch after this list).
  • This paper (MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis, with code here), linked by Robert, might be useful. I could follow their approach from Section 3.3 and replace the WaveNet decoder in the translation network with a MelGAN generator. This would make translation incredibly fast in comparison to what I'm currently dealing with. However, somewhat ironically I lack the time to train the model. (The authors mention they train each decoder for four days on a 2080 Ti on MusicNet data, whereas I have less time, worse hardware, and lower-quality data.)
  • This recent paper (TimbreTron) might yield superior results, but I don't have time to implement it.
  • For this project, I could of course experiment with other domain translations. There exists a vast space of music and music genres that might be translated into chickensong. But I don't have a MelGAN, so translation is super slow.
  • I expect that my results can be improved with more data and more training.
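
Regarding the pitch-annotation idea in the first bullet, a rough sketch of extracting a frame-level MIDI pitch track with aubio might look like this. The buffer/hop sizes and confidence cutoff are assumptions, and the file name is hypothetical.

import aubio

# Sketch: frame-level MIDI pitch estimates for a (silence-stripped) chicken recording.
def pitch_track(wav_path, buf_size=2048, hop_size=512):
    src = aubio.source(wav_path, samplerate=0, hop_size=hop_size)   # 0 = use the file's native rate
    detector = aubio.pitch("yin", buf_size, hop_size, src.samplerate)
    detector.set_unit("midi")
    pitches = []
    while True:
        samples, read = src()
        midi = detector(samples)[0]
        confidence = detector.get_confidence()
        pitches.append(midi if confidence > 0.8 else 0.0)           # 0.0 marks an "unvoiced" frame
        if read < hop_size:                                         # last (partial) frame
            break
    return pitches

print(pitch_track("clucks_no_silence.wav")[:20])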

Model/Data

  • You can download the trained model from Google Drive. Place it in the chickensong/music-translation/checkpoints directory. (The resulting folder hierarchy will be chickensong/music-translation/checkpoints/chickenNet.)
  • To download and preprocess the data, run ./make_dataset.sh <desired data root>.
  • To download the (no-longer-used) AudioSet data, get the unbalanced train split and run
python3 dl_train_segments.py unbalanced_train_segments.csv --out_dir <raw wav dir>
for fpath in <raw wav dir>/*.wav; do python3 remove_silences.py "${fpath}" --overwrite; done
./preprocess_data.sh <raw wav dir> <processed wav dir>

Code

The training code is in the music-translation submodule.

To generate translated audio, first follow the music-translation setup instructions. Make sure to download the pre-trained models (direct link) and place them in the chickensong/music-translation/checkpoints directory. (The resulting folder hierarchy will be chickensong/music-translation/checkpoints/pretrained_musicnet.) Then run

cd music-translation
./train_decoder.sh <data root>  # OR download pre-trained model
./sample_chickens.sh <preprocessed folder>  # warning: slow as hell

An example of a <preprocessed folder> is musicnet/preprocessed/Bach_Solo_Cello.

Quickstart

git clone --recursive https://github.com/ohjay/chickensong.git
cd chickensong

Follow the setup instructions for the music-translation submodule, including downloading the pre-trained models (see Code). Then, depending on whether or not you are using the pre-trained chicken model, you have two options.

If using the pre-trained chicken model

Download the chicken model (see Model/Data). Then run

cd music-translation
./sample_chickens.sh <preprocessed folder>

If training the model yourself

./make_dataset.sh .
cd music-translation
./train_decoder.sh ..
./sample_chickens.sh <preprocessed folder>

Results

I use MusicNet as a source of audio to be translated into chicken. (Note that the pre-trained music-translation models were trained using MusicNet, so there is some precedent.) My chicken model was trained for 200 epochs on about 15 total minutes of unique audio, collected from a hand-curated set of six YouTube videos (see make_dataset.sh for links).

From chicken to classical music

The first direction didn't work very well, so I tried going the other way around: encoding the chicken audio and decoding it using some of the pre-trained instrument decoders. This can be seen as the classical-musical perception of the tunes the chickens are "trying to sing."

Autoencoder reconstructions

I include reconstructions of training and validation data given by the denoising WaveNet autoencoder.

Bonus: Chickenspeak

I already had the code for it (a cleaned-up fragment of CorentinJ's excellent voice cloning project), so I figured I'd try performing voice cloning on a chicken. This was the result: chickenspeak.wav. The input text was "kloek kloek kakara-kakara kotek-kotek kokoda guaguagua petok kudak kackel po-kok kuckeliku kokarakkoo kukuruyuk kukeleku quiquiriquic kikiriki" (a collection of onomatopoeias for chickens clucking and crowing). The "voice" was cloned from this video.

Another example. / Speaking actual English (?).

Technical Notes

  • The code runs on Ubuntu 18.04 with Python 3.6.8.
  • To get the data, you'll need youtube-dl. You can install it with pip.
  • Other than that, the requirements are PyTorch, librosa, SciPy, tqdm, etc. Nothing too unusual.

Reflections

As you can hear, this project ended up being more challenging than I had anticipated and the results were not stellar. I attribute this mainly to the fact that my chicken data was not very structured (unlike classical music), and it's inherently a difficult task to translate irregular clucking and background noise into fluid music using current audio-based domain translation methods. This is evidenced by the fact that the autoencoder reconstructions are chicken-like, but the translations are less obviously so. There are quite a few semantic facets of audio to get right in a domain translation: timbre, rhythm, melody, "foreground sound" in the case of my messy chicken data, volume, etc. So perhaps some kind of forced disentanglement (e.g. by conditioning on different aspects of the audio, or by generating each component separately before combining them) would be helpful, to exploit the structure of audio in a learning context.
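
As a toy illustration of that last idea (not something I implemented), a decoder could be conditioned on explicitly separated streams such as pitch, loudness, and a timbre embedding rather than a single entangled latent. Everything below, including names and sizes, is hypothetical.

import torch
import torch.nn as nn

# Toy sketch of "forced disentanglement": the decoder receives pitch, loudness,
# and timbre as separate conditioning streams.
class DisentangledDecoder(nn.Module):
    def __init__(self, timbre_dim=64, hidden=128):
        super().__init__()
        self.fuse = nn.Linear(1 + 1 + timbre_dim, hidden)  # pitch + loudness + timbre
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)                    # one output value per time step

    def forward(self, pitch, loudness, timbre):
        # pitch, loudness: (batch, time, 1); timbre: (batch, time, timbre_dim)
        z = torch.cat([pitch, loudness, timbre], dim=-1)
        h, _ = self.rnn(torch.relu(self.fuse(z)))
        return self.out(h)

decoder = DisentangledDecoder()
frames = decoder(torch.rand(1, 100, 1), torch.rand(1, 100, 1), torch.randn(1, 100, 64))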

References