This is a Keras implementation of a variational version of the baseline spectral autoencoder described in the Google DeepMind paper.
You can have a look at our results here:
We used a subset of the public NSynth dataset consisting of brass and flute sounds. We computed the log-magnitude spectrum of each audio clip and used it as both input and target during training. As in the original article, we used the Griffin-Lim algorithm to reconstruct the phase of each signal.
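The preprocessing and reconstruction steps above can be sketched roughly as follows. This is a minimal NumPy/SciPy illustration, not the repository's actual code; the window size `nperseg`, the iteration count `n_iter`, and the `log1p` compression are illustrative choices.

```python
import numpy as np
from scipy.signal import stft, istft

def log_magnitude(x, nperseg=512):
    # STFT, then log-compressed magnitude; the phase is discarded here
    # and must later be re-estimated (e.g. with Griffin-Lim).
    _, _, Z = stft(x, nperseg=nperseg)
    return np.log1p(np.abs(Z))

def griffin_lim(log_mag, n_iter=50, nperseg=512):
    # Griffin-Lim: start from a random phase and alternate between
    # time domain and frequency domain, keeping the target magnitude
    # fixed and updating only the phase estimate at each iteration.
    mag = np.expm1(log_mag)  # undo the log1p compression
    phase = np.exp(2j * np.pi * np.random.rand(*mag.shape))
    for _ in range(n_iter):
        _, x = istft(mag * phase, nperseg=nperseg)
        _, _, Z = stft(x, nperseg=nperseg)
        phase = np.exp(1j * np.angle(Z))
    _, x = istft(mag * phase, nperseg=nperseg)
    return x
```

In a pipeline like this the network only ever sees (and predicts) log-magnitude spectrograms; Griffin-Lim is applied once, as a post-processing step, to turn a predicted spectrogram back into a waveform.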
We implemented a variational version of the baseline autoencoder to see whether meaningful audio generation was possible in this setting. To reduce the very large number of parameters of the original model, we shrank the number of filters in each layer relative to the baseline. In this case too, the phase was reconstructed with the Griffin-Lim algorithm.
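A variational autoencoder of this kind can be sketched in Keras roughly as below. This is a minimal illustration, not the repository's model: `SPEC_DIM`, `LATENT_DIM`, and the layer sizes are hypothetical, and dense layers stand in for the baseline's convolutional stacks for brevity.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

SPEC_DIM = 513    # hypothetical number of frequency bins per spectral frame
LATENT_DIM = 16   # hypothetical latent dimensionality

# Encoder: maps a spectral frame to the mean and log-variance of a Gaussian.
enc_in = keras.Input(shape=(SPEC_DIM,))
h = layers.Dense(256, activation="relu")(enc_in)
z_mean = layers.Dense(LATENT_DIM)(h)
z_log_var = layers.Dense(LATENT_DIM)(h)
encoder = keras.Model(enc_in, [z_mean, z_log_var])

# Decoder: reconstructs the log-magnitude frame from a latent sample.
dec_in = keras.Input(shape=(LATENT_DIM,))
hd = layers.Dense(256, activation="relu")(dec_in)
dec_out = layers.Dense(SPEC_DIM)(hd)
decoder = keras.Model(dec_in, dec_out)

optimizer = keras.optimizers.Adam()

@tf.function
def train_step(x):
    with tf.GradientTape() as tape:
        mu, log_var = encoder(x)
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
        eps = tf.random.normal(tf.shape(mu))
        z = mu + tf.exp(0.5 * log_var) * eps
        x_hat = decoder(z)
        # VAE objective: reconstruction error plus KL divergence
        # between the approximate posterior and the unit Gaussian prior.
        recon = tf.reduce_mean(tf.reduce_sum(tf.square(x - x_hat), axis=-1))
        kl = -0.5 * tf.reduce_mean(
            tf.reduce_sum(1 + log_var - tf.square(mu) - tf.exp(log_var),
                          axis=-1))
        loss = recon + kl
    variables = encoder.trainable_variables + decoder.trainable_variables
    optimizer.apply_gradients(zip(tape.gradient(loss, variables), variables))
    return loss
```

After training, new spectrograms can be generated by sampling `z` from the unit Gaussian prior and running it through the decoder; the waveform is then recovered with Griffin-Lim.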
The code is released under the terms of the MIT license.