Audio to image (PNG) for Convolutional Neural Network training
The goal of this project is to train a convolutional network on images of sounds. Each image is an array of audio samples represented as RGB colors.
-
Question 1: Is an image-based convolutional network able to predict or generate speech regardless of factors such as accent?
-
Question 2: Is an image-based convolutional network able to predict or generate sounds (such as a drum stick, a cow, etc.)?
-
Goal 1: recognize the sounds (method: not yet known).
-
Goal 2: synthesize sounds (method: StackGAN).
-
Note I: A sound can be anything. In this example, spoken words in the form of nouns are used.
-
Note II: For now the software handles 16-bit, 44.1 kHz, mono (single channel) files with a maximum length of one second (= 44100 samples).
-
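The constraints in Note II can be checked before conversion. The following is a minimal sketch using only Python's standard `wave` module; the function name `check_wav` and the wording of the messages are illustrative assumptions and are not taken from the project's scripts.

```python
import wave

MAX_SAMPLES = 44100  # one second at 44.1 kHz


def check_wav(path):
    """Return a list of reasons why the file does not fit Note II (hypothetical helper)."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:  # 2 bytes per sample = 16-bit
            problems.append("not 16-bit")
        if wav.getframerate() != 44100:
            problems.append("sample rate is not 44100 Hz")
        if wav.getnchannels() != 1:
            problems.append("not mono")
        if wav.getnframes() > MAX_SAMPLES:
            problems.append("longer than one second")
    return problems
```

An empty list means the file fits the current limits, for example `check_wav("input-wav/battle.wav")`.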
Note III: This is an experiment; any form of feedback is welcome.
Running examples
From WAV to PNG
$ python convertWavToPng.py input-wav/battle.wav output.png
From PNG to WAV
$ python convertPngToWav.py output.png output.wav
Step I: Audio to PNG
Every sample is converted into an RGB value.
- A 16-bit sample value ranges from −32,768 to +32,767.
- The sample value is shifted into a non-negative integer: `sample value + 32768` (0 to 65,535).
- The shifted value is converted into a hex color.
- The hex color is converted into an RGB value.
- The pixel is set to that RGB color.
- The pixel is placed on a `math.sqrt(44100)` frame (210x210).
File: convertWavToPng.py
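The per-sample steps of Step I can be sketched as follows. This is an illustration under stated assumptions, not a copy of `convertWavToPng.py`: it assumes the shifted value is zero-padded to a six-digit hex color and that pixels are laid out row by row. The names `sample_to_rgb` and `pixel_position` are hypothetical.

```python
import math

SIDE = int(math.sqrt(44100))  # 210, so a 210x210 frame holds one second of audio


def sample_to_rgb(sample):
    """Map a signed 16-bit sample to an (R, G, B) tuple."""
    shifted = sample + 32768       # shift -32768..32767 to 0..65535
    hex_color = f"{shifted:06x}"   # assumption: zero-padded to six hex digits
    # Split the hex color into its red, green and blue byte pairs.
    return tuple(int(hex_color[i:i + 2], 16) for i in (0, 2, 4))


def pixel_position(index, side=SIDE):
    """Assumption: row-major layout, sample index -> (x, y)."""
    return index % side, index // side
```

With this encoding a 16-bit sample only ever fills the green and blue channels, since the shifted value never exceeds `0xFFFF`. Silence (a sample of 0) maps to `(0, 128, 0)`.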
Step II: PNG to audio
Step I is reversed: every pixel is converted back into a sample value.
File: convertPngToWav.py
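The reverse mapping can be sketched in the same spirit. This is a minimal illustration, assuming that a pixel's hex color, read as a single 24-bit integer, equals the sample value plus 32768. It is not taken from `convertPngToWav.py`, and the names `rgb_to_sample` and `frames_from_pixels` are hypothetical.

```python
import struct


def rgb_to_sample(rgb):
    """Map an (R, G, B) tuple back to a signed 16-bit sample."""
    r, g, b = rgb
    shifted = (r << 16) | (g << 8) | b  # read the hex color as one integer
    return shifted - 32768              # undo the +32768 shift


def frames_from_pixels(pixels):
    """Pack pixels into little-endian 16-bit PCM bytes, as stored in WAV files."""
    samples = [rgb_to_sample(p) for p in pixels]
    return struct.pack(f"<{len(samples)}h", *samples)
```

The resulting bytes could be passed to `writeframes` of a writer opened with Python's `wave` module. Pixels with a non-zero red channel fall outside the 16-bit range under this assumption, and `struct.pack` raises an error for them.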
Results:
Noun | Audio Input | Converted PNG | Reverted Audio |
---|---|---|---|
Battle | WAV | WAV | |
Broker | WAV | WAV | |
Calculator | WAV | WAV | |
Cloth | WAV | WAV | |
Collection | WAV | WAV | |
Guy | WAV | WAV | |
Lyric | WAV | WAV | |
Miscommunication | WAV | WAV | |
Protocol | WAV | WAV | |
Trainer | WAV | WAV | |
Training model
To Do
StackGAN model
To Do