Audio to image (PNG) for Convolutional Neural Network training

The goal of this project is to train a convolutional network on images of sounds. Each image is an array of audio data in which the bits of the sound are represented as RGB colors.

Question 1: Is an image-based convolutional network able to predict / generate speech regardless of factors such as accent?
Question 2: Is an image-based convolutional network able to predict / generate other sounds (for example: drum stick, cow, etc.)?
Goal 1: recognize the sounds (method: not yet decided).
Goal 2: synthesize sounds (method: StackGAN).
Note I: A sound can be anything. In this example, spoken words in the form of nouns are used.
Note II: For now the software handles 16-bit, 44.1 kHz, mono (single-channel) files with a maximum length of one second (= 44100 samples).
Note III: This is an experiment. Any form of feedback is welcome.
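
The audio-to-image idea and the limits in Note II can be illustrated with a minimal sketch. This is not the project's actual encoding, only one possible mapping under stated assumptions: each signed 16-bit sample becomes one pixel (R = high byte, G = low byte, B unused), shorter clips are zero-padded to one second, and the 44100 pixels are laid out as a 210 x 210 image (210 * 210 = 44100). The names samples_to_pixels, pixels_to_samples and encode_png are hypothetical helpers introduced for this example. Only the Python standard library is used.

```python
import struct
import zlib

SAMPLE_RATE = 44100  # Note II: at most one second at 44.1 kHz
SIDE = 210           # assumption: 210 * 210 = 44100, one pixel per sample


def samples_to_pixels(samples):
    """Map signed 16-bit samples to (R, G, B) tuples, zero-padded to one second."""
    if len(samples) > SAMPLE_RATE:
        raise ValueError("at most one second (44100 samples) is supported")
    padded = list(samples) + [0] * (SAMPLE_RATE - len(samples))
    pixels = []
    for s in padded:
        u = s & 0xFFFF                        # reinterpret signed 16-bit as unsigned
        pixels.append((u >> 8, u & 0xFF, 0))  # assumed layout: R = high, G = low, B = 0
    return pixels


def pixels_to_samples(pixels):
    """Inverse of samples_to_pixels: rebuild signed 16-bit samples from R and G."""
    samples = []
    for r, g, _b in pixels:
        u = (r << 8) | g
        samples.append(u - 0x10000 if u >= 0x8000 else u)
    return samples


def encode_png(pixels, width=SIDE, height=SIDE):
    """Serialize RGB pixels as an 8-bit truecolor, non-interlaced PNG byte string."""
    def chunk(tag, data):
        crc = zlib.crc32(tag + data) & 0xFFFFFFFF
        return struct.pack(">I", len(data)) + tag + data + struct.pack(">I", crc)

    raw = bytearray()
    for y in range(height):
        raw.append(0)  # filter type 0 (None) at the start of every scanline
        for r, g, b in pixels[y * width:(y + 1) * width]:
            raw.extend((r, g, b))

    # IHDR: width, height, bit depth 8, color type 2 (RGB), default compression/filter, no interlace
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    return (b"\x89PNG\r\n\x1a\n"
            + chunk(b"IHDR", ihdr)
            + chunk(b"IDAT", zlib.compress(bytes(raw)))
            + chunk(b"IEND", b""))


if __name__ == "__main__":
    demo = [0, 1, -1, 32767, -32768]       # a few hand-picked sample values
    px = samples_to_pixels(demo)
    with open("demo.png", "wb") as fh:     # hypothetical output file name
        fh.write(encode_png(px))
```

Because the mapping is lossless, pixels_to_samples recovers the original samples exactly, which matters if generated images are later turned back into audio for Goal 2. In practice the samples could be read from a WAV file with Python's standard wave module; that step is left out here to keep the sketch short.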