I trained an LSTM-convnet encoder-decoder network that maps text input to image output. It works reasonably well; see /examples for sample outputs. Dataset: http://vision.cs.uiuc.edu/pascal-sentences/
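For readers unfamiliar with this kind of architecture, here is a minimal sketch of an LSTM text encoder feeding a transposed-convolution image decoder, written in PyTorch. This is illustrative only: the layer sizes, vocabulary size, and 64x64 output resolution are assumptions, not the configuration used in this repo.

```python
import torch
import torch.nn as nn

class TextToImage(nn.Module):
    # Hypothetical sketch: LSTM encoder + deconv decoder.
    # All dimensions are illustrative, not taken from this repo.
    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Project the final LSTM state to a small feature map, then
        # upsample with transposed convolutions to a 64x64 RGB image.
        self.project = nn.Linear(hidden_dim, 128 * 8 * 8)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),  # 8 -> 16
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),   # 16 -> 32
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),    # 32 -> 64
            nn.Sigmoid(),  # pixel values in [0, 1]
        )

    def forward(self, tokens):
        # tokens: (batch, seq_len) integer ids
        _, (h, _) = self.encoder(self.embed(tokens))
        x = self.project(h[-1]).view(-1, 128, 8, 8)
        return self.decoder(x)

model = TextToImage()
images = model(torch.randint(0, 1000, (2, 7)))  # batch of 2 captions, 7 tokens each
# images has shape (2, 3, 64, 64)
```

Training such a model on Pascal1k would pair each caption with its image and minimize a pixel reconstruction loss; the scripts below handle the actual pipeline.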
Note: this is hackathon-level code. To run:

python pascal1k.py   # download Pascal1k to /data/pascal1k
python run.py        # run the model
Examples of other models I've tried are in /models.