Deep CNN-LSTM for Generating Image Descriptions
Key words: Image captioning, image description generator, explain image, merge model, deep learning, long short-term memory, recurrent neural network, convolutional neural network, word by word, word embedding, BLEU score.
Image captioning is an interesting problem in machine learning, and with the development of deep neural networks, deep learning approaches now define the state of the art. The main task is to automatically generate a description of an image, which requires understanding its content. Several end-to-end models have been introduced, such as Google NIC (Show and Tell), Montreal NIC (Show, Attend and Tell), LRCN, and m-RNN; these are called inject models because they feed image features through the RNN. In 2017, Marc Tanti et al. introduced the merge model in their paper "What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator?". Its main idea is to keep the CNN and RNN separate and merge their outputs only at the end, where a softmax layer makes the prediction. Based on it, we develop our model to generate image captions.
- Combine a ConvNet with an LSTM
- Deep ConvNet as the image encoder
- Language LSTM as the text encoder
- Fully connected layer as the decoder
- End-to-end model I -> S
- Maximize P(S|I) (see the sketch below)
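A minimal Keras sketch of this merge architecture. The 256-d hidden size and 0.5 dropout follow common merge-model implementations; the 4096-d image feature (VGG16 fc2), vocabulary size, and maximum caption length are illustrative assumptions, not values fixed by this project.

```python
from tensorflow.keras.layers import LSTM, Dense, Dropout, Embedding, Input, add
from tensorflow.keras.models import Model

vocab_size = 7579  # assumed; depends on the tokenizer
max_length = 34    # assumed; longest caption in the training set

# Image encoder: a pre-extracted CNN feature vector, projected to 256-d.
inputs1 = Input(shape=(4096,))
fe1 = Dropout(0.5)(inputs1)
fe2 = Dense(256, activation='relu')(fe1)

# Text encoder: the embedded partial caption run through an LSTM.
inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
se2 = Dropout(0.5)(se1)
se3 = LSTM(256)(se2)

# Merge the two encodings and decode with a softmax over the vocabulary.
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation='relu')(decoder1)
outputs = Dense(vocab_size, activation='softmax')(decoder2)

model = Model(inputs=[inputs1, inputs2], outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam')
```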
Dataset: Flickr8k, split into train/val/test at a 6:1:1 ratio (6,000/1,000/1,000 images).
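A short sketch of loading that split, assuming the split files shipped with the standard Flickr8k text archive (file names assumed, not taken from this repo):

```python
def load_split(filename):
    """Return the set of image file names listed in a split file."""
    with open(filename) as f:
        return {line.strip() for line in f if line.strip()}

train = load_split('Flickr_8k.trainImages.txt')  # 6,000 images
val   = load_split('Flickr_8k.devImages.txt')    # 1,000 images
test  = load_split('Flickr_8k.testImages.txt')   # 1,000 images
```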
The workflow is split across Kaggle notebooks:
- Load images and extract features: kaggle-kernel
- Load text data: kaggle-kernel
- Develop and train the model: kaggle-kernel
- Evaluate the model: kaggle-kernel
- Generate captions for new images: kaggle-kernel (a greedy decoding sketch follows this list)
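Caption generation is word by word: feed the image feature plus the caption so far, take the most probable next word, and repeat until the end token or the length limit. A hedged sketch; `model`, `tokenizer`, `max_length`, and the `startseq`/`endseq` markers are assumed to come from the training notebooks.

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, photo_feature, max_length):
    """Greedy word-by-word decoding for one image feature of shape (1, D)."""
    index_word = {i: w for w, i in tokenizer.word_index.items()}
    caption = 'startseq'
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([caption])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        probs = model.predict([photo_feature, seq], verbose=0)
        word = index_word.get(int(np.argmax(probs)))
        if word is None or word == 'endseq':
            break
        caption += ' ' + word
    return caption.replace('startseq ', '')
```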
We experiment with four CNN encoders:
- VGG16
- ResNet50
- DenseNet121
- InceptionV3
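A minimal feature-extraction sketch using VGG16; the other encoders drop in the same way but yield different feature sizes (e.g. 2048-d for ResNet50 and InceptionV3). The file name and the choice of the fc2 layer are illustrative assumptions.

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.image import img_to_array, load_img

base = VGG16()
# Drop the 1000-way classifier; keep the 4096-d fc2 activations as features.
encoder = Model(inputs=base.inputs, outputs=base.layers[-2].output)

img = img_to_array(load_img('example.jpg', target_size=(224, 224)))
img = preprocess_input(np.expand_dims(img, axis=0))
feature = encoder.predict(img, verbose=0)  # shape (1, 4096)
```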
We compare four optimizers:
- Adam
- Nadam
- RMSprop
- SGD
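Swapping optimizers only changes the compile step. A sketch, assuming the `model` from the architecture sketch above; the learning rates shown are the Keras defaults, not values tuned by this project.

```python
from tensorflow.keras.optimizers import SGD, Adam, Nadam, RMSprop

optimizers = {
    'adam': Adam(learning_rate=0.001),
    'nadam': Nadam(learning_rate=0.001),
    'rmsprop': RMSprop(learning_rate=0.001),
    'sgd': SGD(learning_rate=0.01),
}
# Pick one optimizer and recompile before training.
model.compile(loss='categorical_crossentropy', optimizer=optimizers['adam'])
```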
We use the BLEU score as the evaluation metric (an NLTK sketch follows the results):
- BLEU-1: 0.542805
- BLEU-2: 0.301714
- BLEU-3: 0.207351
- BLEU-4: 0.095704
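Scores like those above can be computed with NLTK's `corpus_bleu`. A minimal sketch; `references` (for each test image, a list of tokenized reference captions) and `hypotheses` (the tokenized generated captions) are assumed to be prepared elsewhere.

```python
from nltk.translate.bleu_score import corpus_bleu

# Standard cumulative n-gram weights for BLEU-1 through BLEU-4.
print('BLEU-1: %f' % corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0)))
print('BLEU-2: %f' % corpus_bleu(references, hypotheses, weights=(0.5, 0.5, 0, 0)))
print('BLEU-3: %f' % corpus_bleu(references, hypotheses, weights=(0.3, 0.3, 0.3, 0)))
print('BLEU-4: %f' % corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25)))
```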