Image captioning is the automatic generation of natural-language sentences that describe the contents of a given image. The project was implemented in Keras with TensorFlow as backend. The data comes from the Flickr Image dataset, an open dataset of around 30k images of people and animals, available online.

Two types of neural network were used: a Convolutional Neural Network to extract features from the images, and a Recurrent Neural Network as a decoder. The Recurrent Neural Network outputs a probability for each word in the vocabulary; given a new image as input, a greedy algorithm then builds the caption iteratively, choosing at each step the word with the highest probability.

The achieved results are not perfect but quite acceptable: the generated captions are always related to the given image, and subject, object, and action are often recognized very well, although mistakes do occur. To evaluate the model, the BLEU (Bilingual Evaluation Understudy) metric was used, together with a custom metric based on the number of words shared between the auto-generated caption and the reference captions. Some examples of auto-generated captions (both correct and incorrect) are listed below.
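The text does not name the specific CNN used for feature extraction; a minimal sketch of that step, assuming a pretrained Keras InceptionV3 with the classification head removed, could look like this:

```python
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.preprocessing import image

# Pretrained CNN with the classification layer removed; pooling="avg"
# yields a single 2048-dimensional feature vector per image.
encoder = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

def extract_features(img_path):
    # InceptionV3 expects 299x299 RGB input.
    img = image.load_img(img_path, target_size=(299, 299))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return encoder.predict(x)[0]  # shape: (2048,)
```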
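The exact decoder architecture is likewise not given; a common CNN-encoder/RNN-decoder layout for this task is the "merge" model, sketched below under assumed hyperparameters (`vocab_size`, `max_len`, `feat_dim`, `embed_dim` are hypothetical names):

```python
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, Dropout, add
from tensorflow.keras.models import Model

def build_decoder(vocab_size, max_len, feat_dim=2048, embed_dim=256):
    # Image branch: project CNN features into the decoder's space.
    img_in = Input(shape=(feat_dim,))
    img_vec = Dense(embed_dim, activation="relu")(Dropout(0.5)(img_in))
    # Text branch: embed the partial caption and run it through an LSTM.
    txt_in = Input(shape=(max_len,))
    txt_vec = LSTM(embed_dim)(
        Embedding(vocab_size, embed_dim, mask_zero=True)(txt_in))
    # Merge both branches and predict a probability for every word.
    merged = Dense(embed_dim, activation="relu")(add([img_vec, txt_vec]))
    out = Dense(vocab_size, activation="softmax")(merged)
    model = Model(inputs=[img_in, txt_in], outputs=out)
    model.compile(loss="categorical_crossentropy", optimizer="adam")
    return model
```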
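The greedy decoding described above can be sketched as follows; `model`, `tokenizer`, `max_len`, and the start/end tokens are assumed names for the trained decoder, the fitted Keras tokenizer, the longest training caption, and the sequence delimiters used during training:

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def greedy_caption(model, tokenizer, photo_features, max_len,
                   start_token="startseq", end_token="endseq"):
    # Build the caption word by word, always taking the most probable word.
    caption = start_token
    for _ in range(max_len):
        # Encode the caption generated so far and pad to a fixed length.
        seq = tokenizer.texts_to_sequences([caption])[0]
        seq = pad_sequences([seq], maxlen=max_len)
        # The decoder returns one probability per vocabulary word.
        probs = model.predict([np.array([photo_features]), seq], verbose=0)[0]
        word = tokenizer.index_word.get(np.argmax(probs))
        if word is None or word == end_token:
            break
        caption += " " + word
    return caption.replace(start_token, "").strip()
```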
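For the evaluation step, a minimal sketch using NLTK's `sentence_bleu`; the custom metric's exact definition is not given in the text, so the word-overlap ratio below is an assumption (the fraction of generated words that appear in at least one reference caption):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def evaluate_caption(generated, references):
    # Score one generated caption against its reference captions,
    # returning (BLEU score, word-overlap ratio).
    gen_words = generated.lower().split()
    ref_words = [r.lower().split() for r in references]
    bleu = sentence_bleu(ref_words, gen_words,
                         smoothing_function=SmoothingFunction().method1)
    # Assumed custom metric: share of generated words found in any reference.
    common = {w for ref in ref_words for w in ref} & set(gen_words)
    overlap = len(common) / len(set(gen_words)) if gen_words else 0.0
    return bleu, overlap
```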
Some references: Medium, towardsdatascience, and machinelearningmastery. For full references and details about the project, read this short article.