Image captioning is a task that lies at the intersection of object detection and natural language processing. We propose a model that draws on both computer vision (CV) and NLP to automatically generate captions for a given image. The model mimics the human visual system, which describes image content by attending to salient regions rather than processing the whole scene uniformly. The main idea is that, instead of focusing on the entire image, it is better to focus on particular areas, namely the regions where objects are present. The model consists of two sub-models. The first sub-model, the encoder, performs object detection: it identifies the objects in the given image along with their spatial locations and produces annotation vectors composed of object features and their spatial features. The second sub-model, the decoder, is an RNN-based LSTM network combined with an attention network; at each time step, the attention network produces a context vector from the annotation vectors, and the LSTM takes this context vector together with its other inputs to generate the caption word by word. Experimental results on the MSCOCO dataset show that our model outperforms previous benchmark models.
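A minimal sketch of the two pieces described above, written in PyTorch: building annotation vectors by concatenating detector features with normalized box coordinates, and one decoder step in which additive attention over those annotation vectors yields a context vector that is fed to the LSTM alongside the previous word embedding. All module names, dimensions, and the choice of additive (Bahdanau-style) attention are illustrative assumptions, not taken verbatim from the repository.

```python
import torch
import torch.nn as nn


def make_annotations(roi_feats, boxes, image_size):
    """Concatenate object features with normalized spatial features.

    roi_feats:  (batch, num_objects, feat_dim) pooled detector features (assumed given)
    boxes:      (batch, num_objects, 4) as (x1, y1, x2, y2) in pixels
    image_size: (width, height) used to normalize coordinates to [0, 1]
    """
    w, h = image_size
    spatial = boxes / torch.tensor([w, h, w, h], dtype=boxes.dtype)
    return torch.cat([roi_feats, spatial], dim=-1)  # annot_dim = feat_dim + 4


class AttentionDecoderStep(nn.Module):
    """One LSTM decoding step with attention over annotation vectors."""

    def __init__(self, annot_dim, embed_dim=512, hidden_dim=512,
                 vocab_size=10000, attn_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Additive attention: score each annotation against the hidden state
        self.attn_annot = nn.Linear(annot_dim, attn_dim)
        self.attn_hidden = nn.Linear(hidden_dim, attn_dim)
        self.attn_score = nn.Linear(attn_dim, 1)
        # LSTM input = previous word embedding concatenated with context vector
        self.lstm = nn.LSTMCell(embed_dim + annot_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_word, annotations, h, c):
        # annotations: (batch, num_objects, annot_dim)
        scores = self.attn_score(torch.tanh(
            self.attn_annot(annotations) + self.attn_hidden(h).unsqueeze(1)
        ))                                           # (batch, num_objects, 1)
        alpha = torch.softmax(scores, dim=1)         # weights over detected objects
        context = (alpha * annotations).sum(dim=1)   # (batch, annot_dim) context vector
        x = torch.cat([self.embed(prev_word), context], dim=1)
        h, c = self.lstm(x, (h, c))
        return self.out(h), h, c, alpha.squeeze(-1)  # logits, new state, weights
```

At inference time one would run this step in a loop, feeding the predicted word back in as `prev_word` until an end-of-sentence token is produced; the returned attention weights also make it possible to visualize which detected object the model attended to for each generated word.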