
ImageCaption

Image captioning model in PyTorch using ResNet50 as the encoder and LSTM as the decoder.

Update - important:

The main_updated.py takes a different approach to training:

  • The LSTM layer takes the input and computes an output of length SEQ_LENGTH (instead of length 1 as in main.py)
  • to make this work, the features have dimension (SEQ_LENGTH, BATCH, IMAGE_EMB_DIM) (instead of (1, BATCH, IMAGE_EMB_DIM)) so that they can be concatenated with the emb_captions_batch of size (SEQ_LENGTH, BATCH, WORD_EMB_DIM) (see the sketch below this list)
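
A minimal shape sketch of that concatenation (pure illustration with random tensors; the variable names follow the README, not necessarily the actual code):

```python
import torch

SEQ_LENGTH, BATCH = 20, 32
IMAGE_EMB_DIM, WORD_EMB_DIM = 256, 256

# encoder output for one batch of images: (1, BATCH, IMAGE_EMB_DIM)
features = torch.randn(1, BATCH, IMAGE_EMB_DIM)

# repeat the image features along the time axis so they can be
# concatenated with the embedded captions at every time step
features = features.expand(SEQ_LENGTH, BATCH, IMAGE_EMB_DIM)

# embedded captions: (SEQ_LENGTH, BATCH, WORD_EMB_DIM)
emb_captions_batch = torch.randn(SEQ_LENGTH, BATCH, WORD_EMB_DIM)

# LSTM input: (SEQ_LENGTH, BATCH, IMAGE_EMB_DIM + WORD_EMB_DIM)
lstm_input = torch.cat((features, emb_captions_batch), dim=2)
print(lstm_input.shape)  # torch.Size([20, 32, 512])
```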

The checkpoints folder contains weights trained on the new network, prefixed with NEW_.

Dataset

  • You can download the images and captions here.
  • Create a dedicated folder for the images in the ROOT path (for example 'images').
  • Captions were split 70/30 into train_list.txt and val_list.txt.

How to run this code

You'll need Git to be installed on your computer.

# Clone this repository
$ git clone https://github.com/natasabrisudova/ImageCaption_Flickr8k

I have generated requirements.txt using pipreqs:

> python -m  pipreqs.pipreqs --encoding utf-8 C:/Users/natas/NN_projects/ImageCaption_Flickr8k/code/

Note: I prefer pipreqs over pip freeze, as pip freeze saves all packages in the environment, including those you don't use in your current project, whereas pipreqs saves only the ones your project actually uses.


To install the requirements use:

> pip3 install -r requirements.txt

Vocabulary

To build a vocabulary with word2index and index2word dictionaries, run:

> python vocab.py captions.txt vocabulary.txt 5000

where the first argument is the text file from which the vocabulary will be built, the second is the text file in which the word2index dictionary will be saved, and the last is the vocabulary size (including the 4 predefined tokens: <pad>, <sos>, <eos> and <unk>).
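
For intuition, a rough sketch of what such a vocabulary build could look like (this is an assumption about vocab.py: it keeps the most frequent words and reserves the first four indices for the special tokens):

```python
from collections import Counter

PAD, SOS, EOS, UNK = "<pad>", "<sos>", "<eos>", "<unk>"

def build_vocab(captions_file, vocab_size):
    # count word frequencies over all captions
    counter = Counter()
    with open(captions_file, encoding="utf-8") as f:
        for line in f:
            counter.update(line.lower().split())

    # reserve indices 0-3 for the special tokens, fill the rest
    # with the most frequent words up to vocab_size
    words = [PAD, SOS, EOS, UNK]
    words += [w for w, _ in counter.most_common(vocab_size - len(words))]

    word2index = {w: i for i, w in enumerate(words)}
    index2word = {i: w for w, i in word2index.items()}
    return word2index, index2word
```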


Dataset

The custom class ImageCaptionDataset() holds a list of samples, where each sample is a dictionary containing the image file ID and the caption of that image as a list of word indices. Each caption is wrapped with an <sos> token at the beginning and an <eos> token at the end.

The __getitem__ method returns the image (preprocessed, as a tensor) and the caption as a list of word indices.
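
A simplified sketch of such a dataset (the sample keys and the load_image helper are illustrative, not the project's exact names):

```python
import torch
from torch.utils.data import Dataset

class ImageCaptionDataset(Dataset):
    """Simplified sketch: each sample pairs an image with one caption."""

    def __init__(self, samples, load_image):
        # samples: list of dicts like {"image_id": ..., "caption": [word indices]},
        # already wrapped with <sos>/<eos>; load_image turns an ID into a tensor
        self.samples = samples
        self.load_image = load_image

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        sample = self.samples[idx]
        image = self.load_image(sample["image_id"])   # preprocessed image tensor
        caption = torch.tensor(sample["caption"])     # caption as word indices
        return image, caption
```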


The function:

get_data_loader(train_data, batch_size = config.BATCH, pad_index = vocab.PADDING_INDEX)

returns a data loader in which every caption in a batch is padded with vocab.PADDING_INDEX (0 in this case) up to the length of the longest caption in the batch (using pad_sequence).
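
A hedged sketch of how such padding is typically done with pad_sequence in a collate function (the function below is illustrative; get_data_loader may be implemented differently):

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def collate_fn(batch, pad_index=0):
    # batch is a list of (image, caption) pairs from ImageCaptionDataset
    images = torch.stack([image for image, _ in batch])
    # pad every caption to the length of the longest caption in the batch
    captions = pad_sequence([caption for _, caption in batch],
                            padding_value=pad_index)   # (max_len, BATCH)
    return images, captions

# loader = DataLoader(train_data, batch_size=32, shuffle=True, collate_fn=collate_fn)
```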


You can check the data loader by running dataset.py


Model

The whole model consists of 3 parts:

  • encoder
  • embeddings
  • decoder

Running model_updated.py will perform one forward pass of the whole model (with randomly initialized inputs). The results might help you understand the output dimensions better.


Encoder

The image encoder is used to obtain features from images. It consists of a pretrained ResNet50 model with the last layer removed, followed by a linear layer with output dimension IMAGE_EMB_DIM.
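
A sketch of such an encoder, assuming the standard torchvision ResNet50 (freezing the backbone is an assumption, not something this README states):

```python
import torch.nn as nn
import torchvision.models as models

class Encoder(nn.Module):
    """Sketch: pretrained ResNet50 without its classification head + projection."""

    def __init__(self, image_emb_dim=256):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # drop the final fully connected layer, keep the 2048-d pooled features
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        for p in self.backbone.parameters():
            p.requires_grad = False          # assumption: backbone kept frozen
        self.fc = nn.Linear(resnet.fc.in_features, image_emb_dim)

    def forward(self, images):                      # images: (BATCH, 3, H, W)
        feats = self.backbone(images).flatten(1)    # (BATCH, 2048)
        return self.fc(feats)                       # (BATCH, IMAGE_EMB_DIM)
```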

Embeddings

The embedding layer is used to obtain an embedded representation (a dense vector) of the captions with dimension WORD_EMB_DIM. During training, the embedding layer is updated to learn better word representations through the optimization process.
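
In PyTorch terms this is an nn.Embedding lookup; a minimal sketch with the dimensions from config.py (padding_idx=0 is an assumption based on PADDING_INDEX being 0):

```python
import torch
import torch.nn as nn

VOCAB_SIZE, WORD_EMB_DIM = 3000, 256

# trainable lookup table: word index -> dense vector of size WORD_EMB_DIM
embedding = nn.Embedding(VOCAB_SIZE, WORD_EMB_DIM, padding_idx=0)

captions = torch.randint(0, VOCAB_SIZE, (20, 32))   # (SEQ_LENGTH, BATCH)
emb_captions = embedding(captions)                  # (SEQ_LENGTH, BATCH, WORD_EMB_DIM)
```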

Decoder

The decoder's LSTM layer takes as input the concatenation of the features obtained from the encoder and the embedded captions obtained from the embedding layer. The hidden and cell states are zero-initialized. The final classifier is a linear layer with output dimension VOCAB_SIZE.
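
A simplified sketch of such a decoder with the dimensions from config.py (illustrative, not the exact module):

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Sketch of the decoder: LSTM over concatenated inputs + linear classifier."""

    def __init__(self, hidden_dim=512, vocab_size=3000,
                 image_emb_dim=256, word_emb_dim=256, num_layers=1):
        super().__init__()
        self.num_layers = num_layers
        self.hidden_dim = hidden_dim
        self.lstm = nn.LSTM(image_emb_dim + word_emb_dim, hidden_dim,
                            num_layers=num_layers)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, features, emb_captions):
        # features:     (SEQ_LENGTH, BATCH, IMAGE_EMB_DIM)
        # emb_captions: (SEQ_LENGTH, BATCH, WORD_EMB_DIM)
        lstm_input = torch.cat((features, emb_captions), dim=2)
        batch = lstm_input.size(1)
        # zero-initialized hidden and cell states
        h0 = torch.zeros(self.num_layers, batch, self.hidden_dim,
                         device=lstm_input.device)
        c0 = torch.zeros(self.num_layers, batch, self.hidden_dim,
                         device=lstm_input.device)
        out, _ = self.lstm(lstm_input, (h0, c0))
        return self.fc(out)                 # (SEQ_LENGTH, BATCH, VOCAB_SIZE)
```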

(old version: main.py) Note: during training and evaluation, the dimension of the embedded captions before the concatenation is (length = 1, BATCH, WORD_EMB_DIM), and the dimension of the features is (1, BATCH, IMAGE_EMB_DIM). The hidden and cell states are initialized to a tensor of size (NUM_LAYER, BATCH, HIDDEN_DIM), where HIDDEN_DIM = IMAGE_EMB_DIM + WORD_EMB_DIM. This approach, however, does not bring out the full potential of the LSTM. Please check main_updated.py for a better approach, where the whole sentence is concatenated with the features. I keep this code for educational purposes only, so that others might learn from my mistakes.


Configurations

Before running the other files, check config.py and adjust it to your needs and setup:

self.DEVICE = torch.device("cuda:0")
        
self.BATCH = 32
self.EPOCHS = 5
        
self.VOCAB_FILE = 'word2index3000.txt'
self.VOCAB_SIZE = 3000
        
self.NUM_LAYER = 1
self.IMAGE_EMB_DIM = 256
self.WORD_EMB_DIM = 256
self.HIDDEN_DIM = 512
self.LR = 0.001
        
self.EMBEDDING_WEIGHT_FILE = 'checkpoints/embeddings-32B-512H-1L-e5.pt'
self.ENCODER_WEIGHT_FILE = 'checkpoints/encoder-32B-512H-1L-e5.pt'
self.DECODER_WEIGHT_FILE = 'checkpoints/decoder-32B-512H-1L-e5.pt'
        
self.ROOT = os.path.join(os.path.expanduser('~'), 'NN_projects', 'ImageCaption_Flickr8k') 

If not done already, create the folders 'checkpoints' and 'saved' inside the 'code' folder to store the weights of the trained model and the loss & accuracy plots, respectively.


Training and evaluation

To train the model, run main_updated.py. After training, you can visualize the results on the validation data by running test_show.py. It will show the image along with a title containing the real captions, the generated captions and the BLEU-1 and BLEU-2 scores.
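
For reference, BLEU-1 and BLEU-2 for a single caption pair can be computed roughly like this (assuming NLTK; test_show.py may compute the scores differently):

```python
from nltk.translate.bleu_score import sentence_bleu

references = [["a", "dog", "runs", "through", "the", "grass"]]   # real caption(s), tokenized
candidate = ["a", "dog", "is", "running", "in", "the", "grass"]  # generated caption

bleu1 = sentence_bleu(references, candidate, weights=(1.0, 0, 0, 0))
bleu2 = sentence_bleu(references, candidate, weights=(0.5, 0.5, 0, 0))
print(bleu1, bleu2)
```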

Captions are generated word by word, starting with the <sos> token. Each predicted word ID is appended to the sequence and fed back as the next LSTM input.
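
A hedged sketch of that greedy decoding loop, reusing the Decoder sketch above (vocab.SOS_INDEX, vocab.EOS_INDEX and vocab.index2word are illustrative names, not necessarily the project's attributes):

```python
import torch

@torch.no_grad()
def generate_caption(image, encoder, embedding, decoder, vocab, max_len=20):
    """Greedy word-by-word decoding sketch."""
    features = encoder(image.unsqueeze(0)).unsqueeze(0)    # (1, 1, IMAGE_EMB_DIM)
    words = [vocab.SOS_INDEX]
    for _ in range(max_len):
        tokens = torch.tensor(words).unsqueeze(1)          # (t, 1)
        emb = embedding(tokens)                            # (t, 1, WORD_EMB_DIM)
        feats = features.expand(emb.size(0), 1, -1)        # repeat image features per step
        logits = decoder(feats, emb)                       # (t, 1, VOCAB_SIZE)
        next_word = logits[-1, 0].argmax().item()          # most likely next word
        words.append(next_word)
        if next_word == vocab.EOS_INDEX:
            break
    return [vocab.index2word[i] for i in words
            if i not in (vocab.SOS_INDEX, vocab.EOS_INDEX)]
```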

Model with B = 32 and HIDDEN_DIM = 512:

(example images)


Generating captions on a sample image

Run predict_sample.py sample_image.jpg to generate captions for an image (located in the ROOT path).

Model with B = 32 and HIDDEN_DIM = 512:

(example images)