locuslab/TCN

Recommendation for image to text

addisonklinke opened this issue · 5 comments

My goal is to train a model that can output sequences of text from image inputs. Using the IAM handwriting dataset for example, we would pass the model an image

[image: IAM handwriting line reading "broadcast and television report on his"]

and expect it to return "broadcast and television report on his". Historically, the common (i.e. recurrent) way to accomplish this would be an encoder (CNN) + decoder (LSTM) architecture like OpenNMT's implementation. I am interested in replacing the decoder with a TCN, but am unsure how to approach the image data. The CNN encoder will create a batch of N feature maps with reduced spatial dimensions (H', W')

[image: diagram of the CNN encoder producing N feature maps of reduced spatial size (H', W')]

The issue is that a TCN expects 3D tensors (N, L, C), whereas the encoded image is 4D (N, H, W, C), i.e. each "timestep" would be a 2D slice. Following the p-MNIST example in the paper, we could flatten the image into a 1D sequence of length H' x W'. The TCN would then effectively snake through the pseudo-timesteps like below

[image: diagram of the TCN snaking through the flattened H' x W' pseudo-timesteps]
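For reference, the flattening itself is just a reshape of the encoder output into one long pseudo-time axis (the shapes below are made up for illustration):

```python
import torch

feats = torch.randn(4, 64, 8, 100)   # (N, C, H', W') from the CNN encoder
seq = feats.flatten(start_dim=2)     # (N, C, H' * W'): one long row-major pseudo-time axis
print(seq.shape)                     # torch.Size([4, 64, 800])
```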

However, if we want one prediction per timestep, it makes much more sense to define a left-to-right sequence instead of a snaking one, since that is the direction the text runs in the image. Did you experiment at all with image-to-text models, and if so, how did you choose to represent the images?

I also wonder about the loss function for training a TCN decoder. Assuming you divide the image width into more timesteps than your maximum expected sequence length, it seems like connectionist temporal classification (CTC) would be a good choice. Then you do not have to worry about alignment between the target sequence and the model's prediction. For instance, "bbb--ee-cau--sssss----e" would be collapsed to "because" by merging neighboring duplicates and then removing blanks. Do you agree, or is there a different loss function you would suggest?
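For concreteness, here is a tiny, self-contained illustration of that collapse rule (the `-` blank symbol and the helper name are just for this example):

```python
# Illustrative CTC collapse: drop characters that repeat the previous symbol, then drop blanks
def ctc_collapse(seq, blank="-"):
    out = []
    prev = None
    for ch in seq:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

print(ctc_collapse("bbb--ee-cau--sssss----e"))  # "because"
```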

You should be able to use a TCN as a direct replacement for the LSTM interface (so if you "snaked through" the image with an LSTM, you can also do it with a TCN). However, since the handwriting images read left to right, I would start with a stack of 2D convolutions that collapse the "height" dimension. In particular, given an image of shape HxWxC, you can probably apply a series of Conv2D(in_chan, out_chan, kernel_size=3, stride=(2,1), padding=1) layers, so that after a few such downsamplings the hidden representation has shape 1xWxC. You should then be able to pass it to a Conv1D sequence model like the TCN (as the width is now the length).
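A minimal PyTorch sketch of that idea (the module name, channel sizes, class count, and number of downsampling blocks are all hypothetical; `TemporalConvNet` refers to the module in this repo's `tcn.py`, which takes input shaped (N, C, L)):

```python
import torch
import torch.nn as nn

from tcn import TemporalConvNet  # the module defined in this repo's tcn.py


class ImageToSequence(nn.Module):
    """Collapse the height with stride=(2, 1) convs, then treat the width as the time axis."""

    def __init__(self, in_chan=1, hidden=64, num_classes=80):
        super().__init__()
        # Each block halves the height (stride 2) while preserving the width (stride 1)
        self.encoder = nn.Sequential(
            nn.Conv2d(in_chan, hidden, kernel_size=3, stride=(2, 1), padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, stride=(2, 1), padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, stride=(2, 1), padding=1),
            nn.ReLU(),
            # ...add blocks until the height reaches 1 for your input size
        )
        self.tcn = TemporalConvNet(num_inputs=hidden, num_channels=[hidden] * 4)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, x):            # x: (N, C, H, W)
        feats = self.encoder(x)      # (N, hidden, H', W) -- width unchanged
        feats = feats.mean(dim=2)    # collapse any leftover height: (N, hidden, W)
        seq = self.tcn(feats)        # (N, hidden, W) -- width acts as the sequence length
        seq = seq.transpose(1, 2)    # (N, W, hidden)
        return self.classifier(seq)  # (N, W, num_classes): one prediction per column
```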

Since alignment will be important in your particular application, I think CTC would be a good choice. However, you can also always do it the traditional way: compress the image into a single vector and feed it to the TCN one time step at a time. (And to answer your question: no, I have not experimented with image-to-text models.)
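A hedged sketch of CTC training with PyTorch's `nn.CTCLoss` (the batch size, timestep count, and charset size below are placeholders; the key constraint is that the number of timesteps T must be at least as long as the longest target transcription):

```python
import torch
import torch.nn as nn

N, T, num_classes = 8, 256, 80                  # batch, timesteps (feature-map width), charset + blank
logits = torch.randn(N, T, num_classes, requires_grad=True)   # stand-in for the model output
log_probs = logits.log_softmax(dim=2).transpose(0, 1)         # nn.CTCLoss wants (T, N, num_classes)

targets = torch.randint(1, num_classes, (N, 30))              # padded integer transcriptions (0 = blank)
input_lengths = torch.full((N,), T, dtype=torch.long)         # all T timesteps are valid here
target_lengths = torch.randint(10, 31, (N,), dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```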

Let me know if this helps!

@jerrybai1995 Thank you for the quick response! I see your point about collapsing the height to 1 via convolutions; however, for images with a more square aspect ratio than my example I do not think this would be feasible. As the height shrinks, the width (i.e. sequence length) dimension is reduced as well. If that becomes shorter than your maximum target sequence length, then you do not have enough timesteps to make a full prediction and the CTC loss would not apply.

A few options come to mind:

  1. Convolve until the width is at the minimum acceptable value (relative to target lengths), and then concatenate the height dimension into one long vector
  2. Same as above, but have some simple linear layer(s) map the column features into a single dimension
  3. Selectively upsample the width after the height has been reduced to one - although I am not really sure how you would implement this with convolutions, and it would probably distort the feature maps

Do any of those sound better than the others?

Also, in OpenNMT's recurrent architecture the decoder LSTM does "snake" through the image in a sense. However, it uses an attention mechanism, so at each decoder timestep it can choose to focus selectively on different rows of the feature volume. Correct me if I am wrong, but I do not think the TCN needs anything like attention, because it has access to all the timesteps at once, whereas the decoder LSTM has to handle them one by one?

Why is the width reduced as the height shrinks? As I mentioned, if you take stride=(2,1) (i.e., stride 2 on the height, stride 1 on the width), the width dimension will NOT change, but the height dimension will be reduced by a factor of 2 at each layer.

This should save you from the three proposed options, I think.
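A quick shape check illustrating this (the channel count and input size are arbitrary):

```python
import torch
import torch.nn as nn

# stride=(2, 1) halves the height but leaves the width (the future sequence length) intact
x = torch.randn(1, 64, 32, 400)   # (N, C, H=32, W=400)
conv = nn.Conv2d(64, 64, kernel_size=3, stride=(2, 1), padding=1)
print(conv(x).shape)              # torch.Size([1, 64, 16, 400])
```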

I see now; sorry, I glossed over your stride specification.

What are your thoughts about the attention mechanism?

> Also, in OpenNMT's recurrent architecture the decoder LSTM does "snake" through the image in a sense. However, it uses an attention mechanism, so at each decoder timestep it can choose to focus selectively on different rows of the feature volume. Correct me if I am wrong, but I do not think the TCN needs anything like attention, because it has access to all the timesteps at once, whereas the decoder LSTM has to handle them one by one?

I also found a paper implementing a similar conv2conv architecture (except for video captioning). They use an attention mechanism, discussed on p. 5, and their decoder seems quite similar in concept to a TCN. The main addition looks to be the temporal deformations in their encoder; however, they do not have an ablation study as substantial as the one in your paper to show whether those are really necessary or whether simple convolutions would suffice.

But still, in the decoding process you would want to do things in a generative, one-by-one way: you first generate t=1, and then using that output you can generate t=2, and so forth. Although a TCN does have access to all the time steps during training, at generation time you do not have all the time steps; you still need to generate them (t=1 first, then t=2, etc.).
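For example, a greedy decoding loop for a causal TCN decoder might look like the sketch below (the `tcn_decoder` and `embed` modules, token IDs, and maximum length are all hypothetical; the decoder is assumed to output vocabulary logits for each step):

```python
import torch

def greedy_decode(tcn_decoder, embed, start_token=0, eos_token=1, max_len=40):
    """Generate one token at a time by re-running the causal TCN on the growing prefix."""
    tokens = [start_token]
    for _ in range(max_len):
        inp = embed(torch.tensor(tokens).unsqueeze(0))    # (1, t, emb)
        out = tcn_decoder(inp.transpose(1, 2))            # (1, vocab, t), causal over time
        next_token = out[:, :, -1].argmax(dim=1).item()   # prediction for the next step only
        tokens.append(next_token)
        if next_token == eos_token:
            break
    return tokens[1:]
```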