kermitt2/delft

Implement sliding window

Opened this issue · 6 comments

I thought it might be better to discuss the sliding window in a separate issue.

#44 (comment)

I was just considering whether we need sliding windows to avoid having to use a really large max_sequence_length.

#44 (comment)

As you can see, it's not related to the sliding window we have in CRF. With CRF, we always have a contextual window centered on the current "target" token to be labelled, and the CRF template is used to determine the size of the window.

With the DL approach, we have a prediction on a complete sequence, without any sliding window, and the whole sequence is involved when training the weights or outputting something. For very large input sequences like the header model, this is of course an issue (the size of the input could be more than 1000 tokens in the worst cases) - but it's potentially also where it is interesting, because the "recursive" aspect of the RNN means the backpropagation can potentially impact the complete sequence.

It would indeed be interesting to compare the "traditional" global full-sequence network approach and a local sliding-window network, though I am not sure how to do it. It would require some review to see how other people approached it.

/cc @kermitt2 @lfoppiano

The TF 2.0 RNN has a stateful flag which might allow one to pass in the input sequence in consecutive windows (without overlap), although we would really only want it to be stateful within the same document. Keras also has a stateful LSTM example with a rolling window.
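For reference, a minimal sketch of what that could look like (with made-up parameters and a plain unidirectional LSTM rather than the actual delft BidLSTM-CRF architecture): one document is fed in consecutive, non-overlapping windows and the state is reset between documents.

```python
# Sketch only, not the delft implementation: a stateful Keras LSTM fed with
# consecutive windows of the same document; window_size, embedding_dim and
# n_labels are illustrative assumptions.
import numpy as np
import tensorflow as tf

window_size = 100
embedding_dim = 300
n_labels = 10

model = tf.keras.Sequential([
    # stateful=True requires a fixed batch size; here one document at a time
    tf.keras.layers.LSTM(
        100, return_sequences=True, stateful=True,
        batch_input_shape=(1, window_size, embedding_dim)),
    tf.keras.layers.Dense(n_labels, activation="softmax"),
])

def predict_document(doc_embeddings):
    """Tag one document window by window, carrying the LSTM state across windows."""
    outputs = []
    for start in range(0, len(doc_embeddings), window_size):
        window = doc_embeddings[start:start + window_size]
        padded = np.zeros((1, window_size, embedding_dim), dtype="float32")
        padded[0, :len(window)] = window
        outputs.append(model.predict(padded, verbose=0)[0, :len(window)])
    model.reset_states()  # the state must not leak into the next document
    return np.concatenate(outputs, axis=0)
```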

I have now implemented something in my version: elifesciences/sciencebeam-trainer-delft#179

I tried the stateful version (which is still an option), but the training time increases by roughly eight times, so I never got to the end of it. I then used stateless windows instead. I haven't tested it in anger. It first seemed promising, but in that run the version without windows performed better, followed by the one with overlapping windows. I am guessing that is mainly because the median token length is around 1000 and a window size of 3000 would have seen most variations. It will be good to test it again with a shorter window size. The major contributor to the not-so-great end-to-end reference extraction appeared to be the line numbers.
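For clarity, a small sketch (with made-up parameter values, not the exact sciencebeam-trainer-delft code) of cutting a long sequence into stateless, optionally overlapping training windows:

```python
# Sketch only: split a long token/label sequence into stateless windows;
# a stride smaller than window_size produces overlapping windows.
def split_into_windows(tokens, labels, window_size=1000, stride=1000):
    if len(tokens) <= window_size:
        yield tokens, labels
        return
    for start in range(0, len(tokens), stride):
        end = start + window_size
        yield tokens[start:end], labels[start:end]
        if end >= len(tokens):
            break
```

Each window is then treated as an independent training example, so no state needs to be carried between batches.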

(I also later realised that the segmentation model isn't using tokens, so the embeddings won't make much sense)

Maybe we could imagine 2 segmentation models, one for the initial window and one for the follow-up window(s), with some sort of overlap.

About the segmentation model: it works at the line level, and it uses as lexical features the first two words of the line (I tried 1 word, 3 words, adding the last 1 or 2 words of the line... just the first 2 words appeared to be enough with the current limited training data). So the embeddings still make sense for these "full" lexical features. But it also means concatenating the embeddings of 2 tokens per line, which will negatively impact memory and window size.
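To illustrate the memory impact (this is an illustration, not the actual delft code, and `lookup_embedding` is a hypothetical stand-in for the real embedding lookup): the per-line input vector is twice the embedding dimension, since the embeddings of the first two words of the line are concatenated.

```python
# Illustration only: a line is represented by the concatenation of the
# embeddings of its first two words, i.e. a vector of size 2 * embedding_dim.
import numpy as np

embedding_dim = 300  # assumption

def lookup_embedding(word):
    # hypothetical placeholder; the real code would look up pre-trained embeddings
    rng = np.random.default_rng(abs(hash(word)) % (2 ** 32))
    return rng.standard_normal(embedding_dim).astype("float32")

def line_representation(line_tokens):
    first_two = (line_tokens + ["<pad>", "<pad>"])[:2]  # pad very short lines
    return np.concatenate([lookup_embedding(w) for w in first_two])  # shape (600,)
```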

Thank you for explaining that. I didn't realise that it was the first two tokens. In any case, I am now training a segmentation model without word embeddings at all to see whether characters + features might be enough.

It will be worth looking at the very long sequences to see whether that is actually good data. (I haven't done that yet.)

I suspect that sliding windows should come in more handy for the fulltext model, as I believe that is token-based like the header model (and should therefore have longer sequences)?

One example is 025544v1, where the PDF currently results in many tokens (14798). Those tokens seem to come from figures. But the model (trained on at most 3000) still seems to be okay (of course those characters themselves don't make much sense).

Perhaps interesting on the topic: I have now also implemented sliding windows at prediction / tagging time, as it was running out of memory for some documents (I am not using it for training because it's much slower).

Here is an evaluation of three options:

[image: evaluation chart comparing the three options]

This is over 200 validation documents.
The 2nd option / model still failed to convert one of the examples, which makes its results slightly unreliable.

All of the options use the same trained DL models for the segmentation, header, reference segmentation and citation models (with a max sequence length of 3000).

Something that this chart might show:

  • Just truncating at 3000 tokens impacts reference extraction (there seem to be enough documents over that threshold)
  • Sliding windows of 1000 have a similar performance to 3000 (with the above caveat)

I implemented the sliding windows by making the model stateful (in the same way as I had tried before for training). That can't be parallelised (but it would have run out of memory otherwise).

EDIT: I hadn't tested it properly, and have now realised that stateful is also very slow at prediction time. Instead I enabled this for stateless models (i.e. any model), with a short overlapping context window, by setting the input window stride to a lower value than the max sequence length. That seems to work reasonably well.
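Something along these lines (a sketch under assumptions, not the exact implementation): the document is split with a stride smaller than the max sequence length, and for overlapping positions the prediction from the window where the token sits furthest from a window edge is kept.

```python
# Sketch only: stateless prediction with overlapping context windows.
def predict_with_windows(predict_fn, tokens, max_sequence_length=1000, stride=800):
    """predict_fn maps a token list to a label list of equal length (assumed)."""
    n = len(tokens)
    if n <= max_sequence_length:
        return predict_fn(tokens)
    labels = [None] * n
    best_margin = [-1] * n  # distance to the nearest window edge seen so far
    for start in range(0, n, stride):
        end = min(start + max_sequence_length, n)
        window_labels = predict_fn(tokens[start:end])
        for i, label in enumerate(window_labels):
            pos = start + i
            margin = min(i, end - start - 1 - i)
            if margin > best_margin[pos]:
                best_margin[pos] = margin
                labels[pos] = label
        if end >= n:
            break
    return labels
```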