jina-ai/late-chunking

Implementation of long late chunking

Closed this issue · 3 comments

I noticed that in the long late chunking implementation, line 147 calculates the split indices based on the macro chunk size and the overlap. Does this step need to account for adding the instruction and special tokens such as [CLS], [EOS], etc., to each macro chunk?
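For illustration, a minimal sketch of the tokenization assumption behind the question (bert-base-uncased is used here only as a stand-in tokenizer; this is not the repository's code): the special tokens are added once around the whole input, before any macro-chunk split is computed.

```python
from transformers import AutoTokenizer

# Stand-in tokenizer, only to show where the special tokens end up.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tok("a very long document ...", add_special_tokens=True)

# enc["input_ids"] starts with [CLS] and ends with [SEP]; the macro-chunk
# split indices are computed over this single sequence, so the special
# tokens are not repeated inside each macro chunk.
print(tok.convert_ids_to_tokens(enc["input_ids"][:2]),
      tok.convert_ids_to_tokens(enc["input_ids"][-1:]))
```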


The _embed_with_overlap function takes the token sequence as input and outputs the sequence of token embeddings. It therefore receives the full tokenization of the input, which includes all special tokens and instruction tokens and can be longer than the number of tokens the model can fit. In each iteration, the model is then passed a sliding window of the input tokens. This means, for example, that the model only receives the [CLS] token in the first iteration but not in the second one (as it is only added at the beginning of the sequence). After the function has calculated all token embeddings, the actual chunking is done outside of the model based on the annotations. Does this answer your question?
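To make this concrete, here is a rough sketch of such a sliding-window embedding loop. It is not the repository's actual _embed_with_overlap; the window size, overlap handling, and names are assumptions.

```python
import torch

def embed_with_overlap_sketch(model, input_ids: torch.Tensor,
                              window: int = 8192, overlap: int = 512) -> torch.Tensor:
    """Rough sketch: embed a token sequence longer than the model's context.

    input_ids is the full tokenization (shape 1 x n_tokens), with special
    tokens and instruction tokens included exactly once where the tokenizer
    put them. The model only ever sees a sliding window of it, so a token
    like [CLS] appears in the first window only.
    """
    n_tokens = input_ids.shape[1]
    step = window - overlap
    pieces = []
    start = 0
    while start < n_tokens:
        end = min(start + window, n_tokens)
        with torch.no_grad():
            out = model(input_ids=input_ids[:, start:end])
        token_emb = out.last_hidden_state[0]      # (window_length, dim)
        if start > 0:
            # The first `overlap` tokens were already embedded in the
            # previous window; their embeddings are not used.
            token_emb = token_emb[overlap:]
        pieces.append(token_emb)
        if end == n_tokens:
            break
        start += step
    return torch.cat(pieces, dim=0)               # (n_tokens, dim)
```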

This is exactly what I wanted to ask. Across the iterations, the model receives the complete instruction and the [CLS] token only in the first iteration, and not in the subsequent ones. Since the embedding model is stateless across iterations, the input it receives is incomplete in every iteration except the first. Is this understanding correct?

Yes, in this sense it is incomplete. Since for most models the tokenizer adds a [SEP] token at the end, the first window is incomplete in this sense as well. Also note that the token sequences passed to the model overlap, i.e., every macro chunk except the first one gets a certain number of tokens from the previous sequence so that this context is not missed. The embeddings of those additional tokens are not used in the end.
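For completeness, a hedged sketch of the step that then happens outside the model: pooling the token embeddings into one vector per chunk based on the annotated spans. Mean pooling and the (start, end) span format are assumptions, not necessarily how the repository represents the annotations.

```python
import torch

def pool_chunks(token_embeddings: torch.Tensor,
                chunk_spans: list[tuple[int, int]]) -> torch.Tensor:
    """Mean-pool the full token-embedding sequence into one vector per chunk.

    token_embeddings has shape (n_tokens, dim) and covers the whole input,
    with the duplicated overlap tokens already discarded, so the annotated
    (start, end) spans index into it directly.
    """
    return torch.stack([token_embeddings[start:end].mean(dim=0)
                        for start, end in chunk_spans])

# e.g. pool_chunks(embeddings, [(0, 120), (120, 250), (250, 400)])
```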