bigcode-project/bigcode-encoder

Pooling question

milyenpabo opened this issue · 1 comment

I'm running some tests with StarEncoder, and I'm using your code as a starting point. When returning an embedding, you pool the input token embeddings into a single vector here:

def pooling(x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:

As I read the code, you simply pick the last valid (non-masked) token's embedding as the pooled embedding vector for the entire sequence. If I understand correctly, this should be the vector corresponding to the <sep> separator token.
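In other words, something like the following sketch (a hypothetical helper of my own, not the repo's actual implementation; it assumes right-padding with a contiguous mask):

import torch

def last_token_pooling(x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # x:    (batch, seq_len, hidden) token embeddings
    # mask: (batch, seq_len) attention mask, 1 for valid tokens, 0 for padding
    last_idx = mask.sum(dim=1).long() - 1              # index of last valid token per sequence
    batch_idx = torch.arange(x.size(0), device=x.device)
    return x[batch_idx, last_idx]                      # (batch, hidden)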

Can you explain why you do this? Is this something similar to CLS-pooling from BERT? Do you think this leads to better results than other approaches (e.g., mean-pooling)?

Hello! Yes, you got it right: we take the output at the [SEP] token at the end of the input as the embedding. Besides that, I tried both the output at [CLS] and mean pooling without special tokens. The output at [SEP] was by far the best-performing approach in a code-to-code search task, so that's why it was kept. However, given a new task, I would try at least those three approaches and compare results.
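For comparison, a minimal mean-pooling sketch (my own illustration, not the repo's code; excluding special tokens would additionally require zeroing the [CLS]/[SEP] positions in the mask before averaging):

import torch

def mean_pooling(x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # x:    (batch, seq_len, hidden) token embeddings
    # mask: (batch, seq_len) attention mask, 1 for valid tokens, 0 for padding
    mask = mask.unsqueeze(-1).to(x.dtype)              # (batch, seq_len, 1)
    summed = (x * mask).sum(dim=1)                     # sum over valid positions
    counts = mask.sum(dim=1).clamp(min=1e-9)           # avoid division by zero
    return summed / counts                             # (batch, hidden)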