Pooling question
milyenpabo opened this issue · 1 comment
I'm running some tests with StarEncoder, and I'm using your code as a starting point. When returning an embedding, you pool the input token embeddings into a single vector here:
Line 152 in 10ace39
As I read the code, you simply pick the last valid (non-masked) token's embedding as the pooled embedding for the entire sequence. If I understand correctly, this is the vector corresponding to the <sep> separator token.
Can you explain why you do this? Is this something similar to CLS-pooling from BERT? Do you think this leads to better results than other approaches (e.g., mean-pooling)?
Hello! Yes, you got it right: we take the output at the [SEP] token at the end of the input as the embedding. Besides that, I tried both the output at [CLS] and mean pooling without special tokens. The output at [SEP] was by far the best-performing approach in a code-to-code search task, so that's why it was kept. However, given a new task, I would try at least those three approaches and compare results.
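For anyone comparing these on a new task, here is a minimal NumPy sketch of the three pooling strategies discussed above. The array names and shapes are assumptions for illustration (`hidden_states` of shape `(batch, seq_len, dim)`, `attention_mask` of shape `(batch, seq_len)` with 1 for real tokens and 0 for padding); this simplified mean pooling averages over all non-masked positions, so excluding special tokens as mentioned above would require additionally zeroing the [CLS]/[SEP] positions in the mask.

```python
import numpy as np

def sep_pool(hidden_states, attention_mask):
    # Last valid (non-masked) token per sequence -- the [SEP] position
    # when [SEP] is the final real token.
    last_idx = attention_mask.sum(axis=1) - 1
    return hidden_states[np.arange(hidden_states.shape[0]), last_idx]

def cls_pool(hidden_states, attention_mask):
    # First token ([CLS]) embedding.
    return hidden_states[:, 0]

def mean_pool(hidden_states, attention_mask):
    # Average over non-masked tokens only; padding contributes nothing.
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    return (hidden_states * mask).sum(axis=1) / mask.sum(axis=1)

# Toy example: batch of 2 sequences, seq_len 3, dim 2; the second
# sequence has one padding position.
h = np.arange(12, dtype=float).reshape(2, 3, 2)
m = np.array([[1, 1, 1], [1, 1, 0]])

print(sep_pool(h, m)[1])   # embedding of token index 1 in sequence 1
print(mean_pool(h, m)[1])  # mean of the two real tokens in sequence 1
```

All three return a `(batch, dim)` array, so they can be swapped behind the same interface when benchmarking on a downstream task.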