UKPLab/sentence-transformers

Experience in Fine-Tuning a Cross-Encoder Model

nguyenvannghiem0312 opened this issue · 4 comments

I am currently working on a retrieval project, and I plan to use a bi-encoder + cross-encoder (retrieve-then-rerank) approach. However, I have run into an issue while training the cross-encoder model.

Specifically, when using only a pretrained bi-encoder that hasn't been fine-tuned on my domain-specific data, reranking its results with the cross-encoder improves performance by about 5%. However, after fine-tuning the bi-encoder, its performance improves significantly, but passing its results through the cross-encoder makes them substantially worse, by around 10%.

I tried fine-tuning the cross-encoder using a 1:3 ratio of positive to negative samples, with the negatives retrieved by the bi-encoder model. Could I ask everyone for advice on training a cross-encoder model?
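
For reference, a minimal retrieve-then-rerank sketch of this setup; the model names, corpus, and top_k here are placeholders, not the actual models and data in question:

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Placeholder models: swap in your own fine-tuned bi-encoder and cross-encoder.
bi_encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

corpus = ["passage one ...", "passage two ...", "passage three ..."]
corpus_emb = bi_encoder.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)

query = "example query"
query_emb = bi_encoder.encode(query, convert_to_tensor=True, normalize_embeddings=True)

# Stage 1: bi-encoder retrieval of the top-k candidates.
hits = util.semantic_search(query_emb, corpus_emb, top_k=3)[0]

# Stage 2: cross-encoder reranking of the retrieved candidates.
pairs = [(query, corpus[hit["corpus_id"]]) for hit in hits]
ce_scores = cross_encoder.predict(pairs)
for hit, score in sorted(zip(hits, ce_scores), key=lambda x: x[1], reverse=True):
    print(float(score), corpus[hit["corpus_id"]])
```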

In my experience, negative sampling is very important when training a cross-encoder. You said, "I tried fine-tuning the cross-encoder using a 1:3 ratio of positive to negative samples, with the negative samples being retrieved from the bi-encoder model." I used the same method, so let me share my experience.

I used BAAI/BGE-m3 as the bi-encoder for negative sampling, and the harder the negatives, the better the performance: sampling negatives whose similarity to the anchor fell between 0.5 and 0.8 gave higher performance than sampling in the 0.3-0.5 range. (In my case, the optimal similarity range for negative sampling was 0.5-0.8.) Easy negatives (random negatives) actually worsened the performance.

With this in mind, I recommend training the cross-encoder by sampling only hard negatives with a cosine similarity of 0.5-0.8 to the anchor. Additionally, performance was higher with 8 negatives than with 4, so it is worth increasing the number of negatives and trying again.
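
To make that concrete, here is a minimal sketch of mining negatives inside a similarity band; the 0.5-0.8 band and BAAI/BGE-m3 come from the comment above, while the queries, positives, and corpus are placeholder assumptions:

```python
from sentence_transformers import SentenceTransformer, util

# Bi-encoder used for negative mining (BAAI/BGE-m3, as in the comment above).
miner = SentenceTransformer("BAAI/bge-m3")

# Assumed inputs: training queries, one positive per query, and a candidate corpus.
queries = ["example query"]
positives = ["its matching passage"]
corpus = ["candidate passage 1", "candidate passage 2", "candidate passage 3"]

corpus_emb = miner.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)
query_emb = miner.encode(queries, convert_to_tensor=True, normalize_embeddings=True)

num_negatives = 8      # 8 worked better than 4 in the comment above
low, high = 0.5, 0.8   # keep only hard negatives inside this similarity band

hard_negatives = []
for i, query in enumerate(queries):
    sims = util.cos_sim(query_emb[i], corpus_emb)[0]
    # Rank candidates by similarity, keep those inside the band, skip the positive itself.
    ranked = sims.argsort(descending=True).tolist()
    negs = [corpus[j] for j in ranked
            if low <= sims[j].item() <= high and corpus[j] != positives[i]]
    hard_negatives.append(negs[:num_negatives])
```

Recent sentence-transformers releases also ship a mine_hard_negatives utility in sentence_transformers.util that can do this kind of score-range filtering at scale; check your installed version for the exact parameters.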

Very valuable insight, thanks @daegonYu.
I've also heard that some model authors add a small amount of "random negatives" beyond the hard negatives. Otherwise the model can get a bit "lost" when it comes to easy cases, as it was only trained on really hard cases.

I thought this was an interesting paper that also mentions the random negatives: https://huggingface.co/papers/2411.11767 (Section 6: Discussion > Negatives in Training)

Personally, I don't have too much experience with Cross-Encoders yet, although I expect to be working on CEs a lot more soon.

  • Tom Aarsen

Thank you @daegonYu and @tomaarsen for your responses,

Thanks to your advice, I've gotten some results from my experiments. When I used only hard negatives from the bi-encoder's top 3/5/7 results, my cross-encoder struggled to learn and gradually got worse. So I tried a new approach: selecting the top 3 negatives with scores below 0.85 and combining them with 4 random negatives (a 1:7 positive-to-negative ratio). I also ensured that each batch (a multiple of 8) preserved this ratio. Surprisingly, this was the first experiment that showed promise (MRR@10 improved from 54% to 66%).
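
For concreteness, a sketch of how such 1 positive : 3 hard : 4 random groups could be assembled and trained with the classic CrossEncoder fit API (newer sentence-transformers releases also provide a CrossEncoderTrainer); the data, base model name, and hyperparameters are placeholders, only the <0.85 cutoff and the 1:7 ratio come from the description above:

```python
import random
from torch.utils.data import DataLoader
from sentence_transformers import CrossEncoder, InputExample

# Assumed inputs per query: a positive, bi-encoder candidates with scores, and the corpus.
query = "example query"
positive = "its matching passage"
candidates = [("passage a", 0.91), ("passage b", 0.82), ("passage c", 0.79), ("passage d", 0.70)]
corpus = ["random passage 1", "random passage 2", "random passage 3", "random passage 4", "random passage 5"]

# Top 3 bi-encoder candidates scoring below 0.85, plus 4 random negatives (1:7 overall).
hard_negs = [p for p, s in candidates if s < 0.85][:3]
random_negs = random.sample([p for p in corpus if p != positive], 4)

samples = [InputExample(texts=[query, positive], label=1.0)]
samples += [InputExample(texts=[query, n], label=0.0) for n in hard_negs + random_negs]

# Batch size as a multiple of 8 keeps the 1:7 ratio intact within each batch; with many
# query groups you would shuffle at the group level rather than per sample.
loader = DataLoader(samples, shuffle=False, batch_size=8)
model = CrossEncoder("distilroberta-base", num_labels=1)  # placeholder ~100M-scale base model
model.fit(train_dataloader=loader, epochs=1, warmup_steps=100)
```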

However, the results were still not great, especially since the bi-encoder had already reached 68%. Honestly, I wasn’t expecting the cross-encoder alone to deliver great results. Then, I visualized the results from both the bi-encoder and the cross-encoder and realized they should be combined rather than used independently. So, I combined them using a 2:8 ratio (0.2 * ranker + 0.8 * embedding), and my MRR@10 improved to 75%.
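
A sketch of that weighted score fusion; the 0.2/0.8 weights are the ones reported above, while the min-max normalization is an assumption about how the two score scales are aligned:

```python
import numpy as np

def fuse_scores(bi_scores, ce_scores, w_ce=0.2, w_bi=0.8):
    """Combine bi-encoder and cross-encoder scores for the same candidate list."""
    def minmax(x):
        x = np.asarray(x, dtype=float)
        return (x - x.min()) / (x.max() - x.min() + 1e-9)
    # Normalize both score lists to [0, 1] so the weights are comparable (assumption).
    return w_ce * minmax(ce_scores) + w_bi * minmax(bi_scores)

# Example: rank candidates by the fused score.
bi_scores = [0.82, 0.75, 0.70]   # bi-encoder cosine similarities (placeholder values)
ce_scores = [1.2, 3.4, -0.5]     # cross-encoder logits (placeholder values)
order = np.argsort(-fuse_scores(bi_scores, ce_scores))
print(order)  # candidate indices, best first
```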

I think that in my domain, the cross-encoder struggles to perform well, or perhaps my model isn’t large enough (the cross-encoder I used has only 100M parameters). Alternatively, I might need a more advanced training technique. Either way, combining the scores has given me quite impressive results.

I'm happy to hear that performance has improved!

It is noteworthy that performance did not improve with the top-3/5/7 hard negatives alone, but did improve with the top 3 negatives scoring below 0.85 combined with 4 random negatives (a 1:7 ratio).

It might be a good idea to experiment with increasing the model size and changing the ratio of hard to random negatives.