Experience in Fine-Tuning a Cross-Encoder Model
nguyenvannghiem0312 opened this issue · 4 comments
I am currently working on a retrieval project, and I plan to use a bi-encoder + cross-encoder approach. However, I have run into an issue while training the cross-encoder model.
Specifically, with a pretrained bi-encoder that hasn’t been fine-tuned on my domain-specific data, reranking its results with the cross-encoder improves performance by about 5%. After fine-tuning the bi-encoder, its performance improves significantly, but passing its results through the cross-encoder now makes them substantially worse—by around 10%.
I tried fine-tuning the cross-encoder using a 1:3 ratio of positive to negative samples, with the negative samples being retrieved from the bi-encoder model. Could I ask everyone for advice on training a cross-encoder model?
In my experience, negative sampling is very important when training a cross-encoder. You said "I tried fine-tuning the cross-encoder using a 1:3 ratio of positive to negative samples, with the negative samples being retrieved from the bi-encoder model." I used the same method, so let me share my experience. I used BAAI/BGE-m3 as the bi-encoder for negative sampling, and the harder the negatives, the better the performance: sampling negatives whose cosine similarity to the anchor was between 0.5 and 0.8 gave higher performance than sampling in the 0.3-0.5 range. (In my case, the optimal similarity range for negative sampling was 0.5-0.8.) Easy negatives (random negatives) actually worsened the performance.

With this in mind, I recommend training the cross-encoder by sampling only hard negatives with a cosine similarity of 0.5-0.8 to the anchor. Additionally, performance was higher with 8 negatives than with 4, so I also recommend increasing the number of negatives.
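For reference, a minimal sketch of what this kind of mining could look like (the query/corpus variables are placeholders, and the 0.5-0.8 band is just the range that worked for me):

```python
from sentence_transformers import SentenceTransformer, util

# Bi-encoder used for mining; "BAAI/bge-m3" is the Hugging Face model id.
model = SentenceTransformer("BAAI/bge-m3")

queries = ["example training query"]                      # placeholder queries
corpus = ["candidate passage 1", "candidate passage 2"]   # placeholder passages

query_emb = model.encode(queries, convert_to_tensor=True, normalize_embeddings=True)
corpus_emb = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)

# Cosine similarity between every query and every passage.
cos_scores = util.cos_sim(query_emb, corpus_emb)

hard_negatives = []
for qi, query in enumerate(queries):
    scores = cos_scores[qi].tolist()
    # Keep only candidates in the "hard but likely not a false negative" band.
    # Known positives for this query should also be filtered out here.
    candidates = [(corpus[di], s) for di, s in enumerate(scores) if 0.5 <= s <= 0.8]
    candidates.sort(key=lambda x: x[1], reverse=True)          # hardest first
    hard_negatives.append([doc for doc, _ in candidates[:8]])  # up to 8 per query
```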
Very valuable insight, thanks @daegonYu.
I've also heard that some model authors add a small amount of "random negatives" beyond the hard negatives. Otherwise the model can get a bit "lost" when it comes to easy cases, as it was only trained on really hard cases.
I thought this was an interesting paper that also mentions the random negatives: https://huggingface.co/papers/2411.11767 (Section 6: Discussion > Negatives in Training)
Personally, I don't have too much experience with Cross-Encoders yet, although I expect to soon be working on CE's a lot more.
- Tom Aarsen
Thank you @daegonYu and @tomaarsen for your responses,
Thanks to your advice, I’ve gotten some results from my experiments. When I used only hard negatives from the bi-encoder’s top 3/5/7 results, my cross-encoder model struggled to learn and gradually deteriorated. So I tried a new approach: selecting the top 3 negatives with bi-encoder scores below 0.85 and combining them with 4 random negatives (a 1:7 positive-to-negative ratio). I also made sure each batch (a multiple of 8 examples) adhered to this ratio. Surprisingly, this was the first time the experiment showed promise (MRR@10 improved from 54% to 66%).
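For context, a rough sketch of how I form each group of 8, with placeholder names rather than my actual pipeline code:

```python
import random

def build_group(query, positive, ranked_negatives, corpus, max_score=0.85):
    """ranked_negatives: (passage, bi-encoder score) pairs, sorted by score descending."""
    # Top 3 bi-encoder negatives that score below the 0.85 threshold.
    hard = [doc for doc, score in ranked_negatives if score < max_score][:3]
    # 4 random negatives drawn from the rest of the corpus.
    pool = [doc for doc in corpus if doc != positive and doc not in hard]
    easy = random.sample(pool, k=4)
    # 1 positive + 7 negatives = 8 pairs, so the batch size stays a multiple of 8.
    return [(query, positive, 1.0)] + [(query, doc, 0.0) for doc in hard + easy]
```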
However, the results were still not great, especially since the bi-encoder had already reached 68%. Honestly, I wasn’t expecting the cross-encoder alone to deliver great results. Then, I visualized the results from both the bi-encoder and the cross-encoder and realized they should be combined rather than used independently. So, I combined them using a 2:8 ratio (0.2 * ranker + 0.8 * embedding), and my MRR@10 improved to 75%.
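In case anyone wants to reproduce the fusion, it is just a weighted sum of the two scores; a sketch (assuming both scores were first brought to a comparable scale, e.g. min-max normalized):

```python
def fuse(bi_score: float, ce_score: float, alpha: float = 0.8) -> float:
    # alpha weights the bi-encoder (embedding) score; 1 - alpha the cross-encoder (ranker).
    return alpha * bi_score + (1.0 - alpha) * ce_score

# Rerank candidates by the fused score (toy values for illustration).
candidates = [
    {"doc": "passage A", "bi": 0.82, "ce": 0.40},
    {"doc": "passage B", "bi": 0.75, "ce": 0.90},
]
candidates.sort(key=lambda c: fuse(c["bi"], c["ce"]), reverse=True)
```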
I think that in my domain, the cross-encoder struggles to perform well, or perhaps my model isn’t large enough (the cross-encoder I used has only 100M parameters). Alternatively, I might need a more advanced training technique. Either way, combining the scores has given me quite impressive results.
I'm happy to hear that performance has improved!
It is noteworthy that performance did not improve when using only the top 3/5/7 hard negatives, but did improve once you combined the top 3 negatives scoring below 0.85 with 4 random negatives (a 1:7 ratio).
It might be a good idea to experiment with increasing the model size and changing the ratio of hard to random negatives.