Investigate mapping token embeddings from source to target
mshannon-sil commented
A recently published paper introduced a strategy called "trans-tokenization", which "focuses on adapting a high-resource monolingual LLM to an unseen target language by initializing the token embeddings of the target language using a weighted average of semantically similar token embeddings from the source language." We should investigate whether this approach could improve model performance when we add newly trained tokens to NLLB.
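
To make the idea concrete, here is a minimal sketch of the weighted-average initialization described in the quote. It is not the paper's implementation and does not use the NLLB code path. The function name `init_target_embeddings`, the `alignments` structure (one dict per new token, mapping source token ids to similarity weights), and the mean-embedding fallback for tokens with no match are all assumptions made for illustration. How the similarity weights are obtained is left open here.

```python
import numpy as np


def init_target_embeddings(source_emb, alignments):
    """Initialize new token embeddings as weighted averages of source embeddings.

    source_emb: array of shape (source_vocab_size, dim).
    alignments: hypothetical input, one dict per new target token mapping
        source token id -> positive similarity weight.
    """
    source_emb = np.asarray(source_emb, dtype=np.float64)
    # Assumed fallback: tokens with no aligned source token get the mean embedding.
    fallback = source_emb.mean(axis=0)
    target = np.zeros((len(alignments), source_emb.shape[1]), dtype=np.float64)
    for t, weights in enumerate(alignments):
        if not weights:
            target[t] = fallback
            continue
        ids = np.fromiter(weights.keys(), dtype=np.int64)
        w = np.fromiter(weights.values(), dtype=np.float64)
        w = w / w.sum()  # normalize so the weights sum to 1
        target[t] = w @ source_emb[ids]  # weighted average of the matched rows
    return target


# Toy example: three source tokens, three new target tokens.
source = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
new_rows = init_target_embeddings(source, [{0: 1.0, 1: 1.0}, {2: 3.0}, {}])
print(new_rows)
```

In practice the resulting rows would be copied into the resized embedding matrix of the model in place of the default initialization for the added tokens, and then compared against that baseline after fine-tuning.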