Investigate mapping token embeddings from source to target

Question

Opened this issue 3 months ago · 0 comments

A recently published paper introduced a strategy called "trans-tokenization", which "focuses on adapting a high-resource monolingual LLM to an unseen target language by initializing the token embeddings of the target language using a weighted average of semantically similar token embeddings from the source language." We should investigate whether initializing new token embeddings this way could improve results when we add newly trained tokens to NLLB.
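
To make the idea concrete, here is a minimal sketch of the weighted-average initialization described in the quote. It is not the paper's implementation. The function name `init_target_embeddings`, the `alignments` dictionary format, and the fallback to the mean source embedding for unmapped tokens are all assumptions for illustration. How the similarity weights are obtained is left out of scope.

```python
import numpy as np


def init_target_embeddings(source_emb, alignments, num_target_tokens):
    """Initialize target token embeddings from source token embeddings.

    source_emb: float array of shape (num_source_tokens, dim).
    alignments: hypothetical format, {target_id: [(source_id, weight), ...]},
        where each weight scores how similar the source token is.
    num_target_tokens: number of rows in the returned matrix.
    """
    # Assumption: target tokens without any usable mapping start at the
    # mean source embedding rather than at a random vector.
    fallback = source_emb.mean(axis=0)
    target_emb = np.tile(fallback, (num_target_tokens, 1))

    for tgt_id, pairs in alignments.items():
        if not pairs:
            continue
        src_ids = np.array([src_id for src_id, _ in pairs])
        weights = np.array([w for _, w in pairs], dtype=source_emb.dtype)
        total = weights.sum()
        if total <= 0:
            continue
        # Normalize the weights so each row is a weighted average.
        target_emb[tgt_id] = (weights / total) @ source_emb[src_ids]

    return target_emb
```

For NLLB, the resulting rows would presumably be copied into the rows of the model's embedding matrix that correspond to the newly added tokens, in place of whatever initialization we use today. Comparing the two initializations on a held-out set would be one way to answer the question above.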