featreq: when warmstart-training, init weights of new chars from existing ones
bertsky opened this issue · 2 comments
I have the following feature request: Often one needs to finetune a model to add diacritics. Luckily, we can finetune with --warmstart ... --codec.keep_loaded False. In such cases, the actual witnesses of the diacritics are usually still sparse in the GT, so it would likely help if the weights of the additional characters / codepoints could be initialized from those of characters that look similar or function similarly. Perhaps as an option: --codec.init_new_from_old '{"à": "a", "ś": "s" ...}'
...
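To make the request concrete, here is a minimal sketch of what such an initialization could do. It assumes the output layer is a dense layer with a kernel of shape (hidden, n_classes) and a bias of shape (n_classes,), one column per codec entry, and that new classes are appended at the end. The names `extend_output_layer` and `init_from` are hypothetical and not part of Calamari; where the CTC blank sits is not handled here.

```python
import numpy as np


def extend_output_layer(kernel, bias, old_chars, new_chars, init_from,
                        noise=0.01, seed=0):
    """Append one output column per new character (illustrative only).

    kernel:    (hidden, n_old) dense weights of the loaded model
    bias:      (n_old,) bias of the loaded model
    old_chars: characters in the column order of the loaded model
    new_chars: characters to add to the codec
    init_from: mapping new char -> existing char to copy from (hypothetical)
    """
    rng = np.random.default_rng(seed)
    index = {c: i for i, c in enumerate(old_chars)}
    hidden = kernel.shape[0]
    new_cols, new_bias = [], []
    for c in new_chars:
        src = init_from.get(c)
        if src in index:
            # Copy the donor column; small noise lets the two classes diverge.
            col = kernel[:, index[src]] + rng.normal(0.0, noise, hidden)
            b = bias[index[src]]
        else:
            # No donor given: fall back to a small random init.
            col = rng.normal(0.0, noise, hidden)
            b = 0.0
        new_cols.append(col)
        new_bias.append(b)
    if not new_cols:
        return kernel, bias, list(old_chars)
    kernel = np.concatenate([kernel, np.stack(new_cols, axis=1)], axis=1)
    bias = np.concatenate([bias, np.asarray(new_bias, dtype=float)])
    return kernel, bias, list(old_chars) + list(new_chars)
```

Characters without a donor keep a plain random init, so the mapping only needs to cover the cases where a good donor exists.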
Great idea! Maybe we could integrate Unicode confusables (for example: https://util.unicode.org/UnicodeJsps/confusables.jsp?a=calamari&r=None – data files are available as well: http://www.unicode.org/reports/tr39/#Data_Files) to automatically choose similar characters from the existing codec? It would be interesting to see how this affects training time and accuracy!
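A rough sketch of how such a lookup might work, assuming the data file uses the semicolon-separated layout `source ; target ; type # comment` with hexadecimal code points. The sample lines are illustrative entries in that layout, not a verbatim excerpt, and `parse_confusables` and `suggest_donors` are hypothetical helper names.

```python
SAMPLE = [
    "# illustrative lines in the confusables.txt layout",
    "0430 ;\t0061 ;\tMA\t# CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A",
    "03B1 ;\t0061 ;\tMA\t# GREEK SMALL LETTER ALPHA -> LATIN SMALL LETTER A",
]


def parse_confusables(lines):
    """Map each source string to its prototype (target) string."""
    mapping = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if len(fields) < 2:
            continue
        source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        mapping[source] = target
    return mapping


def suggest_donors(new_chars, codec_chars, confusables):
    """Propose, per new character, a codec character with the same prototype."""
    def proto(c):
        return confusables.get(c, c)

    by_proto = {}
    for c in codec_chars:
        by_proto.setdefault(proto(c), c)  # first codec char wins
    return {c: by_proto[proto(c)] for c in new_chars if proto(c) in by_proto}


confusables = parse_confusables(SAMPLE)
# Cyrillic "a" (U+0430) gets Latin "a" as donor; U+00E0 has no entry here.
donors = suggest_donors(["\u0430", "\u00e0"], ["a", "b"], confusables)
```

Characters that share no prototype with any codec entry are simply left out of the result.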
> Maybe we could integrate unicode confusables (for example: https://util.unicode.org/UnicodeJsps/confusables.jsp?a=calamari&r=None – data files
Oh, what a nice resource!
> to automatically choose similar characters from the existing codec?
I would recommend against that. Those are purely visual confusions, and the characters involved have very different semantics. In contrast, what we usually want here are characters that differ only slightly, both visually and semantically. Notice how there are no diacritics in the Unicode confusables, for example. But if you init a Latin a from a Greek α or a Cyrillic а, then you give the system the wrong hints (making confusion of these pairs at inference more likely). I would say this is only warranted when the respective old characters can no longer appear together with the new characters (nor can any other characters of their respective charsets).
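For the diacritics case, one way to derive such close pairs automatically (not proposed in this thread, just a sketch) is Unicode canonical decomposition: a precomposed letter such as à decomposes into its base letter plus combining marks, and the base letter can serve as donor if the codec already contains it. `base_char_donors` is a hypothetical helper name.

```python
import unicodedata


def base_char_donors(new_chars, codec_chars):
    """Map each new precomposed character to its base letter, if in the codec."""
    codec = set(codec_chars)
    donors = {}
    for c in new_chars:
        decomposed = unicodedata.normalize("NFD", c)
        base, marks = decomposed[0], decomposed[1:]
        # Accept only "base letter + combining marks" with a known base.
        if marks and all(unicodedata.combining(m) for m in marks) and base in codec:
            donors[c] = base
    return donors


# U+00E0 (a with grave) and U+015B (s with acute) decompose to a and s;
# Greek alpha (U+03B1) and U+00F8 (o with stroke) have no decomposition.
donors = base_char_donors(["\u00e0", "\u015b", "\u03b1", "\u00f8"], ["a", "s", "o"])
```

Letters that Unicode does not decompose would still need a manual mapping.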
Another experiment that might be worthwhile beyond the pure initialization: regularize the dense output weights such that these confusables stay close to each other.
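A sketch of what such a regularizer could look like, assuming the same (hidden, n_classes) kernel layout and a squared-distance penalty between the output columns of each tied pair. The function name `tie_penalty`, the choice of penalty, and the `strength` parameter are assumptions for illustration; in a real training setup the term would be added to the loss inside the framework rather than computed by hand.

```python
import numpy as np


def tie_penalty(kernel, pairs, strength=1e-3):
    """Return (penalty, gradient) of strength * sum ||w_i - w_j||^2.

    kernel: (hidden, n_classes) dense output weights
    pairs:  iterable of (i, j) class indices that should stay close
    """
    grad = np.zeros_like(kernel, dtype=float)
    penalty = 0.0
    for i, j in pairs:
        diff = kernel[:, i] - kernel[:, j]
        penalty += strength * float(diff @ diff)
        # d/dw_i = 2 * strength * diff, d/dw_j = -2 * strength * diff
        grad[:, i] += 2.0 * strength * diff
        grad[:, j] -= 2.0 * strength * diff
    return penalty, grad
```

Adding the returned gradient to the task gradient pulls the tied columns toward each other without forcing them to be identical, so the data can still separate the two classes.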