Language codes not following ISO standards in lang-id-voxlingua107-ecapa
FredHaa opened this issue · 2 comments
FredHaa commented
Describe the bug
In the tokenizer, Hebrew is given the language code 'iw' and Javanese is given 'jw'. These language codes are incorrect according to the ISO 639 standard.
'iw' has been obsoleted and replaced by 'he', and 'jw' was an error in the first version of the ISO 639 standard that was later corrected to 'jv'.
Expected behaviour
The correct language codes for Hebrew ('he') and Javanese ('jv') are returned.
To Reproduce
No response
Environment Details
No response
Relevant Log Output
No response
Additional Context
No response
asumagic commented
Seems like the problem originates from the VoxLingua107 dataset itself, though it can be corrected in the label_encoder.txt file used at inference.
Since changing this on existing models could break existing code, adding a notice to the relevant documentation would make sense.
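In the meantime, a caller-side workaround could be to remap the two codes after inference, leaving the published model files untouched. The snippet below is only a minimal sketch: `DEPRECATED_TO_ISO` and `normalize_lang_label` are hypothetical names, not part of SpeechBrain, and it assumes the predicted label is either a bare code (`iw`) or a code followed by a colon and a language name (`iw: Hebrew`).

```python
# Hypothetical helper, not part of SpeechBrain: remaps the two deprecated
# codes mentioned in this issue to their current ISO 639-1 equivalents.
DEPRECATED_TO_ISO = {
    "iw": "he",  # Hebrew
    "jw": "jv",  # Javanese
}


def normalize_lang_label(label: str) -> str:
    """Return the label with a deprecated language code replaced.

    Assumes the label is either a bare code ("iw") or a code followed by
    a colon and a language name ("iw: Hebrew"). Anything after the colon
    is kept unchanged.
    """
    code, sep, rest = label.partition(":")
    code = code.strip()
    fixed = DEPRECATED_TO_ISO.get(code, code)
    return f"{fixed}{sep}{rest}"


if __name__ == "__main__":
    for predicted in ["iw", "jw: Javanese", "en"]:
        print(predicted, "->", normalize_lang_label(predicted))
```

Doing the remapping on the caller side keeps existing code that relies on 'iw' and 'jw' working, which matches the backward-compatibility concern above.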