Latest enamdict files no longer have accented glosses
Closed this issue · 5 comments
A while back, many of the English glosses in enamdict were updated to use modern accented characters to represent long or accented vowels.
For example, 勘太朗 became Kantarō. This is still true if we look at the entry in the DB: https://www.edrdg.org/jmwsgi/entr.py?svc=jmdict&e=2243482
However, when downloading the enamdict file from http://ftp.edrdg.org/pub/Nihongo/enamdict.gz, all such characters are now rendered as plain ASCII characters (so Kantarō is rendered as Kantaro). As far as I can tell, this happened to all such characters.
This has been the case for at least 2 weeks but less than 3, based on the diff in my weekly dictionary update. (About 80,000 glosses are affected.)
I assume this isn't expected, but did something change recently about how that file is generated?
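For anyone wanting to spot this kind of regression in their own weekly update, here is a minimal sketch that counts lines containing macron vowels. The function names are hypothetical, and it assumes the legacy file is gzipped and EUC-JP encoded; adjust the encoding if your copy differs.

```python
import gzip

# Macron vowels commonly used for long vowels in romanized Japanese names.
MACRON_VOWELS = set("āēīōūĀĒĪŌŪ")


def count_macron_lines(lines):
    """Count the lines that contain at least one macron vowel."""
    return sum(1 for line in lines if MACRON_VOWELS & set(line))


def count_in_file(path, encoding="euc_jp"):
    """Count macron-bearing lines in a gzipped dictionary file.

    Assumption: the file is EUC-JP encoded. Undecodable bytes are
    replaced rather than raising, so the count is only approximate.
    """
    with gzip.open(path, "rt", encoding=encoding, errors="replace") as fh:
        return count_macron_lines(fh)
```

Comparing the count from `count_in_file("enamdict.gz")` against the previous week's value should make a sudden drop obvious.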
The change was made to cater for some Vietnamese diacritics; converting them to just regular alphabetics. It wasn't supposed to touch things like ō.
I'll need to investigate it - the same XSLT script is used for generating the edict editions and the ō are OK there.
It might be a day or two before it can be resolved.
Thanks! No problems even if it takes a few days!
I've switched back to the previous scripts, so from the next daily generation all those ō/ū names should return. The 10 or so Vietnamese names will be a bit mangled until we get a proper fix.
Correction to my earlier comment: the XSLT scripts that generate the edict and enamdict versions are different for the two dictionaries.
OK, I think it's all fixed, and from tomorrow's distribution, all the diacritics available via JIS coding will be in enamdict. Some, such as the more way-out Vietnamese ones, will just have the plain alphabetics. I think this can be closed now.
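The behaviour described above (keep a diacritic when the target coding can represent it, otherwise fall back to the plain letter) can be sketched roughly as follows. This is not the actual XSLT fix, only an illustration in Python; `fold_unencodable`, `strip_marks` and the `FALLBACKS` table are hypothetical names, and the `euc_jp` default is an assumption about what "JIS coding" means for the legacy files.

```python
import unicodedata

# Letters with no canonical decomposition need an explicit fallback.
# Hypothetical, deliberately minimal table.
FALLBACKS = {"đ": "d", "Đ": "D"}


def strip_marks(ch):
    """Return the base letter(s) of ch with combining marks removed."""
    if ch in FALLBACKS:
        return FALLBACKS[ch]
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def fold_unencodable(text, encoding="euc_jp"):
    """Keep characters the target encoding can represent; fold the rest.

    Characters that still cannot be encoded after folding are left as-is.
    """
    out = []
    # NFC first, so that a letter plus its mark is a single code point.
    for ch in unicodedata.normalize("NFC", text):
        try:
            ch.encode(encoding)
            out.append(ch)
        except UnicodeEncodeError:
            out.append(strip_marks(ch))
    return "".join(out)
```

With a narrower encoding the same function folds more aggressively: `fold_unencodable("Nguyễn Đức", "ascii")` gives plain alphabetics throughout, while characters the chosen encoding does cover pass through untouched.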
I should have mentioned that the great XSLT scripts that turn the JMdict/JMnedict files into the legacy edict/enamdict forms were developed by Jean-Luc Leger. Jean-Luc provided the updates that sorted out this issue.