tshatrov/ichiran

Katakana proper nouns are being split up

Closed · 1 comment

I'm using Ichiran because it is, by far, the best parser/tokenizer when it comes to finding reasonable word boundaries in Japanese. It is so awesome! I have noticed, however, that there is one area where it seems to underperform: katakana names. The behavior I see from other tokenizers (which I think is the generally desired behavior) is to produce the longest names/words possible. It appears to me that Ichiran might be finding the shortest. Here's an example.

https://ichi.moe/cl/qr/?q=%E3%81%8A%E3%81%AF%E3%82%88%E3%81%86%E3%80%81%E3%83%95%E3%83%AC%E3%83%83%E3%83%89&r=htr

I would expect フレッド in おはよう、フレッド to tokenize to フレッド, not フ and レッド.

Yeah, it doesn't parse proper nouns at all because they aren't in JMdict. There isn't a word フレッド, but there is a word レッド. There are all sorts of names around the world, so any string of katakana could potentially be someone's or something's name. And native Japanese names are even more complicated: for example, two people whose names are written with the same kanji could pronounce them differently.

There's currently some support for loading custom word data, in either XML JMdict format or CSV format (see the data/sources directory). So if you have a list of names you want segmented, it's technically possible to add it to the word database.
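
To illustrate why a missing dictionary entry leads to a split like フ + レッド, here is a toy dictionary-based segmenter in Python. It is only a sketch and is not Ichiran's actual algorithm: the `segment` function, its scoring rule (fewest unknown characters, then fewest tokens), and the tiny lexicon are all made up for this example.

```python
def segment(text, lexicon):
    """Toy segmenter (hypothetical, not Ichiran's algorithm).

    Picks the split with the fewest unknown characters, breaking ties
    by the fewest tokens. Unknown text falls back to single characters.
    """
    n = len(text)
    # best[i] holds (unknown_chars, token_count, tokens) for text[i:]
    best = [None] * (n + 1)
    best[n] = (0, 0, [])
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, n + 1):
            piece = text[i:j]
            unknown, count, rest = best[j]
            if piece in lexicon:
                candidates.append((unknown, count + 1, [piece] + rest))
            elif j == i + 1:
                # Not in the lexicon: emit one unknown character.
                candidates.append((unknown + 1, count + 1, [piece] + rest))
        best[i] = min(candidates, key=lambda c: (c[0], c[1]))
    return best[0][2]


# Made-up lexicon: it knows レッド but not the name フレッド.
lexicon = {"おはよう", "、", "レッド"}
without_name = segment("おはよう、フレッド", lexicon)

# Adding the name as a custom entry lets it win as a single token.
with_name = segment("おはよう、フレッド", lexicon | {"フレッド"})

print(without_name)
print(with_name)
```

In this sketch the first call splits the name because レッド is the only known word covering those characters, and the second call keeps フレッド whole once it is in the lexicon. That mirrors the idea of adding a list of names to the word database.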