Final small tsu ッ not transliterated
nicolas-raoul opened this issue · 6 comments
I've come up with a workaround for this that merges two consecutive tokens into one. I'm going to send you a pull request for it too!
Btw, a small tsu at the end of the word may also indicate an exclamation mark.
Thanks :-)
Hello!
Thank you for building this tool!
I am running into instances of tokens that are either:
- not converted to romaji (they end up in the romaji string as kanji because the code sees them as "サ変接続" and inserts the surface token into the string buffer). Example: 誕生
- not inserted into the romaji string buffer at all. Example: もらった (only the final 'ta' gets inserted into the romaji string buffer)
The source for the jakaroma class has a variable that does not seem to get used in the end: lastTokenToMerge?
Do you have any suggestions or thoughts? I'm a beginner with Java so I'm not sure how much I can contribute, but if you point me in the right direction, I'm happy to try and push things ahead a bit. For now I've taken the stopgap approach of creating an array list of exceptions to the "サ変接続" classification. Entries must be added manually as these occurrences arise, and they then get correctly converted and inserted into the romaji string buffer. Probably not the best way forward, but it makes me feel like I'm making some sort of progress each time there is a problem with it :)
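For reference, the exception-list workaround described above could look roughly like the sketch below. This is not jakaroma's actual code: the class, method, and parameter names are made up for illustration, and it assumes each token exposes its surface form, a part-of-speech detail string, and a katakana reading. Japanese text is written as Unicode escapes so the file compiles regardless of source encoding.

```java
import java.util.Set;

final class SahenExceptionSketch {
    // Part-of-speech label quoted in this thread (sa-hen setsuzoku),
    // written as Unicode escapes.
    static final String SAHEN = "\u30b5\u5909\u63a5\u7d9a";

    // Hypothetical, manually maintained exception list. The only entry
    // is the word "tanjou" (birth) from the report above.
    static final Set<String> FORCE_READING = Set.of("\u8a95\u751f");

    /**
     * Chooses which text to hand to the kana-to-romaji step.
     * Tokens classified as SAHEN are passed through as their surface
     * form (the behavior reported in this thread), unless the surface
     * is on the exception list, in which case the reading is used.
     */
    static String textToRomanize(String surface, String posDetail, String reading) {
        boolean passThrough = SAHEN.equals(posDetail)
                && !FORCE_READING.contains(surface);
        return passThrough ? surface : reading;
    }

    public static void main(String[] args) {
        String tanjou = "\u8a95\u751f";                    // kanji: tanjou
        String reading = "\u30bf\u30f3\u30b8\u30e7\u30a6"; // katakana reading of tanjou
        // Prints the katakana reading, because the word is on the list.
        System.out.println(textToRomanize(tanjou, SAHEN, reading));
    }
}
```

The obvious drawback is the one already mentioned: every affected word has to be added by hand.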
For the small tsu issue, it looks like someone had started to implement a fix, but the code doesn't actually merge the token ending with small tsu with the next one (if I'm understanding the intent correctly). Was this 'lastTokenToMerge' variable supposed to be evaluated by another if clause that tells the next token to prepend it to itself (and, I imagine, double the first consonant)? I'm going to implement that here for myself but wanted to make sure I had understood your intent.
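A minimal sketch of the merge step being asked about, under some assumptions: it works on plain kana strings rather than the tokenizer's token objects, it reuses the name lastTokenToMerge only to mirror the variable mentioned above, and it assumes もらった is split into もらっ + た. It is one possible reading of the intent, not the original author's implementation. Kana are written as Unicode escapes so the file compiles regardless of source encoding.

```java
import java.util.ArrayList;
import java.util.List;

final class SokuonMergeSketch {
    private static final char SMALL_TSU_HIRAGANA = '\u3063';
    private static final char SMALL_TSU_KATAKANA = '\u30c3';

    static boolean endsWithSmallTsu(String s) {
        if (s.isEmpty()) {
            return false;
        }
        char last = s.charAt(s.length() - 1);
        return last == SMALL_TSU_HIRAGANA || last == SMALL_TSU_KATAKANA;
    }

    /**
     * Joins every token that ends in a small tsu with the token that
     * follows it, so the romanizer later sees the small tsu and the
     * consonant it doubles in the same string.
     */
    static List<String> mergeTrailingSmallTsu(List<String> tokens) {
        List<String> merged = new ArrayList<>();
        String lastTokenToMerge = ""; // held-back text, empty if none
        for (String token : tokens) {
            String current = lastTokenToMerge + token;
            if (endsWithSmallTsu(current)) {
                lastTokenToMerge = current; // wait for the next token
            } else {
                merged.add(current);
                lastTokenToMerge = "";
            }
        }
        // A small tsu at the very end has nothing to merge with; keep it.
        if (!lastTokenToMerge.isEmpty()) {
            merged.add(lastTokenToMerge);
        }
        return merged;
    }

    public static void main(String[] args) {
        // "mora" + small tsu, then "ta": expected to merge into one token.
        List<String> tokens = List.of("\u3082\u3089\u3063", "\u305f");
        System.out.println(mergeTrailingSmallTsu(tokens).size());
    }
}
```

Doubling the consonant is deliberately left to the romanization step, which then only has to handle a small tsu inside a single string.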
Thanks again for making this tool!
@malkazoid Thanks for the feedback! Unfortunately I don't remember much of the code and have other very busy projects, but I am looking forward to your pull requests :-)
I just downloaded the tool and tested it a bit. Indeed, the behavior is very broken.
- もらった returns Ta whereas it should return Moratta, which by the way means that the っ needs to look at the next letter and double it.
- 誕生 returns 誕生 whereas it should return Tanjo- or similar
- 誕生日 returns 誕生Bi whereas it should return Tanjo-bi or similar
- すごっ returns Sugo which is not bad, Sugo! would be good too I guess.
- ピッザ returns ピッザ whereas it should return Pizza
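To make the small tsu rule concrete, here is a rough sketch of the doubling logic, assuming the whole word reaches the romanizer as a single kana string. Nothing here is taken from jakaroma: the class name, the method name, and the tiny lookup table (only the kana needed for もらった, すごっ and ピッザ) are illustrative. It returns lowercase romaji and does not try to capitalize the way the tool does. Kana are written as Unicode escapes so the file compiles regardless of source encoding.

```java
import java.util.Map;

final class SokuonSketch {
    // Demo table only: the kana used by the examples in this thread.
    private static final Map<Character, String> KANA = Map.of(
            '\u3082', "mo",  // hiragana mo
            '\u3089', "ra",  // hiragana ra
            '\u305f', "ta",  // hiragana ta
            '\u3059', "su",  // hiragana su
            '\u3054', "go",  // hiragana go
            '\u30d4', "pi",  // katakana pi
            '\u30b6', "za"); // katakana za

    private static boolean isSmallTsu(char c) {
        return c == '\u3063' || c == '\u30c3'; // hiragana or katakana small tsu
    }

    /**
     * Converts kana to romaji. A small tsu doubles the first letter of
     * the following syllable; a small tsu at the end of the word is
     * rendered as '!', as suggested earlier in the thread.
     * Special cases such as Hepburn "tchi" are not handled.
     */
    static String toRomaji(String kana) {
        StringBuilder out = new StringBuilder();
        boolean pendingSmallTsu = false;
        for (int i = 0; i < kana.length(); i++) {
            char c = kana.charAt(i);
            if (isSmallTsu(c)) {
                pendingSmallTsu = true;
                continue;
            }
            String syllable = KANA.get(c);
            if (syllable == null) {
                out.append(c); // unknown character: pass through unchanged
                pendingSmallTsu = false;
                continue;
            }
            if (pendingSmallTsu) {
                out.append(syllable.charAt(0)); // double the consonant
                pendingSmallTsu = false;
            }
            out.append(syllable);
        }
        if (pendingSmallTsu) {
            out.append('!');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toRomaji("\u3082\u3089\u3063\u305f")); // moratta
        System.out.println(toRomaji("\u30d4\u30c3\u30b6"));       // pizza
        System.out.println(toRomaji("\u3059\u3054\u3063"));       // sugo!
    }
}
```

Whether a trailing small tsu should become "!" or simply be dropped is a design choice; the sketch picks the exclamation mark only because it was suggested above.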