ikawaha/kagome

romaji transliteration

eadmaster opened this issue · 11 comments

is it possible to use the cmdline tool for interactive romaji transliteration?
e.g.

$ kagome -romaji
ローマ字変換プログラム作ってみた。
Roma ji henkan program tsukutte mita.

(same as cutlet)

KEINOS commented

@eadmaster

AFAIK, kagome doesn't provide direct romaji transliteration.

The simplest solution would be to use the "pronunciation" element of the JSON output to get the katakana reading and map it to the romaji somehow.

$ echo "ローマ字変換プログラム作ってみた。" | kagome -json | jq -r '.[].pronunciation'
ローマジ
ヘンカン
プログラム
ツクッ
テ
ミ
タ
。

The disadvantage of this method is that the accuracy depends on the quality of the dictionary. Some words have no pronunciation field in the default dictionary.
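A minimal sketch of how the missing-pronunciation case could be handled when post-processing the JSON output in Go. Only the "pronunciation" key is confirmed by the jq example above; the "surface" and "reading" keys, and the idea that a missing value shows up as an empty string or "*", are assumptions here. The sample JSON is hand-written, not real kagome output.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// token mirrors only the fields this sketch needs. "surface" and "reading"
// are assumed key names; "pronunciation" is taken from the jq example.
type token struct {
	Surface       string `json:"surface"`
	Reading       string `json:"reading"`
	Pronunciation string `json:"pronunciation"`
}

// missing reports whether a field carries no usable value (assumed markers).
func missing(s string) bool { return s == "" || s == "*" }

// yomiOf prefers the pronunciation, then the reading, then the surface form.
func yomiOf(t token) string {
	if !missing(t.Pronunciation) {
		return t.Pronunciation
	}
	if !missing(t.Reading) {
		return t.Reading
	}
	return t.Surface
}

// yomis decodes a JSON array of tokens and returns one yomi per token.
func yomis(data []byte) ([]string, error) {
	var tokens []token
	if err := json.Unmarshal(data, &tokens); err != nil {
		return nil, err
	}
	out := make([]string, 0, len(tokens))
	for _, t := range tokens {
		out = append(out, yomiOf(t))
	}
	return out, nil
}

func main() {
	// Hand-written sample; the second token has no reading at all.
	sample := `[{"surface":"変換","reading":"ヘンカン","pronunciation":"ヘンカン"},{"surface":"kagome"}]`
	got, err := yomis([]byte(sample))
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(got)
}
```

Falling back to the surface form keeps unknown words in the output instead of silently dropping them, which is the same order the tokenizer-based examples later in this thread use.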

KEINOS commented

... map it to the romaji somehow.

Here's a simple example using gojp/kana, a Go alternative to cutlet.

package main

import (
	"fmt"
	"strings"
	"unicode"

	"github.com/gojp/kana"
)

func main() {
	input := `ローマジ
ヘンカン
プログラム
ツクッ
テ
ミ
タ
。
`
	lines := strings.Split(input, "\n")

	for _, line := range lines {
		line = strings.TrimSpace(line)

		yomi := strings.Map(func(r rune) rune {
			if unicode.IsLetter(r) {
				return r
			}
			return -1

		}, kana.KanaToRomaji(line))

		if yomi == "" {
			continue
		}

		fmt.Println(yomi)
	}

}
// Output:
// romaji
// henkan
// puroguramu
// tsuku
// te
// mi
// ta

@eadmaster

I've been interested in this issue for some time, ever since it was first proposed, and have tried to develop patterns that can be applied in practice.

This is my current solution. If this is OK with you, I would like to open a PR to add it to the "_examples" directory.

package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/gojp/kana"
	"github.com/ikawaha/kagome-dict/dict"
	"github.com/ikawaha/kagome-dict/ipa"
	"github.com/ikawaha/kagome/v2/tokenizer"
)

func main() {
	input := `
ローマ字変換プログラム作ってみた。
五街道のひとつである、東海道五十三次の品川宿などを変換してみると面白いかもしれない。
`
	usrDict := `
東海道五十三次,東海道 五十三 次,トウカイドウ ゴジュウサン ツギ,カスタム名詞
品川宿,品川 宿,シナガワ ジュク,カスタム名詞
`

	// Convert user dictionary string to tokenizer.Option
	usrDictOpt, err := newUserDictOpt(usrDict)
	if err != nil {
		log.Fatal(err)
	}

	// Create IPA-dict-based tokenizer with user dictionary
	tkn, err := tokenizer.New(ipa.Dict(), usrDictOpt, tokenizer.OmitBosEos())
	if err != nil {
		log.Fatal(err)
	}

	// Split input text by line
	lines := strings.Split(input, "\n")

	for _, line := range lines {
		if strings.TrimSpace(line) == "" {
			continue // ignore empty lines
		}

		tokens := tkn.Tokenize(line)
		chunks := []string{}
		tmpChunk := ""

		// Evaluate each token to retrieve the pronunciation or reading as a
		// slice of chunks to join them later. It is similar to Wakachi, but
		// with a bit more complex logic.
		for _, token := range tokens {
			if usrExtra := token.UserExtra(); usrExtra != nil {
				tmpChunk = strings.Join(usrExtra.Readings, " ")
			} else if p, ok := token.Pronunciation(); ok {
				tmpChunk = p
			} else if r, ok := token.Reading(); ok {
				tmpChunk = r // fallback to reading if pronunciation is not available
			} else {
				tmpChunk = token.Surface
			}

			tmpChunk = strings.TrimSpace(tmpChunk)
			//fmt.Println("Log:", tmpChunk, token.POS())

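			// Guard against a line that starts with a particle or symbol,
			// where there is no previous chunk to append to.
			if isPartOfPrev(token) && len(chunks) > 0 {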
				chunks[len(chunks)-1] += tmpChunk // Append to the previous chunk
			} else {
				chunks = append(chunks, tmpChunk) // Append to the slice of chunks
			}
		}

		fmt.Println(kana.KanaToRomaji(strings.Join(chunks, " ")))
	}
	// Output:
	// ro-maji henkan puroguramu tsukutte mita。
	// go kaido- no hitotsudearu、 toukaidou gojuusan tsugi no shinagawa juku nado wo henkan shite miruto omoshiroi kamo shirenai。
}

// isPartOfPrev returns true if the token prefers to be part of the previous chunk.
//
// e.g. tsuku te mi ta。-> tsukutte mita。
func isPartOfPrev(token tokenizer.Token) bool {
	// Not "助詞" "助動詞" or "記号"
	if !strings.ContainsAny(token.POS()[0], "助"+"記") {
		return false
	}

	switch token.POS()[1] {
	case "副助詞", "連体化", "格助詞":
		return false
	default:
		return true
	}
}

// newUserDictOpt creates a tokenizer.Option from a user dictionary string.
func newUserDictOpt(rec string) (tokenizer.Option, error) {
	usrDictRec, err := dict.NewUserDicRecords(strings.NewReader(rec))
	if err != nil {
		return nil, err
	}
	usrDict, err := usrDictRec.NewUserDict()
	if err != nil {
		return nil, err
	}
	return tokenizer.UserDict(usrDict), nil
}

my ideal solution would be adding a switch into the main binary.
btw if this requires adding another dependency maybe it's better to have it as an example.

my ideal solution would be adding a switch into the main binary.

You are probably looking for a feature such as:

$ echo "ローマ字変換プログラム作ってみた。" | kagome yomi 
ローマジ ヘンカン プログラム ツクッテ ミタ。

$ echo "ローマ字変換プログラム作ってみた。" | kagome yomi -katakana
ローマジ ヘンカン プログラム ツクッテ ミタ。

$ echo "ローマ字変換プログラム作ってみた。" | kagome yomi -hiragana
ろーまじ へんかん ぷろぐらむ つくって みた。

$ echo "ローマ字変換プログラム作ってみた。" | kagome yomi -romaji
ro-maji henkan puroguramu tsukutte mita。

$ echo "五街道のひとつである、東海道五十三次の品川宿などを変換してみると面白いかもしれない。" > text.txt

$ kagome yomi -file text.txt -userdict mydict.txt -romaji
go kaido- no hitotsudearu、 toukaidou gojuusan tsugi no shinagawa juku nado wo henkan shite miruto omoshiroi kamo shirenai。

If so, I agree. I also use Kagome heavily in a TTS reader, and I feel it would be useful for creating romaji subtitles as well.
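For what it's worth, the hypothetical "-hiragana" variant above is mostly a code-point shift, since the katakana and hiragana blocks in Unicode are laid out in parallel. A minimal stdlib-only sketch (the function name kataToHira is just for illustration, not part of kagome):

```go
package main

import (
	"fmt"
	"strings"
)

// kataToHira shifts katakana letters (U+30A1..U+30F6) down by 0x60 to their
// hiragana counterparts. Everything else, including the long-vowel mark "ー"
// and punctuation, passes through unchanged.
func kataToHira(s string) string {
	return strings.Map(func(r rune) rune {
		if r >= 0x30A1 && r <= 0x30F6 {
			return r - 0x60
		}
		return r
	}, s)
}

func main() {
	fmt.Println(kataToHira("ローマジ ヘンカン プログラム ツクッテ ミタ。"))
	// Output:
	// ろーまじ へんかん ぷろぐらむ つくって みた。
}
```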

btw if this requires adding another dependency maybe it's better to have it as an example.

The main problem is that there are too many variations in the romanization of Japanese, such as Nihon-shiki, Kunrei-shiki (ISO 3602), Traditional Hepburn, Modified Hepburn, and JSL romanization.

To support them, we would need to implement a custom user dictionary for romanization and define its format.

$ kagome yomi -file text.txt -userdict mydict.txt -romajidict my_hepburn.txt -romaji
go kaidō no hitotudearu、 toukaidō gojyūsan tugi no sinagawa juku nado o henkan site miruto omosiroi kamo sirenai。
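To make the idea of a swappable romanization table concrete, here is a rough sketch with in-memory maps standing in for whatever dictionary file format would eventually be defined. The table contents cover only a handful of kana, and the names (shared, hepburn, kunrei, romanize) are made up for illustration. Contracted sounds, the small "ッ", and long vowels are deliberately not handled.

```go
package main

import (
	"fmt"
	"strings"
)

// shared holds kana that both systems spell the same way (tiny subset).
var shared = map[rune]string{'ナ': "na", 'ガ': "ga", 'ワ': "wa", 'ギ': "gi"}

// hepburn and kunrei list only the kana where the two systems differ.
var (
	hepburn = map[rune]string{'シ': "shi", 'チ': "chi", 'ツ': "tsu", 'フ': "fu", 'ジ': "ji"}
	kunrei  = map[rune]string{'シ': "si", 'チ': "ti", 'ツ': "tu", 'フ': "hu", 'ジ': "zi"}
)

// romanize converts katakana rune by rune, preferring the system-specific
// table and passing unknown runes through unchanged.
func romanize(katakana string, system map[rune]string) string {
	var b strings.Builder
	for _, r := range katakana {
		if s, ok := system[r]; ok {
			b.WriteString(s)
		} else if s, ok := shared[r]; ok {
			b.WriteString(s)
		} else {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	for _, word := range []string{"シナガワ", "ツギ"} {
		fmt.Println(romanize(word, hepburn), romanize(word, kunrei))
	}
	// Output:
	// shinagawa sinagawa
	// tsugi tugi
}
```

Keeping the differences in a small override table means a user-supplied dictionary would only need to list the entries that deviate from a default.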

At this point, it would be ideal to add the example to the "_examples" directory first.

Next, open an issue to ask for support for a "yomi" subcommand for Katakana/Hiragana readings. Then implement the "-romaji" option for the "yomi" subcommand.

ok, no problem for me.

I've managed to write a tool for lyrics transliteration in Go thanks to the examples posted here. But currently I am getting better results with cutlet, which is also able to detect foreign words.

I've managed to write a tool for lyrics transliteration in go thanks to the examples posted here.

Nice. This example helps to get a concrete picture. I think fine-tuning the details is the difficult part.

But currently I am getting better results with cutlet, which is also able to detect foreign words.

Indeed. "Cutlet" is a cool Python application.

The "gojp/kana" package, on the other hand, has been inactive for more than five years. And I have to admit that it is inaccurate in some cases.

But Kagome is positioned the same way as MeCab: implementing transliteration itself is out of scope and not ideal.

Thus, we need to search for an alternative package, such as:

Can you provide us with some examples? Something like:

testData := []struct {
	input  string
	expect string
}{
	{input: "こぼれたままの流星群  一秒  一秒", expect: "Koboreta mama no ryu-sei gun ichi byo- ichi byo-"},
	{input: "流転 lights 消せないコナゴナ銀河。", expect: "Ruten LIGHTS kesenai konagona ginga."},
}

The more examples the better.

By the way, the expected values in the test data above are the current output of the example that I'm working on.

_example/romaji_transliteration (WIP)
package main

import (
	"fmt"
	"log"
	"strings"
	"unicode"

	"github.com/gojp/kana"
	"github.com/ikawaha/kagome-dict/dict"
	"github.com/ikawaha/kagome-dict/ipa"
	"github.com/ikawaha/kagome/v2/tokenizer"
	"golang.org/x/text/width"
)

func main() {
	// User input text
	input := `
ローマ字変換プログラム作ってみた。
五街道のひとつである、東海道五十三次の品川宿などを変換してみると面白いかもしれない。
こぼれたままの流星群  一秒  一秒
流転 lights 消せないコナゴナ銀河。
`

	// Built-in user dictionary
	usrDict := `
東海道五十三次,東海道 五十三 次,トウカイドウ ゴジュウサン ツギ,カスタム名詞
品川宿,品川 宿,シナガワ ジュク,カスタム名詞
`

	// Convert user dictionary string to tokenizer.Option
	usrDictOpt, err := newUserDictOpt(usrDict)
	if err != nil {
		log.Fatal(err)
	}

	// Create IPA-dict-based tokenizer with user dictionary
	tkn, err := tokenizer.New(ipa.Dict(), usrDictOpt, tokenizer.OmitBosEos())
	if err != nil {
		log.Fatal(err)
	}

	// Split input text by line
	lines := strings.Split(input, "\n")

	for _, line := range lines {
		// Get Yomi (pronunciation/reading) from the line in Katakana
		yomi := getYomi(tkn, line)
		if yomi == "" {
			continue // ignore empty lines
		}

		// Transliterate to Romaji.
		romaji := getRomantic(yomi)

		fmt.Println(romaji)
	}
	//
	// Output:
	// Ro-maji henkan puroguramu tsukutte mita.
	// Go kaido- no hitotsudearu, toukaidou gojuusan tsugi no shinagawa juku nado wo henkan shite miruto omoshiroi kamo shirenai.
	// Koboreta mama no ryu-sei gun ichi byo- ichi byo-
	// Ruten LIGHTS kesenai konagona ginga.
}

var conversionMap = map[rune]rune{
	'、': ',',
	'。': '.',
	'！': '!',
	'？': '?',
	'「': '"',
	'」': '"',
	'『': '"',
	'』': '"',
}

// getRomantic returns the Romaji transliteration of the input in Katakana.
func getRomantic(line string) (yomi string) {
	defer func() {
		// Finally, remove extra spaces
		if yomi != "" {
			yomi = strings.Join(strings.Fields(yomi), " ")
		}
	}()

	// In this example we use the github.com/gojp/kana package for Katakana
	// transliteration to Romaji. However, other packages are available.
	// Such as:
	// - github.com/robpike/nihongo
	// - github.com/kotaroooo0/gojaconv
	// - github.com/yosida95/romaji
	// - github.com/goark/krconv
	romaji := kana.KanaToRomaji(line)

	// Minimally normalize width and punctuation ('、' -> ',', '。' -> '.', etc.)
	romaji = convToHalfWidth(romaji)

	// Capitalize the first letter of each sentence
	sentences := strings.SplitAfter(romaji, ".")

	for index, sentence := range sentences {
		sentence = strings.TrimSpace(sentence)
		isFirst := true

		// Capitalize the first letter of each sentence
		sentence = strings.Map(func(r rune) rune {
			if isFirst {
				isFirst = false

				return unicode.ToUpper(r)
			}

			return r
		}, sentence)

		//sentences[index] = cases.Title(language.English).String(sentence)
		sentences[index] = sentence
	}

	return strings.Join(sentences, " ")
}

// convToHalfWidth converts full-width alpha-numeric characters to half-width
// characters according to the conversionMap.
func convToHalfWidth(input string) string {
	// Convert half-width katakana characters to full-width and full-width
	// alphanumeric characters to half-width.
	input = width.Fold.String(input)

	return strings.Map(func(r rune) rune {
		if unicode.Is(unicode.Han, r) {
			return r
		}

		if converted, ok := conversionMap[r]; ok {
			return converted
		}

		return r
	}, input)
}

// getYomi returns the pronunciation/reading (Yomi) of the input in Katakana
// using the given tokenizer.
func getYomi(tkn *tokenizer.Tokenizer, line string) string {
	line = strings.TrimSpace(line)
	if line == "" {
		return ""
	}

	if isASCII(line) {
		return line
	}

	tokens := tkn.Tokenize(line)
	chunks := []string{}
	tmpChunk := ""
	isPrevASCII := false

	// Evaluate each token to retrieve the pronunciation or reading as a
	// slice of chunks to join them later. It is similar to Wakachi, but
	// with a bit more complex logic.
	for _, token := range tokens {
		prevKey := len(chunks) - 1

		// Detect ASCII words
		if isASCII(token.Surface) {
			if isPrevASCII {
				chunks[prevKey] += token.Surface
			} else {
				chunks = append(chunks, token.Surface)
				isPrevASCII = true
			}

			continue
		} else if isPrevASCII {
			// Capitalize the previous chunk if it was all in ASCII
			chunks[prevKey] = strings.ToUpper(chunks[prevKey])
		}

		isPrevASCII = false

		// Retrieve the pronunciation/reading from the token in katakana
		if usrExtra := token.UserExtra(); usrExtra != nil {
			tmpChunk = strings.Join(usrExtra.Readings, " ")
		} else if p, ok := token.Pronunciation(); ok {
			tmpChunk = p
		} else if r, ok := token.Reading(); ok {
			tmpChunk = r // fallback to reading if pronunciation is not available
		} else {
			tmpChunk = token.Surface
		}

		tmpChunk = strings.TrimSpace(tmpChunk)
		//fmt.Println("Log:", tmpChunk, token.POS())

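		// Guard against a line that starts with a particle or symbol,
		// where prevKey is -1 and there is no previous chunk to append to.
		if isPartOfPrev(token) && prevKey >= 0 {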
			chunks[prevKey] += tmpChunk // Append to the previous chunk
		} else {
			chunks = append(chunks, tmpChunk) // Append to the slice of chunks
		}
	}

	return strings.Join(chunks, " ")
}

// isASCII returns true if the string is all in ASCII.
func isASCII(s string) bool {
	for i := 0; i < len(s); i++ {
		if s[i] > unicode.MaxASCII {
			return false
		}
	}

	return true
}

// isPartOfPrev returns true if the token prefers to be part of the previous chunk.
//
// e.g. tsuku te mi ta。--> tsukutte mita。
func isPartOfPrev(token tokenizer.Token) bool {
	// Not "助詞" "助動詞" nor "記号"
	if !strings.ContainsAny(token.POS()[0], "助"+"記") {
		return false
	}

	switch token.POS()[1] {
	// Keep these particle subtypes as separate chunks: adverbial particles,
	// adnominal markers, and case particles
	case "副助詞", "連体化", "格助詞":
		return false
	// Else, consider as part of the previous chunk
	default:
		return true
	}
}

// newUserDictOpt creates a tokenizer.Option from a user dictionary string.
func newUserDictOpt(rec string) (tokenizer.Option, error) {
	// Read user dictionary records from the string.
	usrDictRec, err := dict.NewUserDicRecords(strings.NewReader(rec))
	if err != nil {
		return nil, err
	}

	// Create a dict.UserDict from the records.
	usrDict, err := usrDictRec.NewUserDict()
	if err != nil {
		return nil, err
	}

	// Cast the UserDict to tokenizer.Option.
	return tokenizer.UserDict(usrDict), nil
}

Seems like a good example to me; the symbol conversion map in particular is much needed.

If you need more text examples, you can find a lot on this site: https://www.animelyrics.com/
Song lyrics often include English words that can be written in katakana, full-width Latin, or half-width Latin characters.
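The full-width Latin case can be normalized before tokenizing. The WIP example above uses golang.org/x/text/width for this; as a rough stdlib-only alternative, the full-width ASCII variants sit at a fixed offset from plain ASCII, so a sketch could look like this (foldFullWidth is a made-up name, and the sample strings use \u escapes only to keep the full-width characters unambiguous):

```go
package main

import (
	"fmt"
	"strings"
)

// foldFullWidth maps full-width ASCII variants (U+FF01..U+FF5E) to their
// half-width counterparts and the ideographic space (U+3000) to a plain
// space. Kana and kanji pass through unchanged.
func foldFullWidth(s string) string {
	return strings.Map(func(r rune) rune {
		switch {
		case r >= 0xFF01 && r <= 0xFF5E:
			return r - 0xFEE0
		case r == 0x3000:
			return ' '
		}
		return r
	}, s)
}

func main() {
	// "lights" written with full-width letters, surrounded by ideographic spaces.
	in := "流転\u3000\uff4c\uff49\uff47\uff48\uff54\uff53\u3000消せない"
	fmt.Println(foldFullWidth(in))
	// Output:
	// 流転 lights 消せない
}
```

Unlike width.Fold, this does not widen half-width katakana, so it only covers the Latin half of the problem.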

some examples:

Another use case I have for this is converting filenames with Japanese characters, often coming from audio CDs.

e.g.: https://vgmdb.net/album/35606

For this use the script should produce only basic ASCII strings (in the range 0-127).

edit: for easier reusability the script should act as a Unix filter: read one line from stdin and write one line to stdout.

@eadmaster

I am working on it, but it is becoming more and more like a real application (more complex) and is no longer suitable as a sample application.

The examples should be as simple as possible to illustrate the basic use of "kagome" as a library, right?

To close this issue, I am considering the following steps. What do you think?

  1. PR for a simple example to retrieve yomis (readings) in Katakana from the tokens.
  2. Create a new repo for this application. An application that specializes in transliterating Japanese song lyrics to Romaji using "kagome".

Sure, no problem for me. For symbol conversion I've found that there are some libraries that can do that, so you could remove the hardcoded table and use them instead.

From the cmdline:

  • iconv --to-code ASCII//TRANSLIT//IGNORE
  • uconv -x 'Any-Latin;Latin-ASCII' # from icu-devtools

uconv can also do kana->romaji transliteration, so it could be the library of choice to replace "gojp/kana":

$ echo "ろーまじ へんかん ぷろぐらむ つくって みた。" | uconv -x 'Any-Latin;Latin-ASCII'
romaji henkan puroguramu tsukutte mita.