steveash/NETransliteration-COLING2018

Mistranslations of Korean/Japanese data


Hello @steveash!
I'm very interested in your project, so I checked your Korean/Japanese datasets.
(I'm a native speaker of Korean, and I'm doing research on Japanese transliteration in my master's course.)
I found that the datasets include the following type of mistranslation:
{(first name) (last name) <-> (last name) (first name)}
I refined the datasets with my own method, and I'd like to describe it.

  1. First, I trained an initial model using only the entries with a frequency greater than 2.
  2. Using that first model, I ran inference on all the data and measured the edit distance between the reference name and the top-1 hypothesis. I treated any entry whose edit distance exceeded a threshold as a mistranslation and excluded it.
  3. I trained a second model on the remaining data (see the sketch after this list).
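
For illustration, here is a minimal Python sketch of that three-step filtering loop. `load_pairs`, `train_model`, and `model.transliterate` are hypothetical placeholders for whatever loading/training/inference code is actually used, and the fixed threshold is just an example value:

```python
def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

THRESHOLD = 2  # example value; thresholds are discussed further below

# Step 1: train an initial model on the high-frequency pairs only.
pairs = load_pairs("ko.data")                # hypothetical: [(src, tgt, freq), ...]
trusted = [(s, t) for s, t, f in pairs if f > 2]
first_model = train_model(trusted)           # hypothetical trainer

# Step 2: drop pairs whose top-1 hypothesis is far from the reference.
def is_noisy(src, ref, model):
    hyp = model.transliterate(src, k=1)[0]   # hypothetical k-best inference API
    return edit_distance(hyp, ref) > THRESHOLD

clean = [(s, t) for s, t, _ in pairs if not is_noisy(s, t, first_model)]

# Step 3: retrain on the filtered data.
second_model = train_model(clean)
```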

With this method I was able to remove some of the mistranslated data. You could apply the same approach if you want to refine the datasets.

This is my project using your Korean dataset; check it out!
https://github.com/noowad/korean-person-name-transliterator

Very cool! Yes, since we only tackled token alignment via simple heuristics, there is some noise, as you saw. Do you have word error rates after excluding the filtered examples? How much improvement do you see? Also, which neural architecture did you use: Transformer or seq2seq?

@steveash
I only measured top-k accuracy (on the Korean dataset).
Here are the results:

| Model | Top-1 | Top-2 | Top-3 | Top-4 | Top-5 |
| --- | --- | --- | --- | --- | --- |
| First model | 705/2000 = 0.3525 | 929/2000 = 0.4645 | 1032/2000 = 0.516 | 1085/2000 = 0.5425 | 1138/2000 = 0.569 |
| Second model | 969/2000 = 0.4845 | 1231/2000 = 0.6155 | 1351/2000 = 0.6755 | 1414/2000 = 0.707 | 1451/2000 = 0.7255 |
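
In case it's useful, this is roughly how those numbers can be computed; `model.transliterate(src, k)` returning a k-best list is an assumed interface, not the actual API of either project:

```python
def topk_accuracy(model, test_pairs, max_k=5):
    """Count a pair as correct at k if the reference appears in the k-best list."""
    hits = [0] * max_k
    for src, ref in test_pairs:
        kbest = model.transliterate(src, k=max_k)  # assumed k-best inference API
        for k in range(1, max_k + 1):
            if ref in kbest[:k]:
                hits[k - 1] += 1
    n = len(test_pairs)
    for k, h in enumerate(hits, 1):
        print(f"top{k} Accuracy: {h}/{n} = {h / n:.4f}")
```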

And I used a different seq2seq model that adapts the architecture of Tacotron (Google's text-to-speech synthesis model), since I figured that text-to-speech synthesis and machine transliteration both deal with pronunciation.
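
(For readers unfamiliar with this model family: below is a minimal character-level GRU encoder-decoder with dot-product attention in PyTorch. It is only an illustrative sketch of the general seq2seq setup, not the Tacotron-based model from the linked repo, and all dimensions are placeholders.)

```python
import torch
import torch.nn as nn

class CharSeq2Seq(nn.Module):
    """Minimal character-level encoder-decoder with dot-product attention."""
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.bridge = nn.Linear(2 * dim, dim)   # merge the two encoder directions
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(2 * dim, tgt_vocab)

    def forward(self, src, tgt):
        enc, _ = self.encoder(self.src_emb(src))        # (B, S, 2*dim)
        enc = self.bridge(enc)                          # (B, S, dim)
        dec, _ = self.decoder(self.tgt_emb(tgt))        # (B, T, dim)
        # Dot-product attention over the encoder states.
        scores = torch.bmm(dec, enc.transpose(1, 2))    # (B, T, S)
        ctx = torch.bmm(scores.softmax(dim=-1), enc)    # (B, T, dim)
        return self.out(torch.cat([dec, ctx], dim=-1))  # (B, T, tgt_vocab)
```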

What edit distance threshold did you use for filtering? And just to clarify, you only filtered the training data, right? The test set used above is exactly the same for both models?

Did you do any error analysis to determine how many of the mistranslations flagged by edit distance were due to name-order flips, versus the task just being difficult?

I set the threshold quite roughly. As a word gets longer, its edit distance naturally tends to grow too, so I excluded entries where word length - edit distance <= 2 (2 being the threshold); see the snippet below. I think there could be a better method.
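
Concretely, the rule looks roughly like this (reusing the `edit_distance` helper sketched earlier):

```python
def is_mistranslation(ref, hyp, threshold=2):
    # Exclude a pair when len(ref) - edit_distance(hyp, ref) <= threshold,
    # i.e. the hypothesis disagrees with the reference in nearly every
    # position, which is what a flipped name order tends to look like.
    return len(ref) - edit_distance(hyp, ref) <= threshold
```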

I only filtered the training data; the validation and test data stayed the same. The test data is at https://github.com/noowad/korean-person-name-transliterator/blob/master/datas/korean_test.txt.

And I did not analyze the errors, but I think the performance gap between the first and second models is mostly due to the difference in the amount of training data. To measure the effect of noise removal itself, one would need to compare the second model against a model trained on all the data without the noisy entries removed.