emk/subtitles-rs

In combined mode, map words to their translation

Closed this issue · 4 comments

teto commented

First of all thanks for the amazing tool. It's super nice to be able to combine both subtitles.

Ideally I would like to be able to select words from a subtitle and act on them, like https://animelon.com/video/594e558522477e0e27f1f035 does

I think you have a similar project with #22.
A shorter-term goal could be, in the .srt file, to map each word to its translation (in combined mode) via different colors/underlining where that makes sense.
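
For instance, a combined .srt cue could color matching word pairs using the informal styling tags that many players honor (a hypothetical cue; the cue number, timing, and colors are arbitrary):

    42
    00:01:02,000 --> 00:01:04,500
    <font color="#ffcc00">I</font> am your <font color="#66ccff">father</font>
    <font color="#ffcc00">Je</font> suis ton <font color="#66ccff">père</font>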

emk commented

This is a good idea!

To implement this would require some moderately sophisticated machine learning (basically the first step of a typical statistical translation engine) and a good-sized subtitle corpus. But the output would be really useful, I think. Someday, if I have an urge to do some machine learning, I might try this.

Thank you for the idea!

teto commented

To implement this would require some moderately sophisticated machine learning

I wonder about that. At least for a first approach, since we have both subtitles, it might be possible to just compare the two subtitles side by side and look for each word's translation in the other one.
Take "I am your father" / "Je suis ton père": a simple dictionary lookup could map "father" to "père".
I am not sure what's best, but subtitles-rs could accept a hook, say --hook program fr en, call the program with the two sentences, and let the program return the mapping (a minimal sketch of this idea follows the example below):

{"father": "pere",
"I": "Je"
"am": "suis"
"your": "ton"
}
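
As a rough sketch of this naive dictionary approach in Rust (the align_by_dictionary function and the hard-coded dictionary are hypothetical; a real hook program would load a proper dictionary):

    use std::collections::HashMap;

    // Naive per-word alignment: keep a source word only if its dictionary
    // translation actually appears in the other subtitle line.
    fn align_by_dictionary<'a>(
        source: &[&'a str],
        target: &[&'a str],
        dict: &HashMap<&str, &str>,
    ) -> HashMap<&'a str, &'a str> {
        let mut mapping = HashMap::new();
        for &word in source {
            if let Some(&translation) = dict.get(word.to_lowercase().as_str()) {
                // Case-insensitive search so "Je" still matches "je".
                if let Some(&hit) = target.iter().find(|t| t.eq_ignore_ascii_case(translation)) {
                    mapping.insert(word, hit);
                }
            }
        }
        mapping
    }

    fn main() {
        let dict: HashMap<&str, &str> =
            [("i", "je"), ("am", "suis"), ("your", "ton"), ("father", "père")]
                .into_iter()
                .collect();
        let english = ["I", "am", "your", "father"];
        let french = ["Je", "suis", "ton", "père"];
        // Prints {"I": "Je", "am": "suis", "your": "ton", "father": "père"}
        // (HashMap iteration order is arbitrary).
        println!("{:?}", align_by_dictionary(&english, &french, &dict));
    }
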
emk commented

So there are a bunch of ways to do this. The fundamental problem is that there's no easy one-to-one mapping: multiple words may map to single words, and vice versa, and words may be totally reordered.

Even worse, you get things like "J'ai pu m'en occuper, en fait" / "I could take care of it, in fact", where the first "en" means "of it/them/etc." and the second means "in" (roughly). So you need to break up the sentences as:

  • ["J'", "ai", "pu", "m'", "en", "occupé", "en", "fait"]
  • ["I", "could", "take", "care", "of", "it", "in", "fact"]

...and then you can map:

  • [4] to [4,5]
  • [6] to [6]

Also "m'" and "occupé" need to map to "take" and "care":

  • [3, 5] to [2, 3]

...and so on.
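
To make the index bookkeeping concrete, one way to represent such many-to-many links is as pairs of index lists, one per side. A minimal Rust sketch (the AlignmentLink type is hypothetical, not something in subtitles-rs):

    // One alignment link: indices into the source tokens mapped to indices
    // into the target tokens. Index lists on both sides make fused words
    // and reordering representable.
    struct AlignmentLink {
        source: Vec<usize>,
        target: Vec<usize>,
    }

    fn main() {
        let source = ["J'", "ai", "pu", "m'", "en", "occuper", "en", "fait"];
        let target = ["I", "could", "take", "care", "of", "it", "in", "fact"];

        // The links from the example above (0-indexed).
        let links = [
            AlignmentLink { source: vec![4], target: vec![4, 5] },    // "en" -> "of it"
            AlignmentLink { source: vec![6], target: vec![6] },       // "en" -> "in"
            AlignmentLink { source: vec![3, 5], target: vec![2, 3] }, // "m' occuper" -> "take care"
        ];

        for link in &links {
            let src: Vec<_> = link.source.iter().map(|&i| source[i]).collect();
            let tgt: Vec<_> = link.target.iter().map(|&i| target[i]).collect();
            println!("{:?} -> {:?}", src, tgt);
        }
    }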

Basically, this is a pretty well-understood problem. Linguee does this, as do subroutines inside things like Google Translate. The OPUS OpenSubtitles corpus could be used to train this. You might find some example algorithms on Coursera.
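
The classic entry point for learning such alignments is IBM Model 1, trained with expectation-maximization over a parallel corpus. Here is a toy sketch in Rust, with a few hard-coded sentence pairs standing in for data extracted from something like OPUS:

    use std::collections::HashMap;

    fn main() {
        // Tokenized (source, target) sentence pairs; toy stand-ins for a
        // real subtitle corpus.
        let corpus: Vec<(Vec<&str>, Vec<&str>)> = vec![
            (vec!["je", "suis", "ton", "père"], vec!["i", "am", "your", "father"]),
            (vec!["je", "suis"], vec!["i", "am"]),
            (vec!["ton", "livre"], vec!["your", "book"]),
        ];

        // t[(f, e)] approximates P(f | e). Any uniform initialization works,
        // because the E-step renormalizes per source word.
        let mut t: HashMap<(&str, &str), f64> = HashMap::new();
        for (fs, es) in &corpus {
            for &f in fs {
                for &e in es {
                    t.insert((f, e), 1.0);
                }
            }
        }

        for _ in 0..20 {
            let mut count: HashMap<(&str, &str), f64> = HashMap::new();
            let mut total: HashMap<&str, f64> = HashMap::new();

            // E-step: spread each source word's probability mass over the
            // target words it could align to, weighted by the current t.
            for (fs, es) in &corpus {
                for &f in fs {
                    let norm: f64 = es.iter().map(|&e| t[&(f, e)]).sum();
                    for &e in es {
                        let c = t[&(f, e)] / norm;
                        *count.entry((f, e)).or_insert(0.0) += c;
                        *total.entry(e).or_insert(0.0) += c;
                    }
                }
            }

            // M-step: re-estimate t from the expected counts.
            for (&(f, e), &c) in &count {
                t.insert((f, e), c / total[e]);
            }
        }

        // Words that consistently co-occur win out, so t("père" | "father")
        // ends up clearly dominant.
        println!("{:?}", t.get(&("père", "father")));
    }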

I think it's an excellent feature suggestion, and there are some ways to implement it very well. Sadly, I won't have time to work on them any time soon, because I'm more interested in vastly improving subtitle OCR, and I barely have time to work on that. :-(

But if somebody wanted to tackle this, I already have code that could be adapted to extract the necessary training data from OPUS, and I'd be happy to provide advice and review patches.

emk commented

I'm more likely to handle this via an LLM-powered "explain this word/phrase" feature. True semantic word-level alignment is pretty far out of scope for what substudy can handle internally.