jelmervdl/translatelocally-web-ext

HTML translation leaves untranslated element in page

jelmervdl opened this issue · 3 comments

Screenshot 2022-06-16 at 18 47 26

HTML input:
<a href="/wiki/Wikipedia:%C3%9Cber_Wikipedia" title="Wikipedia:Über Wikipedia" data-x-bergamot-id="0">Wikipedia</a> ist ein Projekt zum Aufbau einer <a href="/wiki/Enzyklop%C3%A4die" title="Enzyklopädie" data-x-bergamot-id="1">Enzyklopädie</a> aus <a href="/wiki/Freie_Inhalte" title="Freie Inhalte" data-x-bergamot-id="2">freien Inhalten</a>, zu denen <a href="/wiki/Wikipedia:Starthilfe" title="Wikipedia:Starthilfe" data-x-bergamot-id="3">du sehr gern beitragen kannst</a>. Seit März 2001 sind <a href="/wiki/Spezial:Statistik" title="Spezial:Statistik" data-x-bergamot-id="4">2.698.940</a>&nbsp;Artikel in deutscher Sprache entstanden.

HTML output from translator in this example (TranslateLocally):
<a href="/wiki/Wikipedia:%C3%9Cber_Wikipedia" title="Wikipedia:Über Wikipedia" data-x-bergamot-id="0">Wikipedia</a> is a project to set up an <a href="/wiki/Enzyklop%C3%A4die" title="Enzyklopädie" data-x-bergamot-id="1">encyclopedia</a> made up of <a href="/wiki/Freie_Inhalte" title="Freie Inhalte" data-x-bergamot-id="2">free</a> <a href="/wiki/Wikipedia:Starthilfe" title="Wikipedia:Starthilfe" data-x-bergamot-id="3"></a><a href="/wiki/Freie_Inhalte" title="Freie Inhalte" data-x-bergamot-id="2">content</a>, to which <a href="/wiki/Wikipedia:Starthilfe" title="Wikipedia:Starthilfe" data-x-bergamot-id="3">you are very happy to contribute</a>. Since March 2001<a href="/wiki/Spezial:Statistik" title="Spezial:Statistik" data-x-bergamot-id="4">, 2,698,940</a> articles have been produced in German.

Notice the element with x-bergamot-id="3" being repeated in the output. First time, it is empty. The InPlaceTranslation class did not empty the first instance, which is why you still see the text node in the page. So that's bug 1.

However, I also don't understand why the x-bergamot-id="3" element is repeated. bergamot-translator's HTML class does this for some elements to make sure they don't get lost in translation, but this is not an element where this would happen as it has text that will be in the translation in it. So that's maybe bug 2.

Soo… it's an alignment issue. It's not really a bug in bergamot-translator, that part works as intended:

Printing tags for each output token (tokens can be empty, e.g. end-of-sentence token)

...

Token: pedia
Tags:  <a href="/wiki/Enzyklop%C3%A4die" title="Enzyklopädie" data-x-bergamot-id="1">

Token:  made
Tags:

Token:  up
Tags:

Token:  of
Tags:

########## Here comes the culprit ##########

Token:  free
Tags:  <a href="/wiki/Freie_Inhalte" title="Freie Inhalte" data-x-bergamot-id="2">

Token:
Tags:  <a href="/wiki/Wikipedia:Starthilfe" title="Wikipedia:Starthilfe" data-x-bergamot-id="3">

Token: content
Tags:  <a href="/wiki/Freie_Inhalte" title="Freie Inhalte" data-x-bergamot-id="2">

########## 2 -> 3 (empty text) -> 2 gives us empty <a id=3></a> ##########

Token: ,
Tags:

Token:  to
Tags:

Token:  w
Tags:

Token: hich
Tags:

Token:  you
Tags:  <a href="/wiki/Wikipedia:Starthilfe" title="Wikipedia:Starthilfe" data-x-bergamot-id="3">

Token:  are
Tags:  <a href="/wiki/Wikipedia:Starthilfe" title="Wikipedia:Starthilfe" data-x-bergamot-id="3">

...

In the context of bergamot-translator: not ideal behaviour. Maybe it's useful to add some filtering on how empty stuff gets aligned, especially when it is the odd one out in a sequence.

In the context of InPageTranslation: Tricky case, since some elements are treated as opaque even if they have text in them (e.g. a bit of <script>code here</script>). One way to go about is would be to see whether the translated element also has children or not, and if not, remove the ones from the source element. But that won't work for something like <a>beep<img></a> (translated to <a><img></a>).

This was fixed with af3eff1

Related: jelmervdl/bergamot-translator@992c584 (possible improvement, but not necessary to fix this)