browsermt/bergamot-translator

Empty elements not copied to translated output

jelmervdl opened this issue · 3 comments

This bug is to track a missing feature in the HTML parsing/reconstruction.

Empty elements (both things like <img> and <u></u> without any text between them) are not properly copied over to the translated output.

Brief update on (not pushed) work in progress for anyone interested:

Currently only one of the "taints"[1] is associated with each source token, and that one is then transferred to the target tokens. I'm testing whether it works better to associate every taint that occurs around a token with that token, and then transfer the whole set. This works well if all (interesting) source tokens align with a target token, but has two problems:

  1. If a source token is not aligned with any target token, its elements could be lost.
  2. If a source token is aligned with multiple target tokens, some elements could occur multiple times. For things like <u>...</u> that actually makes sense, but for <img/> much less so. So I would need to keep track of which empty elements have already been inserted, and then skip over them when they occur a second or third time.
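
For anyone following along, the first approach can be sketched roughly like this in Python. This is not the project's C++ code: the name `transfer_taints`, the `EMPTY_ELEMENTS` set, and the data layout (a list of tag names per source token, plus one aligned source index per target token) are all assumptions made for illustration.

```python
# Hypothetical sketch, not bergamot-translator's actual implementation.
EMPTY_ELEMENTS = {"img", "br", "hr"}  # illustrative subset, not exhaustive


def transfer_taints(source_taints, alignment):
    """Copy every tag around a source token to its aligned target tokens.

    source_taints: list of tag-name lists, one per source token.
    alignment: one entry per target token, holding the index of the
        aligned source token, or None if the target token has none.
    Returns (target_taints, lost), where lost lists (source_index, tag)
    pairs belonging to source tokens that no target token aligns with.
    """
    seen_empty = set()  # (source_index, tag) pairs already inserted once
    target_taints = []
    for source_index in alignment:
        tags = []
        if source_index is not None:
            for tag in source_taints[source_index]:
                if tag in EMPTY_ELEMENTS:
                    key = (source_index, tag)
                    if key in seen_empty:
                        continue  # problem 2: don't emit <img/> twice
                    seen_empty.add(key)
                tags.append(tag)
        target_taints.append(tags)

    # Problem 1: tags on unaligned source tokens never reach the target.
    aligned = {s for s in alignment if s is not None}
    lost = [
        (s, tag)
        for s, tags in enumerate(source_taints)
        if s not in aligned
        for tag in tags
    ]
    return target_taints, lost


if __name__ == "__main__":
    source = [["u"], ["u", "img"], ["br"]]
    # Target tokens 0 and 1 both align with source token 1; source token 2
    # is aligned with nothing, so its <br> ends up in the lost list.
    print(transfer_taints(source, [1, 1, 0]))
```

In this sketch the repeated `u` is kept on both target tokens while `img` is emitted only once, and the `br` on the unaligned source token is reported as lost rather than silently dropped.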

Alternative solution I'm trying (which sounds a lot simpler now I write it down):

  1. Once a target sentence has been formed, go through it and figure out which empty elements are missing. Re-insert them next to the nearest known (transferred) element. The interesting part is that this takes multiple passes, since the order of elements in source and target might be shuffled around, and that it needs a metric for figuring out which target token is "closest" to the original position.
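
A rough Python sketch of that re-insertion step, again only as an illustration. It assumes the missing empty elements are already known, uses plain distance in source token positions as the "closest" metric, and does a single pass instead of the multiple passes described above. The name `reinsert_missing` and both argument shapes are hypothetical.

```python
# Hypothetical sketch, not bergamot-translator's actual implementation.
def reinsert_missing(missing_elements, source_to_target):
    """Attach each missing empty element to the nearest transferred token.

    missing_elements: list of (source_position, markup) pairs for empty
        elements that did not make it into the target.
    source_to_target: dict mapping the index of each source token that
        was transferred to the index of its target token.
    Returns a dict: target token index -> list of markup to insert there.
    """
    insertions = {}
    if not source_to_target:
        # Nothing was transferred, so fall back to the start of the target.
        if missing_elements:
            insertions[0] = [markup for _, markup in missing_elements]
        return insertions

    for position, markup in missing_elements:
        # Assumed metric: smallest distance in source token positions,
        # with ties going to the earlier source token.
        nearest = min(source_to_target, key=lambda s: (abs(s - position), s))
        insertions.setdefault(source_to_target[nearest], []).append(markup)
    return insertions


if __name__ == "__main__":
    missing = [(1, "<img src='a.png'>"), (3, "<br>")]
    # Source tokens 0, 2 and 4 were transferred, in shuffled target order.
    print(reinsert_missing(missing, {0: 2, 2: 0, 4: 1}))
```

Because the target order is shuffled in the example, the image that sat near source token 0 lands at target position 2, which is the kind of reordering a real distance metric would have to cope with.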

image

Images to begin with 8
Images passed through tidy 8
Images at translated 0

I'm losing images on a body pass. Eagerly awaiting a fix here. 😄

I currently see some corruption in large pages, possibly due to unclosed tags from the tidy operation. Firefox with standardized HTML shouldn't have this issue, I hope.

After applying the changes from #279 and #283:
image