Empty elements not copied to translated output

Question

jelmervdl opened this issue 3 years ago · 3 comments

This bug is to track a missing feature in the HTML parsing/reconstruction.

Empty elements (both things like <img> and <u></u> without any text between them) are not properly copied over to the translated output.

Answer 1 · 2021-12-09T13:55:04.000Z

Brief update on (not pushed) work in progress for anyone interested:

Currently, only one of the "taints"[1] is associated with each source token, and that taint is then transferred to the target tokens. I'm testing whether it works better to associate every taint that occurs around a token with that token, and then transfer all of them. This works well if every (interesting) source token aligns with a target token, but it has two problems:

If a source token is not aligned with any target token, its elements could be lost.
If a source token is aligned with multiple target tokens, some elements could occur multiple times. For things like <u>...</u> that actually makes sense, but for <img/> much less so. So I would need to keep track of which empty elements have already been inserted, and then skip them when they occur a second or third time.
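To make the two problems concrete, here is a minimal Python sketch of the "transfer all taints" idea. It is not the project's actual code: `Tag`, `transfer_tags`, the `uid` field, and the alignment format (a list of `(source_index, target_index)` pairs) are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    """Hypothetical stand-in for one "taint" (an HTML element around a token)."""
    name: str
    uid: int             # distinguishes two different <img> elements
    empty: bool = False  # True for elements without content, e.g. <img>


def transfer_tags(source_tags, alignment, target_len):
    """Copy every tag of each source token to the target tokens it aligns with.

    source_tags: one list of Tag per source token.
    alignment:   list of (source_index, target_index) pairs (assumed format).
    Returns (target_tags, lost), where lost holds tags that reached no target.
    """
    target_tags = [[] for _ in range(target_len)]
    seen_empty = set()  # empty elements that were already inserted once

    # Visit pairs in target order so the first aligned target gets the element.
    for src, tgt in sorted(alignment, key=lambda pair: (pair[1], pair[0])):
        for tag in source_tags[src]:
            if tag.empty:
                if tag in seen_empty:
                    continue  # problem 2: do not repeat <img> and friends
                seen_empty.add(tag)
            if tag not in target_tags[tgt]:
                target_tags[tgt].append(tag)

    # Problem 1: tags that only sat on unaligned source tokens are lost.
    placed = {tag for tags in target_tags for tag in tags}
    lost = []
    for tags in source_tags:
        for tag in tags:
            if tag not in placed and tag not in lost:
                lost.append(tag)
    return target_tags, lost
```

With one source token aligned to two target tokens, a `<u>` tag is copied to both while an `<img>` is emitted only once. An `<img>` attached to an unaligned source token ends up in `lost`, which is the case the alternative approach tries to handle.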

Alternative solution I'm trying (which sounds a lot simpler now that I write it down):

Once a target sentence has been formed, go through it and figure out which empty elements are missing. Re-insert them next to the nearest known (transferred) element. The interesting part is that this takes multiple passes, since the order of elements in source and target might be shuffled around, and it needs a metric for deciding which target token is "closest" to the original position.
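A rough Python sketch of that re-insertion step follows. It makes simplifying assumptions that the comment does not: it runs a single pass instead of several, and its "closeness" metric is plain distance in source token positions. The function name `reinsert_missing` and the dictionary-based inputs are hypothetical.

```python
def reinsert_missing(source_pos, target_pos, empty_ids):
    """Place missing empty elements at the target position of their nearest
    transferred neighbour.

    source_pos: element id -> token index in the source sentence.
    target_pos: element id -> token index in the target sentence, containing
                only the elements that survived the transfer.
    empty_ids:  ids of the empty elements (e.g. <img>) found in the source.
    Returns a new dict of target positions that includes re-inserted elements.
    """
    result = dict(target_pos)
    anchors = list(target_pos)  # elements whose target position is known

    for elem in sorted(empty_ids, key=lambda e: source_pos[e]):
        if elem in result:
            continue  # this element was transferred, nothing to repair
        if not anchors:
            result[elem] = 0  # no reference point: fall back to the start
            continue
        # Assumed metric: smallest distance in the source sentence,
        # with ties broken in favour of the earlier source element.
        nearest = min(
            anchors,
            key=lambda a: (abs(source_pos[a] - source_pos[elem]), source_pos[a]),
        )
        result[elem] = target_pos[nearest]
    return result
```

Because the anchor is chosen in source order but placed by its target position, the element follows its neighbour even when the translation reorders the sentence. A real implementation would likely need the extra passes mentioned above to handle neighbours that were themselves re-inserted.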

Answer 2 · 2021-12-17T10:00:19.000Z

Images to begin with 8
Images passed through tidy 8
Images at translated 0

I'm losing images on a body pass. Eagerly awaiting a fix here. 😄

I currently have some corruption in large pages, possibly due to unclosed tags through the tidy operation. Firefox with standardized HTML shouldn't have this issue, I hope.

Answer 3 · 2021-12-19T15:45:49.000Z

After applying the changes from #279 and #283: