Empty elements not copied to translated output
jelmervdl opened this issue · 3 comments
This bug is to track a missing feature in the HTML parsing/reconstruction.
Empty elements (both things like <img> and <u></u> without any text between them) are not properly copied over to the translated output.
Brief update on (not pushed) work in progress for anyone interested:
Currently only one of the "taints"[1] is associated with each source token, and that taint is then transferred to the target tokens. I'm testing whether it works better to associate every taint that occurs around a token with that token, and then transfer all of them. This works well if all (interesting) source tokens align with a target token, but has two problems:
- If a source token is not aligned with a target token, it could be that elements are lost
- If a source token is aligned with multiple target tokens, some elements could occur multiple times. For things like <u>...</u> that actually makes sense, but for <img/> much less so. So I would need to keep track of which empty elements have already been inserted, and then skip over them when they occur a second or third time.
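To make the two problems concrete, here is a minimal sketch of the "transfer all taints" idea, written in Python for brevity. It is not the actual implementation: `Tag`, `transfer_tags`, and the one-source-index-per-target-token alignment format are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    """Hypothetical stand-in for a taint: one element instance in the source."""
    ident: int   # unique per element instance
    name: str    # e.g. "u" or "img"
    empty: bool  # True for elements without text inside, e.g. <img> or <u></u>


def transfer_tags(source_tags, alignment, n_target):
    """Copy every tag around a source token to each aligned target token.

    source_tags[i] lists the tags around source token i.
    alignment[j] is the source index aligned with target token j, or None.
    Empty elements are emitted only the first time they are encountered.
    """
    emitted_empty = set()
    target_tags = []
    for j in range(n_target):
        i = alignment[j]
        tags = []
        if i is not None:
            for tag in source_tags[i]:
                if tag.empty:
                    if tag.ident in emitted_empty:
                        continue  # already inserted once, skip the repeat
                    emitted_empty.add(tag.ident)
                tags.append(tag)
        target_tags.append(tags)
    return target_tags


u = Tag(1, "u", empty=False)
img = Tag(2, "img", empty=True)
source_tags = [[u, img], [u]]

# Source token 0 aligns with two target tokens: <u> repeats, <img> does not.
duplicated = transfer_tags(source_tags, [0, 0, 1], 3)

# Source token 0 aligns with nothing: the <img> is lost entirely.
lost = transfer_tags(source_tags, [1, None], 2)
```

The `emitted_empty` set handles the second problem (duplicates), but the `lost` case shows the first problem is still there: nothing in this pass can recover an element whose source token has no aligned target token.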
Alternative solution I'm trying (which sounds a lot simpler now that I write it down):
- Once a target sentence has been formed, go through it and figure out which empty elements are missing. Re-insert them next to the nearest known (transferred) element. The interesting part is that this takes multiple passes, since the order of elements in source and target might be shuffled around, and it needs a metric for figuring out which target token is the "closest" to the original position.
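A rough sketch of that repair step, again in Python and under stated assumptions: `reinsert_missing` is a hypothetical helper, elements are represented by plain string ids, and "closest" is taken to mean the target token whose aligned source index is nearest to the element's original source position (ties go to the earlier target token). This does a single pass only; the multi-pass handling of reordered elements is not covered.

```python
def reinsert_missing(empty_positions, alignment, target_tags):
    """Re-attach empty elements that did not make it into the target.

    empty_positions maps an element id to the source token index it sat at.
    alignment[j] is the source index aligned with target token j, or None.
    target_tags[j] lists the element ids already transferred to target token j.
    Returns a new list of per-token element ids; the input is not modified.
    """
    present = {ident for tags in target_tags for ident in tags}
    result = [list(tags) for tags in target_tags]
    # Only aligned target tokens have a known position in the source.
    aligned = [(j, i) for j, i in enumerate(alignment) if i is not None]

    for ident, src_index in sorted(empty_positions.items()):
        if ident in present or not aligned:
            continue  # already transferred, or nowhere sensible to put it
        # Assumed metric: smallest distance in source positions, then lowest j.
        j, _ = min(aligned, key=lambda pair: (abs(pair[1] - src_index), pair[0]))
        result[j].append(ident)
    return result


empty_positions = {"img1": 1, "br1": 3}
alignment = [2, 0, None, 3]            # target token 2 is unaligned
target_tags = [[], [], [], ["br1"]]    # "img1" was lost during transfer

repaired = reinsert_missing(empty_positions, alignment, target_tags)
```

Here `img1` originally sat at source token 1, which no target token aligns with; target tokens 0 and 1 are equally close (distance 1), so the tie-break puts it on target token 0. Whether source-index distance is the right metric is exactly the open question.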
Images to begin with 8
Images passed through tidy 8
Images at translated 0
I'm losing images on a body pass. Eagerly awaiting a fix here.
I currently have some corruption in large pages possibly due to unclosed tags through the tidy operation. Firefox with standardized HTML shouldn't have an issue I hope.