TinoDidriksen/Transfuse

docx creates nested runs (<w:r><w:t><w:r><w:t>), which are then invisible in the opened document

Closed this issue · 3 comments

$ for y in yes no; do APERTIUM_TRANSFUSE=$y apertium -f docx -u -d . nob-nno /tmp/in.docx >/tmp/ut.$y.docx; done

in.docx

With transfuse, we get this bit:

      <w:r>
        <w:rPr/>
        <w:t xml:space="preserve">
          <w:r>
            <w:rPr/>
            <w:t xml:space="preserve">Dette er såkalla «Sideloaded Add-ins». Dei nyttar eit webview, i praksis ein nettlesar</w:t>
          </w:r>
        </w:t>
      </w:r>

which Word (and LibreOffice) don't show when opening the document; presumably nested runs aren't allowed in OOXML.
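A quick way to check an output file for this problem is to walk the document XML and flag any w:r that sits inside another w:r. The following is only a minimal sketch using the Python standard library; the function names and the two sample strings are made up for illustration, and it assumes the body text lives in word/document.xml inside the .docx archive.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML main namespace, bound to the "w:" prefix in document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
RUN = f"{{{W_NS}}}r"


def nested_runs(xml_text):
    """Return every w:r element that has another w:r as an ancestor."""
    root = ET.fromstring(xml_text)
    nested = []

    def walk(elem, inside_run):
        is_run = elem.tag == RUN
        if is_run and inside_run:
            nested.append(elem)
        for child in elem:
            walk(child, inside_run or is_run)

    walk(root, False)
    return nested


def nested_runs_in_docx(path):
    """Hypothetical helper: run the same check on a .docx file on disk."""
    with zipfile.ZipFile(path) as archive:
        return nested_runs(archive.read("word/document.xml"))


# Illustrative samples: the nested shape from this report, and a flat run.
NESTED = (
    f'<w:p xmlns:w="{W_NS}"><w:r><w:rPr/><w:t xml:space="preserve">'
    '<w:r><w:rPr/><w:t xml:space="preserve">Dette er</w:t></w:r>'
    "</w:t></w:r></w:p>"
)
FLAT = (
    f'<w:p xmlns:w="{W_NS}"><w:r><w:rPr/>'
    '<w:t xml:space="preserve">Dette er</w:t></w:r></w:p>'
)

if __name__ == "__main__":
    print(len(nested_runs(NESTED)), len(nested_runs(FLAT)))
```

Running nested_runs_in_docx on /tmp/ut.yes.docx and /tmp/ut.no.docx from the command above should make it easy to compare the two outputs.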

(Note: If I first save in.docx from LibreOffice, transfuse can handle it fine, because LO merges all the runs in the input paragraph on saving (removing the proofErr stuff).)

Strangely, the text isn't split in the input to the pipeline. The split point is at "nettleser som", and the input XML has

        <w:t xml:space="preserve">». De benytter et webview, i praksis en nettleser som er bygget inn i Office-programmene, for å vise innholdet sitt og utføre oppgavene sine. </w:t>

Here's what it looks like for the first step of the pipeline:

[transfuse:\/tmp\/transfuse-D7x3hD-8b_Y]

[tf-block:1-Zh6TUA]

Teknologi.[]

[tf-block:2-SKmwAw]

[[t:text:SyTAKg]]Dette er såkalte «Sideloaded Add-ins». De benytter et webview, i praksis en nettleser som er bygget inn i Office-programmene, for å vise innholdet sitt og utføre oppgavene sine.[[/]] .[]

Then right after the full pipeline, we have wordbound tags galore:

[transfuse:\/tmp\/transfuse-D7x3hD-8b_Y]

[tf-block:1-Zh6TUA]

Teknologi.[]

[tf-block:2-SKmwAw]

[[t:text:SyTAKg]]Dette[[/]] [[t:text:SyTAKg]]er[[/]] [[t:text:SyTAKg]]såkalla[[/]] [[t:text:SyTAKg]]«[[/]][[t:text:SyTAKg]]Sideloaded[[/]] [[t:text:SyTAKg]]Add-[[/]][[t:text:SyTAKg]]ins[[/]][[t:text:SyTAKg]]»[[/]][[t:text:SyTAKg]].[[/]] [[t:text:SyTAKg]]Dei[[/]] [[t:text:SyTAKg]]nyttar[[/]] [[t:text:SyTAKg]]eit[[/]] [[t:text:SyTAKg]]webview[[/]][[t:text:SyTAKg]],[[/]] [[t:text:SyTAKg]]i[[/]] [[t:text:SyTAKg]]praksis[[/]] [[t:text:SyTAKg]]ein[[/]] [[t:text:SyTAKg]]ne[[t:text:SyTAKg]]ttl[[/]]esar[[/]] [[t:text:SyTAKg]]som[[/]] [[t:text:SyTAKg]]er[[/]] [[t:text:SyTAKg]]bygd inn[[/]] [[t:text:SyTAKg]]i[[/]] [[t:text:SyTAKg]]Office-[[/]][[t:text:SyTAKg]]programma[[/]][[t:text:SyTAKg]],[[/]] [[t:text:SyTAKg]]for[[/]] [[t:text:SyTAKg]]å[[/]] [[t:text:SyTAKg]]visa[[/]] [[t:text:SyTAKg]]innhaldet[[/]] [[t:text:SyTAKg]]sitt[[/]] [[t:text:SyTAKg]]og[[/]] [[t:text:SyTAKg]]utføra[[/]] [[t:text:SyTAKg]]oppgåvene[[/]] [[t:text:SyTAKg]]sine[[/]][[t:text:SyTAKg]].[[/]] .[]

The second-to-last step, before postgenerator, looks like
[[t:text:SyTAKg]]ne~tt[[/]][[t:text:SyTAKg]]lesar[[/]]
at the split-point, while after postgenerator we get
[[t:text:SyTAKg]]ne[[t:text:SyTAKg]]ttl[[/]]esar[[/]] [[t:text:SyTAKg]]som[[/]]

So is the issue here that postgenerator should not be creating these nested word blanks, or that transfuse should somehow know how to deal with nested word blanks?
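To narrow down which pipeline step introduces the nesting, one could scan the output of each step for an opening [[...]] tag that appears before the previous one has been closed with [[/]]. This is only a rough sketch: the regular expression and the function name are assumptions for illustration, and it ignores backslash-escaped brackets, which a real stream parser would have to handle.

```python
import re

# Matches either a closing tag [[/]] or an opening tag such as [[t:text:SyTAKg]].
# Single-bracket blanks like [tf-block:2-SKmwAw] or [] are not matched.
TAG = re.compile(r"\[\[(/|[^\[\]]+)\]\]")


def nested_wordblanks(stream):
    """Return character offsets of opening tags seen while another is still open."""
    depth = 0
    offsets = []
    for match in TAG.finditer(stream):
        if match.group(1) == "/":
            depth = max(depth - 1, 0)
        else:
            if depth > 0:
                offsets.append(match.start())
            depth += 1
    return offsets


# The two snippets quoted above, before and after postgenerator.
BEFORE_POSTGEN = "[[t:text:SyTAKg]]ne~tt[[/]][[t:text:SyTAKg]]lesar[[/]]"
AFTER_POSTGEN = (
    "[[t:text:SyTAKg]]ne[[t:text:SyTAKg]]ttl[[/]]esar[[/]] "
    "[[t:text:SyTAKg]]som[[/]]"
)

if __name__ == "__main__":
    print(nested_wordblanks(BEFORE_POSTGEN))
    print(nested_wordblanks(AFTER_POSTGEN))
```

On the two snippets above, only the post-postgenerator one reports a nested opening tag, which matches the observation that the nesting appears in that step.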

@mr-martian does your apertium/lttoolbox#144 avoid nested word blanks in postgen?

The pipe may not yield nested structures, nor will Transfuse give it nested structures, so that looks like a bug in postgen.