schierlm/BibleMultiConverter

Disabling use of XML entities for utf-8 characters

shadow-light opened this issue · 2 comments

Hi, thanks for this great converter. I'm converting USFM -> USX and noticed that it produces numeric XML character references instead of literal UTF-8 characters, even though the output encoding is UTF-8.

Example:

\v 1 Iwamɨ́ó xwɨ́árí tɨ́nɨ aŋɨ́na tɨ́nɨ imɨxɨnɨŋíná eŋo nánɨ —Omɨ arɨ́á wirane negɨ́ sɨŋwɨ́ tɨ́ tɨ́nɨ wɨnɨrane sɨŋwɨ́ wɨnaxɨ́dɨrane wé tɨ́nɨ ɨ́á xɨrɨrane eŋwáorɨnɨ. Xwɨyɨ́á dɨŋɨ́ nɨyɨmɨŋɨ́ imónɨŋɨ́pɨ nánɨ neaíwapɨyiŋorɨnɨ.

<verse number="1" style="v" sid="1JN 1:1"/>Iwam&#616;&#769;&#243; xw&#616;&#769;&#225;r&#237; t&#616;&#769;n&#616; a&#331;&#616;&#769;na t&#616;&#769;n&#616; im&#616;x&#616;n&#616;&#331;&#237;n&#225; e&#331;o n&#225;n&#616; &#8212;Om&#616; ar&#616;&#769;&#225; wirane neg&#616;&#769; s&#616;&#331;w&#616;&#769; t&#616;&#769; t&#616;&#769;n&#616; w&#616;n&#616;rane s&#616;&#331;w&#616;&#769; w&#616;nax&#616;&#769;d&#616;rane w&#233; t&#616;&#769;n&#616; &#616;&#769;&#225; x&#616;r&#616;rane e&#331;w&#225;or&#616;n&#616;. Xw&#616;y&#616;&#769;&#225; d&#616;&#331;&#616;&#769; n&#616;y&#616;m&#616;&#331;&#616;&#769; im&#243;n&#616;&#331;&#616;&#769;p&#616; n&#225;n&#616; nea&#237;wap&#616;yi&#331;or&#616;n&#616;.<verse eid="1JN 1:1"/>

source

This is fine parsing-wise, but it significantly increases file size, and I'm planning on serving the files over the network. Is there an easy way to disable this?
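To put a rough number on the size increase, here is a small Java sketch that compares the UTF-8 byte count of one word from the verse above ("tɨ́nɨ") with the byte count of its escaped form. The class and method names are made up for this illustration; nothing here is taken from the converter itself.

```java
import java.nio.charset.StandardCharsets;

public class EntitySizeSketch {
    // Rewrites every code point above U+007F as a decimal numeric character
    // reference, mirroring the shape of the output shown in the example.
    static String toNumericEntities(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (cp > 0x7F) {
                sb.append("&#").append(cp).append(';');
            } else {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // t, U+0268 (i with stroke), U+0301 (combining acute), n, U+0268
        String word = "t\u0268\u0301n\u0268";
        String escaped = toNumericEntities(word);
        System.out.println(escaped);
        System.out.println(utf8Length(word) + " bytes literal vs "
                + utf8Length(escaped) + " bytes escaped");
    }
}
```

For this word the literal form takes 8 bytes and the escaped form 20 bytes, since each two-byte character becomes a six-byte reference.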

This is interesting. We use a custom XMLWriter to override the significant-whitespace rules (which are somewhat odd in USX). I was not aware that this automatically switches the character escape handler to DumbEscapeHandler, resulting in everything above U+007F being escaped (the example above shows that even á, U+00E1, is written as `&#225;`).
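To illustrate the difference, here is a minimal Java sketch of two escaping strategies: one that escapes markup characters plus every non-ASCII character (the behaviour visible in the example output), and one that escapes only the markup characters and writes everything else as-is. The `EscapeHandler` interface and both handlers are hypothetical stand-ins written for this illustration; they are not the classes BibleMultiConverter or its XML library actually use, and how a handler would be plugged into the custom XMLWriter is still open.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

public class EscapeHandlerSketch {
    /** Hypothetical stand-in for an XML writer's character escape hook. */
    interface EscapeHandler {
        void escape(char[] ch, int start, int length, boolean isAttVal, Writer out)
                throws IOException;
    }

    /** Writes an escape for markup-significant characters; returns false otherwise. */
    static boolean writeMarkupEscape(char c, boolean isAttVal, Writer out) throws IOException {
        switch (c) {
            case '&': out.write("&amp;"); return true;
            case '<': out.write("&lt;"); return true;
            case '>': out.write("&gt;"); return true;
            case '"':
                if (isAttVal) { out.write("&quot;"); return true; }
                return false;
            default: return false;
        }
    }

    /** Escapes markup characters and every char above U+007F, one char at a time. */
    static final EscapeHandler ASCII_ONLY = (ch, start, length, isAttVal, out) -> {
        for (int i = start; i < start + length; i++) {
            char c = ch[i];
            if (writeMarkupEscape(c, isAttVal, out)) continue;
            if (c > 0x7F) {
                out.write("&#");
                out.write(Integer.toString(c));
                out.write(';');
            } else {
                out.write(c);
            }
        }
    };

    /** Escapes only markup characters; non-ASCII text is left to the output encoding. */
    static final EscapeHandler MINIMAL = (ch, start, length, isAttVal, out) -> {
        for (int i = start; i < start + length; i++) {
            if (!writeMarkupEscape(ch[i], isAttVal, out)) out.write(ch[i]);
        }
    };

    /** Convenience helper: runs a handler over element text and returns the result. */
    static String apply(EscapeHandler handler, String text) {
        StringWriter out = new StringWriter();
        try {
            handler.escape(text.toCharArray(), 0, text.length(), false, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "w\u00e9 <a> & \u2014";
        System.out.println(apply(ASCII_ONLY, text));
        System.out.println(apply(MINIMAL, text));
    }
}
```

The minimal variant is only safe when the writer's output encoding can represent every character, which is the case for UTF-8.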

It should be possible to get rid of this annoying behaviour, but I will have to take a closer look at how exactly.

@Rolf-Smit: I assume you did not notice that behaviour when you did the USX revamp in #39?