snoyberg/xml

best practices for encoding control characters?

michaelglass opened this issue · 3 comments

When encoding terminal output that includes control characters (e.g. ESC) into XML, renderText produces output that is valid UTF-8 but not valid XML.

Is there some established way of handling this case? If not, and I were to make a PR to make renderText more resilient to this, what would be the preferred direction?

  • escaping the chars somehow
  • filtering them out
  • wrapping them in CDATA (is this valid? I'm not really an XML pro)
  • throwing / returning an either

k0ral commented

Although it's not advertised anywhere in its documentation (as far as I checked), xml-conduit follows version 1.0 of XML standard.
As XML 1.0 explicitly forbids most C0 control codes (only TAB, LF and CR are allowed), it looks like we won't get away without bending some laws.
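
To make the restriction concrete, here is a minimal sketch of the XML 1.0 `Char` production as a predicate. The name `isXml10Char` is hypothetical and not part of xml-conduit; it only illustrates which code points the XML 1.0 specification allows.

```haskell
-- Hypothetical helper, not an xml-conduit API.
-- Mirrors the XML 1.0 "Char" production:
--   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
isXml10Char :: Char -> Bool
isXml10Char c =
  c == '\t' || c == '\n' || c == '\r'        -- the only C0 controls allowed
    || (c >= '\x20' && c <= '\xD7FF')        -- excludes the surrogate block
    || (c >= '\xE000' && c <= '\xFFFD')      -- excludes U+FFFE and U+FFFF
    || (c >= '\x10000' && c <= '\x10FFFF')   -- supplementary planes
```

With this predicate, ESC (`'\ESC'`, U+001B) is rejected while TAB, LF and CR are accepted.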

If I had to make a suggestion, I would recommend:

Regarding your other proposals:

  • "filtering them out" tampers with the content of the XML document, so the result may no longer be meaningful to the user; I would advise against it, even as an opt-in feature;
  • "wrapping them in CDATA" is invalid according to XML 1.0 specification, unless I am mistaken;
  • "throwing / returning an either" would be my 2nd choice, as it would make the behavior more correct, albeit less useful to (some) users.
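
As a rough illustration of the "returning an either" option, the sketch below checks a string before rendering and reports the first offending character. Everything here (`InvalidChar`, `allowedInXml10`, `validateXml10`) is hypothetical and not part of xml-conduit; it also works on `String` for self-containment, whereas the library itself works with `Text`.

```haskell
-- Hypothetical sketch, not an xml-conduit API.

-- | Zero-based position and value of the first character that XML 1.0 forbids.
data InvalidChar = InvalidChar
  { invalidIndex :: Int
  , invalidChar  :: Char
  } deriving (Eq, Show)

-- | XML 1.0 "Char" production.
allowedInXml10 :: Char -> Bool
allowedInXml10 c =
  c == '\t' || c == '\n' || c == '\r'
    || (c >= '\x20' && c <= '\xD7FF')
    || (c >= '\xE000' && c <= '\xFFFD')
    || (c >= '\x10000' && c <= '\x10FFFF')

-- | Return the input unchanged when every character is allowed,
-- otherwise report the first forbidden character.
validateXml10 :: String -> Either InvalidChar String
validateXml10 s =
  case [InvalidChar i c | (i, c) <- zip [0 ..] s, not (allowedInXml10 c)] of
    []        -> Right s
    (bad : _) -> Left bad
```

For example, `validateXml10 "a\ESCb"` evaluates to `Left (InvalidChar 1 '\ESC')`, leaving it to the caller to decide how to handle the forbidden character.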

Should I close this issue or wait until I open a PR?

k0ral commented

I suggest we keep this issue open until a fix is merged.
FYI, you can link a PR to an issue, so that merging the former automatically closes the latter.