best practices for encoding control characters?
michaelglass opened this issue · 3 comments
When encoding terminal output that includes control characters (e.g. ESC) into XML, renderText produces valid UTF-8 that is invalid XML.
Is there an established way of handling this case? If not, and I were to open a PR to make renderText more resilient to this, what would be the preferred direction?
- escaping the chars somehow
- filtering them out
- wrapping them in CDATA (is this valid? I'm not really an XML pro)
- throwing / returning an either
Although it's not advertised anywhere in its documentation (as far as I checked), xml-conduit follows version 1.0 of the XML standard.
As XML 1.0 explicitly forbids most C0 control codes (only TAB, LF and CR are allowed), it looks like we won't get away without bending some laws.
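For reference, the restriction above can be written as a predicate over the Char production of the XML 1.0 specification. This is a minimal sketch; the name isXml10Char is hypothetical and is not part of xml-conduit.

```haskell
import Data.Char (ord)

-- | True when the character matches the XML 1.0 Char production:
--   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
-- Hypothetical helper for illustration; not an xml-conduit function.
isXml10Char :: Char -> Bool
isXml10Char c =
     n == 0x9 || n == 0xA || n == 0xD      -- TAB, LF, CR
  || (n >= 0x20    && n <= 0xD7FF)
  || (n >= 0xE000  && n <= 0xFFFD)
  || (n >= 0x10000 && n <= 0x10FFFF)
  where
    n = ord c
```

With this definition, isXml10Char '\ESC' is False while isXml10Char '\t' is True.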
If I had to make a suggestion, I would recommend:
- using character references to represent forbidden control characters (that's what XML 1.1 does, except for NUL);
- making this feature opt-in through an additional field in ParseSettings.
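The character-reference idea could look roughly like the sketch below. It works on plain String to stay dependency-free, and the name escapeC0 is hypothetical (it is not an xml-conduit function). Two caveats: NUL is passed through unchanged, since even XML 1.1 cannot represent it, and XML 1.0 also rejects character references to forbidden characters, so the output is only acceptable to lenient or XML 1.1 parsers.

```haskell
import Data.Char (ord)
import Numeric (showHex)

-- | Replace forbidden C0 control characters (#x1-#x1F except TAB, LF, CR)
-- with hexadecimal character references such as "&#x1b;".
-- Hypothetical helper for illustration; NUL is deliberately left untouched.
escapeC0 :: String -> String
escapeC0 = concatMap esc
  where
    esc c
      | isForbiddenC0 c = "&#x" ++ showHex (ord c) ";"
      | otherwise       = [c]

    isForbiddenC0 c =
      let n = ord c
      in n >= 0x1 && n <= 0x1F && n `notElem` [0x9, 0xA, 0xD]
```

For example, escapeC0 "a\ESC[0m" yields "a&#x1b;[0m". In a real renderer this step would have to run where the usual escaping of & cannot re-escape the generated references.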
Regarding your other proposals:
- "filtering them out" tampers with the content of the XML document, such that the result may no longer be meaningful to the user; I would advise against it, even as an opt-in feature;
- "wrapping them in CDATA" is invalid according to the XML 1.0 specification, unless I am mistaken;
- "throwing / returning an either" would be my 2nd choice, as it would make the behavior more correct, albeit less useful to (some) users.
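The "returning an either" option could be sketched as a validation pass that reports the first offending character instead of rendering invalid XML. The names InvalidChar and checkXml10 are hypothetical, and this is only an illustration of the idea, not a proposal for the actual xml-conduit API.

```haskell
import Data.Char (ord)

-- | Position (0-based) and value of the first character rejected by XML 1.0.
-- Hypothetical type for illustration.
data InvalidChar = InvalidChar
  { position  :: Int
  , offending :: Char
  } deriving (Eq, Show)

-- | XML 1.0 Char production, as in the specification.
isXml10Char :: Char -> Bool
isXml10Char c =
     n == 0x9 || n == 0xA || n == 0xD
  || (n >= 0x20    && n <= 0xD7FF)
  || (n >= 0xE000  && n <= 0xFFFD)
  || (n >= 0x10000 && n <= 0x10FFFF)
  where
    n = ord c

-- | Return the input unchanged when every character is allowed,
-- otherwise the first violation.
checkXml10 :: String -> Either InvalidChar String
checkXml10 s =
  case [InvalidChar i c | (i, c) <- zip [0 ..] s, not (isXml10Char c)] of
    (e : _) -> Left e
    []      -> Right s
```

Here checkXml10 "ok" gives Right "ok", and checkXml10 "a\ESC" gives Left (InvalidChar 1 '\ESC'). The caller then decides whether to escape, filter, or abort.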
Should I close this issue or wait until I open a PR?
I suggest we keep this issue open until a fix is merged.
FYI, you can link a PR to an issue, such that merging the former automatically closes the latter.