Shinmera/plump

Quoting in strings

Closed this issue · 5 comments

Control chracters (tab, CR, NL, and probably others, too) in strings are not properly quoted:

(let ((plump:*tag-dispatchers* plump:*xml-tags*)) (plump:serialize (plump:parse "<foo bar=\"a&#x9;b&#xD;c&#xA;d\" />")))

results in tab, CR and LF emitted literally instead of encoded in the way there were parsed.

Attached patch fixes it for me. But I am not confident whether this is the proper way to fix it.

encode.txt

(ugh? I can't attch files with .patch extension?)

Plump makes no promises about round-tripping documents. As far as I understand the XML spec, it is perfectly valid to emit literal tabs and newlines into attribute values, too.

Your fix has multiple issues, such as using #\Return instead of #\Linefeed and also leading to anything in the document, even plaintext bodies, now being encoded instead of remaining plain.

Thanks for your comment, @Shinmera !

Round-tripping is not the goal here. But I'd assume not changing semantics would be great.

Although I am not an XML expert, it seems to me that current behaviour changes semantics of the content. If I understand this W3C recommendadion correctly, literal whitespace in attribute values should be passed as #\Space to the application. In Contrast, encoded attribute values should pass the encoded value to the application. IMHO, this is a significant change in semantics.
Even when there's no (syntactic) requirement to encode such characters, it seems to be at least "good practice" to preserve semantics. Having at least an option to encode them would be great.

Addressing your comments about the patch:

  • As I wrote, I am not confident about the patch. That's why I have put this "quick-n-dirty attempt for demonstration" here for discussion instead of creating a PR.
  • Well, this should be: &#xD; is #\Return and &#xA; is #\Linefeed.
  • Plain text is not encoded. Hunk # 1 of the patch disables encoding for plain text. But the patch fails to encode at least #\<, #\& and #\" in plaintext.

Err, sorry, in my initial comment I meant "using #\Newline instead of #\Linefeed". The newline character is implementation dependent and may not necessarily equal LF.

I'll have a think about this, but can't promise to patch it soon

I'll be happy to provide a PR once we have a common understanding about the issue and whether/how to fix it.