snoyberg/xml

Rendering of XML is lossy

tysonzero opened this issue · 15 comments

Specifically parseText def . renderText def is not equivalent to Right. This means you cannot render a Document and send it over the wire and then parse it on the other side and expect to get back what you put in.

For constrast Value from Aeson does not suffer from this issue, largely because it benefits from JSON being much simpler than XML.

One specific lossy-ness is from [NodeContent "x",NodeContent "y"] being rendered to just xy and thus re-parsed as [NodeContent "xy"].

I am not aware of a reasonable and spec-based way to render the two above structures differently, which means that if injectiveness is desired then the above must be unified to the same Haskell value.

One option for achieving that is replacing [Node] with something isomorphic to ([(Text, Node)], Text) where Node no longer has a NodeContent constructor. Another alternative is to have NodeContent contain Char instead of Text.

I understand if the complexity of the above is not desired, and that losing the injectivity of Document is an acceptable cost, but I wanted to bring this issue up for discussion.

k0ral commented

I hit the same issue when writing property tests to prove that parse . render = id (conceptually).
I resorted to manually merging Nodes, cf here.
I didn't have other cases of lossiness though, can you provide other examples ?

Would it be acceptable if xml-conduit provided an idempotent normalize :: ConduitT Event Event m () function that removes redundancy in the event stream such that parse . render . normalize = id ?

So I am looking into building some FromXML classes and associated helper functions to make parsing a variety of XML easier, basically imitating functions like .: from Aeson. It thus seems natural to also provide ToXML classes.

When thinking about this design I noticed that it would be quite common for FromXML to basically just grab out all the text from a node and parse that (think dates, numbers, strings themselves etc.), which then means the natural ToXML is to spit out a text node. However this would mean that it would be very easy to accidentally fail to round-trip if you had a few of these values next to each other without scoping them underneath some named node.

Based on the above purpose it would really be ideal if the types themselves ruled out accidentally writing a lossy ToXML instance. Choosing to require that the outputted type is an Element would technically work, but would not be much fun in practice. It would require giving some random element name to all the above serializations, and serializing a dict by just serializing its fields with the field name as the element name would involve excessive nesting / unclean XML.

The other case I found is that [NodeContent ""] is equivalent to just [] and roundtrips to the latter.

k0ral commented

Off the top of my head, I can't see a way to encode that constraint into the types without making the API significantly more complex. If you have a specific implementation in mind, feel free to submit a pull-request.

So I think the two ways I gave, either treating chars as nodes themselves instead of Text, or treating the contents of a node as a text separated and padded list of non-text nodes, are the two main options that enforce this invariant structurally.

Personally I think the latter makes a lot of sense. Often when you are looking at the children of a node you only really care about the non-text children (particularly when the text nodes are just whitespace), and this makes that easy (but without being lossy).

If neither of those are acceptable. Then another alternative is simply a smart constructor / newtype for the node list that concatenates subsequent text nodes and deletes empty text nodes.

k0ral commented

If I understand correctly your preferred solution, it would require re-defining at least:

  • Node to separate textual from non-textual nodes
  • Element to replace [Node] with type NodeCollection = (TextNode, [NonTextNode, TextNode]) (or any isomorphic type)
  • Document to use the replaced Element
  • Event to take all the above changes into account

Rather than deviating from xml-types, maybe we could push such changes into xml-types directly ? I'm not sure having xml-conduit maintain its own data types to represent XML is a great deal for its users, nor for its maintainers :) .

Also, should the Event type also maintain the invariant "not (2 consecutive nodes are textual nodes)" ? If yes, then how would you redefine Event ?

My 2 cents is that this is a non-problem, and any attempt to fix this will have massive API breakage and likely introduce some serious performance impacts. If handled incorrectly, it could even expose a major security flaw with OOM attacks via streaming large amounts of text as a single event.

Since TextNode is equivalent to just Text (or Content for unresolved), I was thinking we could make things a little simpler than the above:

  • Node loses the NodeContent constructor

  • Element replaces [Node] with something isomorphic to (Text, [Node, Text])

  • Document to use the replaced Element

And yeah this seems like it should be pushed into xml-types.

I don't thing Event needs to be changed, it already has implicit invariants such as End events matching Begin events.

My 2 cents is that this is a non-problem

I mean depends on your goals I suppose. For using XML for bidirectional serialization it seems pretty important. But if you are only expecting users to go in one direction and use other libraries/formats for bidirectionality then sure.

and any attempt to fix this will have massive API breakage and likely introduce some serious performance impacts.

I'm unsure why there would be any performance impact, but I could be wrong. But yeah there will be some API breakage, although on the plus side converting between the new and old format will be fairly trivial, but unfortunately still a breaking change.

If handled incorrectly, it could even expose a major security flaw with OOM attacks via streaming large amounts of text as a single event.

I'm extremely unsure how any of this would change, it's just a small change from [Node] to (Text, [Node, Text]) which is very near isomorphic to the original type. Unless you are saying that the current implementation intentionally breaks up Text into multiple chunked sections sometimes ([NodeContent "<lotsoftext>", NodeContent "<lotsmoretext>"]).

The only real problem I'm seeing implied is that it makes it difficult to test roundtripping in general. A single normalize function would address that case. I'm not aware of other use cases where there's a desire for ensuring a single possible in memory representation exists.

Unless you are saying that the current implementation intentionally breaks up Text into multiple chunked sections sometimes

This happens in the Event layer right now. I don't remember offhand if those separate chunks are retained in the Node layer.

it seems pretty important

I'd rather see explanations of why this is important than a claim of it. Most serialization formats do not have a guarantee of bidirectionality in all cases. Perhaps aeson does when going Value ==> ByteString ==> Value, but it certainly doesn't from ByteString ==> Value ==> ByteString, where it may rearrange keys and change whitespace.

I'd rather see explanations of why this is important than a claim of it. Most serialization formats do not have a guarantee of bidirectionality in all cases. Perhaps aeson does when going Value ==> ByteString ==> Value, but it certainly doesn't from ByteString ==> Value ==> ByteString, where it may rearrange keys and change whitespace.

I feel like this is the single most important aspect of serialization formats.

When you want to serialize something, the main thing you care about is being able to get the exact original value back when you deserialize.

The easiest way to ensure the above is to make sure that every transformation toward the final serialized value is injective/bidirectional.

For example in Aeson: MyType -> Value is injective, Value -> ByteString is injective, ByteString -> whatever socket/stream is injective. Therefore the whole thing is injective and bidirectional.

Anytime any layer in the process deviates from this it tends to cause problems. For example right now Maybe (Maybe a) -> Value is not injective, hence why this issue has been raised. I know that personally if I see a serialization issue the first thing I look for is Maybe nested with some other null-outputting type. I really wish I did not have to worry about that.

I can't help but feel that what is asked is something XML doesn't guarantee in the first place. After all if I'm not mistaken, NodeContent (ContentText in unresolved) are the Character Information Items from 2.6 of XML Infoset and per the same applications are free to group (or split: the logical unit in XML text is actually one character) as they want.

Specifically, how do you denote in the XML textual form the difference between, say, [NodeContent "x",NodeContent "yz"], [NodeContent "x", NodeContent "y", NodeContent "z"], and NodeContent "xyz" given that as far as XML is concerned they are one and the same?

I can't help but feel that what is asked is something XML doesn't guarantee in the first place.

XML doesn't take a stance one way or the other on whether or not any given type/serializer/deserializer combination is lossless.

In the most trivial case if your XML type is basically just the Haskell equivalent of:

record Xml : Set where
  field
    text : Text
    valid : ValidXml text

Then the serializer is trivially injective.

Specifically, how do you denote in the XML textual form the difference between, say, [NodeContent "x",NodeContent "yz"], [NodeContent "x", NodeContent "y", NodeContent "z"], and NodeContent "xyz" given that as far as XML is concerned they are one and the same?

Instead of trying to differentiate those values in XML, which I agree you cannot, you must un-differentiate them in Haskell.

I mentioned some ways to head in that direction above.

You're asking in the context of a serialisation in XML though. That means that if you can't differentiate them in XML, being able to differentiate them in Haskell at render time avails you little because you won't be able to differentiate them at parse time.

Also, in the particular case of successive NodeContent/ContentText, if I'm not mistaken and they represent XML's Character Information Items, as I understand the XML Infoset recommendation it actually takes the stance that one shouldn't count on a specific split.

I’m saying we should make them less differentiable in Haskell. The same XML should have the exact same Haskell representation. We should make it impossible to represent the same XML with two different Haskell values, which is totally possible.

The problem I see is that while trying for a given parse to always give the same result seems fair enough — and even there there's a wrinkle with what will the streaming parser do if it hits the end of a ByteString while parsing a Character Information Item — the other way around looks to me like it would make perfectly sane transformations take more work than needed. Like, say, remove all comments.

P.S.: Thinking of it, there's also that not only do two different Document can give the same XML but two different XML can give the same Document! XML not having a notion of attribute order, <t att1='v1' att2='v2'> and <t att2="v2" att1="v1"> give the same elementAttributes.