golang/go

encoding/xml: proposed fixes for namespaces

pdw-mb opened this issue · 10 comments

Issue #13400 lists a number of issues related to namespace handling in encoding/xml. It's been noted that this area needs a bit of a rethink. This issue documents a set of proposed fixes to address the problems currently seen. I've grouped the current set of bugs into 7 separate topics to be addressed:

1. Lack of control over prefixes

Encoder currently provides no mechanism for the user to specify namespace/prefix bindings. In theory, this shouldn't matter: documents using different prefixes for the same namespaces are equivalent. In practice, people often do care.

For namespaced attributes, Encoder generates namespace prefixes based on the last part of the URI, and for elements, it redeclares the default namespace as required. This results in documents that are technically correct, but not what the user wants, and the generated prefixes may be cumbersome.

This raises the question of how much control we should give over prefixes.

XML allows quite a lot of complexity: prefixes can be rebound to different namespace URIs within a document, and the same namespace URI can simultaneously be bound to multiple prefixes. There's
a very old post on xml-dev that make a plea to produce "sane" documents, that is, ones where a given prefix is only ever used to refer to a single namespace URI, and a given namespace URI is only ever represented by a single prefix:

http://lists.xml.org/archives/xml-dev/200204/msg00170.html

I suggest that it is sufficient that we support the creation of "sane" documents, noting that we should allow a single URI to be represented by both a prefix and the default namespace within a document (this is effectively what Encoder does currently).

I think that the current approach of using the default namespace for elements, and generated prefixes for attributes is a good default: generated prefixes may be ugly, so we should use them only where needed (i.e. attributes), but should provide a mechanism for users to specify their own.

Proposed approach:

  • Allow users to specify a map of "preferred prefixes" on Encoder, that obeys the "sane" constraints. e.g. Encoder.AddNamespaceBindings(map[string]string])
  • When marshaling a namespaced attribute, use a preferred prefix, if available, and a generated one if not.
  • When marshaling a namespaced element, use a preferred prefix if available, and the default namespace otherwise.

Notes/questions:

  • The code changes required for this are pretty simple, although it does mean maintaining a notion of "preferred" and "generated" prefixes.
  • We probably want to allow users to specify additional preferred prefixes during the marshaling process (see point 2). This raises a question of whether we want to enforce that the document is "sane" - a user could potentially specify different bindings for the same namespace in different subtrees, e.g.:
  <foo>
    <x:bar xmlns:x="blort" />
    <y:baz xmlns:y="blort" />
  </foo>
  • Should preferred prefixes be output as soon as they're defined, or only when used? The former should probably be the default, but we might want to provide an option to control this, as it would allow the user to chuck a big bucket of "preferred prefixes" at the encoder, without bloating the document with unused prefixes

Issues addressed:
#11496: Serializing XML with namespace prefix
#9519: support for XML namespace prefixes

I think #9519 is based on a misunderstanding of how the current system works, but it seems likely that the user actually wants control over prefix names. I'm not sure if the reporter realises that a prefix is meaningless (and illegal) without a namespace URI binding.

2. Inability to access/set namespace declaration (handling QName values)

Namespace bindings are sometimes used by element and attribute values. For example:

  <foo xmlns:a="bar">a:blort</foo>

In order to correctly understand "a:blort" you need to know the currently effective namespace bindings. The same problem exists in reverse when encoding: you need to make sure that necessary namespace declarations are in place in the document.

Proposed approach:

We need to allow Unmarshalers and Marshalers to obtain and insert namespace bindings respectively. This means:

  1. A method on Decoder to expose current namespace bindings (trivial - it's already present privately)
  2. A change to UnmarshalerXMLAttr, as this does't currently provide the decoder. The safe way to make this change would be to create a new interface (e.g. UnmarshalerXMLAttrWithDecoder).
  3. Provide a method for Marshalers to inject namespace bindings. I suggest doing this by providing a method on Encoder to obtain a prefix for a namespace (GetPrefix ?), which will then take care of declaring the namespace if it hasn't yet been used. If the user cares what prefix they get, they should provide a preferred prefix prior to making the call to obtain one.
  4. As a convenience, we should make XMLName a MarshalerXML/UnmarshalerXML

Issues addressed:
#12406: support QName values / expose namespace bindings

3. Specifying namespaces in tags is cumbersome

Currently namespaces for elements may only be specified by including the full namespace URI, e.g.:

  `xml:"http://www.example.com/some/namespace/v1 foo"`

Aside from being verbose and repetitive, it means URIs can't be changed at runtime. It's not uncommon to want to use the same struct for different namespaces, for example, where version number in the namespace has changed, or as per #12624, to cope with documents using a subtlely
wrong namespace.

Proposed solution:

Given the mechanism in (1) to allow the user to specify namespaces/prefix mappings, it makes it possible for a struct to unambiguously use prefixes to reference namespaces. The obvious notation
is QName notation:

  `xml:"nsv1:foo"`

Under this proposal it would be an error to use a prefix that hadn't been explicitly specified for the user (i.e. it won't use prefixes picked up from the document when decoding). Users might be surprised that the above wouldn't match the following document unless they'd
explicitly set the prefix "nsv1" on the Decoder:

   <nsv1:foo xmlns="...">bar</nsv1:foo>

but doing so would be inherently fragile, as it wouldn't work with the entirely equivalent:

  <foo xmlns="...">bar</foo>

Notes:

This proposal changes the behaviour of Encoder/Decoder for tags with a colon in them, which it's possible that existing code relies on. On the other hand the current behaviour of such tags is clearly a source of confusion and bugs and doesn't work for Decoding anyway (see #11496)

Issues addressed:
#9775: Unmarshal does not properly handle NCName in XML namespaces

I think the bug as described is invalid: it's not clear what you'd expect to happen given that the namespace being used is undeclared.
#12624: brittle support for matching a namespace by identifier or url

The exact requirement behind this bug is not totally clear: it appears that the user wants unmarshaled elements that have one of a number of namespaces. I don't understand "Xmlns wasn't defined, but the namespace was used (ie. for mRSS with media namespace)" - that sounds like invalid
XML.

4. "No namespace" indistinguishable from "any namespace" in struct tags

When decoding, the tag xml:"foo" means element "foo" in any namespace. There's no way to say that you want foo in the null namespace. i.e.

   <foo xmlns="" />

This is a problem if a namespaced sibling of the same localname also exists. #8535 demonstrates this quite clearly.

Proposed approach:

Introduce a way of explicitly referencing the null namespace, e.g.

  `xml:"_ foo"`

We could go for the logical, but horribly subtle:

  `xml:" foo"`

(note the space before foo)

Issues addressed:
#8535: failure to handle conflicting tags in different namespaces
#11724: namespaced and non-namespaced attributes conflict

5. Bug: default namespace not set to null on un-namespaced children

It's not currently possible to produce the following XML:

<a xmlns="b">
  <c xmlns=""/>
</a>

If you produce <c> with a tag of:

`xml:"c"`

No xmlns declaration will be added, so <c> will inherit the namespace of it's parent <a>. This is related to issue (4): we don't currently distinguish between "any namespace" and "no namespace".

I can see two possible solutions here:

  1. Treat xml:"c" as meaning "no namespace" and insert xmlns="" as required to make that so.
  2. Treat xml:"c" as meaning "any namespace" and make it inherit the namespace of its parent. If you really want no namespace, use xml:"_ c" (or whatever notation we settle on for (4))

I can see arguments both ways.

Issues addressed
#7113: encoding/xml: missing nested namespace not handled

6. Bug: xmlns attributes not removed from Token (Decode/Encode not idempotent)

Decoder includes xmlns attributes in start element tokens. For example, an attribute of xmlns:foo="bar" would be included as an attribute with a name of {Space: "xmlns", Local: "foo"}. This is very dubious. xmlns attributes are special, but which ever way you look at it "xmlns" is not
a namespace URI - if anything, it's a prefix.

This creates problems if you feed the output of a Decoder into an Encoder, as it treats "xmlns" as a namespace URI, and introduces namespace declarations for it.

There's no good reason to include these attributes. It's reasonable to expose the current set of namespace bindings (see point 2), but the attributes themselves are not needed. If a user really wants to do their own namespace processing, they should use RawToken.

Proposed solution:

  • xmlns and xmlns:foo attributes should be stripped from the list of attributes returned by Token. They should be retained on RawToken.

Issues addressed:
#7535 Encoder duplicates namespace tags

7. Specifying xmlns attributes manually: allow or disallow?

Should we allow users to manually insert xmlns:* or xmlns="..." attributes?
#8167: disallow attributes named xmlns:*
#11431: encoding/xml: loss of xmlns= in encoding since Go 1.4

I don't think we need to support this, given the mechanism introduced under (1) and (3) above. One of the reasons why you might want to do it, is because namespace URIs are otherwise hard-coded into struct tags. The solution to (3) gives us a mechanism to avoid this.

That said, I'm struggling to see why we couldn't treat this as a call to add a preferred prefix - although there's a question of whether it should force the creation of the xmlns declaration if it's already in
scope.

8. Other issues

#11735: empty namespace conventions are badly documented

Yes, this should be clearer.
#8068: encoding/xml: empty namespace prefix definitions should be illegal

It sounds like this should be resolved as invalid.

  1. "No namespace" indistinguishable from "any namespace" in struct tags

If 4 is implemented, it would be nice if the null value could be on the namespace or on the local element name. Eg. if I want to unmarshal either of the following two XMPP errors:

<defined-condition xmlns='urn:ietf:params:xml:ns:xmpp-stanzas'/>
<invalid-xml xmlns='urn:ietf:params:xml:ns:xmpp-stanzas'/>

I might do something like the following:

`xml:"urn:ietf:params:xml:ns:xmpp-stanzas _"`

This could also use the extra space formatting, but that makes things unreadable at a glance, so I definitely wouldn't recommend it.

An interesting use case. Unfortunately, I think it risks creating confusion: (4) suggests using _ to mean "no namespace" as opposed "any namespace". Your use case is asking for "any localname", so is not consistent with the semantics we're assigning to _. We could consider another character (e.g. *). It probably warrants a separate issue, as I think it's independent of what we do for namespaces.

Ah, I think I was misunderstanding what 4 was proposing. I've filed #14433 as a separate issue instead. Sorry for the noise.

I am working on this issue now as my company needs to have this functionality gap covered ASAP. I would like to send some patch soon (i.e. this week).

Is there any patches proposed for this yet, or a branch in progress?

iwdgo commented
  1. Lack of control over prefixes
    #11496 commented as a solution is available in the current version.
    #9519 condensed notation (self -closing tag, prefixed tag name,..) is easily fixed but would not be configurable, i.e. only the shortest notation would be available.
  2. Inability to access/set namespace declaration (handling QName values)
    #12406: support QName values / expose namespace bindings
    Maps and arrays are available in the decoder but are currently unavailable.
  3. Specifying namespaces in tags is cumbersome
    Marshal and unmarshal documentation must be read together.
    #9775: Commented - invalid syntax
    #12624: Commented - invalid syntax
  4. "No namespace" indistinguishable from "any namespace" in struct tags
    #8535: Fix submitted
    #11724: Same fix (duplicate)
  5. Bug: default namespace not set to null on un-namespaced children
    #7113: encoding/xml: missing nested namespace not handled
    No fix yet but issue required other fixes before fixing it.
  6. Bug: xmlns attributes not removed from Token (Decode/Encode not idempotent)
    #7535 Fix submitted
iwdgo commented

#7113 has a proposed fix. This fix keeps the tracking of depth active even when no indent is required. The request to expose the namespace prefixes (i.e. the existing maps) of the tags is fairly simple but there is loss of information on the actual structure of the XML document, i.e. usage might be confusing.

Any likely progress on this issue ?

I am currently having to 'fix' namespaces/bindings in XML documents by passing them through libxml for namespace cleanup. This is less than ideal.

I played on a fork of encoding/xml in my repository github.com/m29h/xml/

  1. It produces always prefixes for namespaced elements in the same way as it is standard for namespaced attributes
  2. The element attributes are sorted before rendering as per the rules of C14N-XML. The byte sequence marshaled output will (at least in typical cases) be C14N Canonical XML.
  3. It can be a drop in replacement for encoding/xml that only changes marshaling behaviour but does not change/limit any interface or other feature today known from encoding/xml.
  4. No external dependencies of the module. To ensure that this is a serious attempt, I kept/adapted 100% of the existing unit tests from encoding/xml and all of them pass with the new serialization behavior.

It does not address all of the issues addressed above, but at least relieves my own pain in my particular use-case.

For someone just looking a way to compose valid request for certain server, there is working workaround: #9519 (comment)