AICC/CMI-5_Spec_Current

Implications of trimming whitespace from course structure

Closed this issue · 11 comments

The following xml is taken from the complex cmi5 course structure in the spec.

<url>
        http://courses.example.edu/identifiers/courses/d07e186b/blocks/001/aus/64f6/launch
</url>

If I read this verbatim, it is not a valid url. In order to extract a valid URL the whitespace would need to be trimmed (which is required by the specification).

If I trim the whitespace from it will mean that relative URL'S cannot begin with a space.

So if the url was:

<url> launch.html</url>

If the whitespace is removed the file would not be found in the course structure.

The following might be affected by whitespace trimming:

  • url (no relative url with leading or trailing spaces)
  • launchParameters
  • entitlementKey
  • Vendor Specific Metadata (Extensions)

I understand that this is a corner case but I think it is note worthy.

(for future readers, read the other comments as this one only addresses one narrow part of the spec and does not address other statements in the spec that modify final behavior)

Your example is a bit misleading, as it conflates XML parsing with browser rendering. This demonstrates that the browser's Javascript-based XML parser preserves all whitespace when reading XML (see demo). The <pre> tags are (nearly) the only way to stop an HTML renderer from consuming spaces.

And indeed, in a vacuum, this is correct. The XML specification is very clear about default parsing behavior, which is to preserve whitespace unless you explicitly provide an xml:space="default" (default is `xml:space="preserve"): https://www.w3.org/TR/xml/#sec-white-space

But we're also defining the Course Structure with an XSD, specifically https://github.com/AICC/CMI-5_Spec_Current/blob/quartz/v1/CourseStructure.xsd . If we refer to the XML Schema W3C spec: https://www.w3.org/TR/xmlschema11-2/#rf-whiteSpace , you'll see the following:

whiteSpace is applicable to all ·atomic· and ·list· datatypes. For all ·atomic· datatypes other than string (and types ·derived· by ·facet-based restriction· from it) the value of whiteSpace is collapse and cannot be changed by a schema author; for string the value of whiteSpace is preserve;

The url element is defined as:

      <xs:element name="url">
        <xs:simpleType>
          <xs:restriction base="xs:anyURI">
            <xs:minLength value="1"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:element>

If we follow xs:anyURI's definition: https://www.oreilly.com/library/view/xml-schema/0596002521/re56.html

<xs:simpleType name="anyURI" id="anyURI">
  <xs:restriction base="xs:anySimpleType">
    <xs:whiteSpace value="collapse" fixed="true"/>
  </xs:restriction></xs:simpleType>

You can see it has an xs:whiteSpace value of collapse, which is:

After the processing implied by replace, contiguous sequences of #x20's are collapsed to a single #x20, and any #x20 at the start or end of the string is then removed.

So for this one element, it's clear that it's okay to strip whitespace from the beginning and end of the element, and flatten any internal spacing. Your implementation, therefore, if it's using the XSD, should do this by default for this one element (but not necessarily all of them, you have to verify each one with the XSD, that's why it's there!) If you're not using the XSD, it's up to you to take the post-processed output of your XML parser and do what the XSD declares with each field.

@gavbaa apologies, I have just substantially changed my comment without seeing your reply.

This demonstrates that the browser's Javascript-based XML parser preserves all whitespace when reading XML (see demo).

This was very useful to me, thank you for taking the time to put that together.

To analyze other elements in the XSD, we need to know the following (XML Schema 2.4.1):

[Definition:] Atomic datatypes are those whose ·value spaces· contain only ·atomic values·. Atomic datatypes are anyAtomicType and all datatypes ·derived· from it.

anyURI, for example, is one of the atomic datatypes.

For the others, an xs:element without a named type is of type anyType, which is the parent of anySimpleType, which is the parent of anyAtomicType. Per the whitespace section, anyType isn't an atomic type, so standard XML rules would apply, which would be effectively equivalent to preserve.

That leaves us with:

So would you take issue with the following in the spec:

All leading/trailing whitespace MUST be removed by the LMS on import of the course structure for all of the data elements defined in this section.

So would you take issue with the follow in the spec:

All leading/trailing whitespace MUST be removed by the LMS on import of the course structure for all of the data elements defined in this section.

I don't take issue with it, it's a transformation step that occurs after the XML document is parsed per the XSD rules. The XML Course Structure doesn't have to be canonically parsed only from the XML parser, it's reasonable to have additional transformations occur after that point. Everything I wrote above is strictly about the types chosen and the behavior of XML parsers. The specification goes beyond that, and while it gives a nod to anyURI in the one type, that's superceded by the spec statement as it's done in post-processing.

Also, collapse implies flattening whitespace within the value, not just trimming whitespace on the value. As the spec is currently written, we are saying to preserve whitespace within the value, and trim whitespace outside it. There's no XSD whitespace operation match that correlates to that behavior.

The XML Course Structure doesn't have to be canonically parsed only from the XML parser, it's reasonable to have additional transformations occur after that point

Ok that makes sense.

I am concerned that by removing all leading/trailing whitespace on import that url, launchParameters, entitlementKey and Vendor Specific Metadata cannot begin with a space and therefore there is an implied requirement that isn't as clear as it could be.

I am imagining someone generating an entitlementKey with leading whitespace or an AU authoring tool that created filenames with spaces.

Do you have any thoughts on that?

I am imagining someone generating an entitlementKey with leading whitespace or an AU authoring tool that created filenames with spaces.

@thomasturrell, when you mention filenames with spaces I guess you mean that the filename would be used in the url XML element, like <url> launch.html</url>, or perhaps even like this <url> launch with lots of spaces.html </url>.

As I understand from comments by @gavbaa , the url XML element is defined to be a URI (xs:anyURI). And a URI (as defined in https://www.ietf.org/rfc/rfc2396.txt) can't contain spaces, they must be encoded as %20 (or + if the space is in a query string parameter).

Maybe it is not specified in the cmi5 spec (sorry, I haven't checked deeply) but at least I think it is reasonable that the URL values used in the url element should be similar to the values that can be used for the well-known HTML5 href attribute. This attribute can contain spaces and other whitespace at the start or end, but such whitespace will be trimmed before the value is used. And the rest of the value needs to be a valid relative or absolute URL.

I found some description of the href attribute here:
https://html.spec.whatwg.org/multipage/links.html#links-created-by-a-and-area-elements

The href attribute on a and area elements must have a value that is a valid URL potentially surrounded by spaces.

And more definition of valid values here:
https://html.spec.whatwg.org/multipage/urls-and-fetching.html#terminology-2

A string is a valid non-empty URL if it is a valid URL string but it is not the empty string.
A string is a valid URL potentially surrounded by spaces if, after stripping leading and trailing ASCII whitespace from it, it is a valid URL string.
A string is a valid non-empty URL potentially surrounded by spaces if, after stripping leading and trailing ASCII whitespace from it, it is a valid non-empty URL.

My two cents are that leading or ending spaces in the url XML element will be ignored, so it is OK to keep the whitespace in the examples if it enhances readability. And a URL (after trimming leading and ending whitespace) can't contain any spaces in order for the URL to work. Any spaces (and other characters excluded from URIs) must be encoded.

So if the spaces are intended then the values must be like this: <url>%20launch.html</url>, or perhaps even like this example (with intended whitespaces at start and end, which will be ignored when parsing the XML) <url> launch%20with%20lots%20%20%20%20of%20spaces.html </url>.

Should the spec detail more about the values of the XML format, and especially about the expectations regarding encoding of the elements that have something do to with URLs and URL query string parameters?

@geirfp, thank you for you reply, it's very useful to get a broad perspective.

when you mention filenames with spaces I guess you mean that the filename would be used in the url XML element, like launch.html

Yes exactly that. When a course is packaged in ZIP Format the URL is meant to be relative (presumably relative to the cmi5.xml which is always in the root directory of the zip file).

My understanding is that xs:anyURI does not permit all of the same strings as a file path in a zip file (such as files paths that begin with spaces).

As well as spaces a file in a zip can contain # which would have special meaning in a URL.

The following is a valid filename:

<url>launch.html#hello</url>

The resulting URL might be something similar to:

https://example.com/launch.html#hello

But launch.html wouldn't exist on the server because the filename would be launch.html#hello!

So if the spaces are intended then the values must be like this: %20launch.html

One problem with that approach is a file in a zip can be prefixed with %20 or a space. %20 does not mean space in a filename in a zip file.

It's quite possibly an extreme edge case, and I am taking up too much time on this issue. However I usually find that if I can think of an edge case eventually someone will implement it.

IMO, it's invalid to try to use rules around what's contained in a ZIP package to have any overlap with what's put in a URL/URI/IRI storage field. They are completely separate indications of storage and routable location, and any overlap they have is coincidental.

Since we have the basis of "the attribute must be launchable in a web browser" (this isn't spec wording, I'm summarizing), it doesn't matter what's in the ZIP file. If someone names something weird in the ZIP, it's their responsibility to form a URI that is adequately capable of referencing that asset, just like they would have to to reference the asset as an import in any other HTML file that references that asset. Browser rules apply at that level, not ZIP rules.

And since that's the case, I don't see any point in spelling out those rules, they're defined more than adequately by various RFCs.

Per the July 16th Meeting, The group agreed that leading /trailing whitespace is addressed in section 13.1 of the specification and that XML processing rules can handle file naming/query string issues.