ocsigen/tyxml

Spaces between tags are treated as pcdata

Mbodin opened this issue ยท 7 comments

Mbodin commented

Consider the following OCaml file:

open Tyxml
let%svg test = {|<g> </g>|} (* Notice the space between the opening and closing of the g tag. *)

The content of the g tag is interpreted as a [> `PCDATA ] Tyxml_svg.elt which is incompatible with [< Svg_types.g_content ] Tyxml_svg.elt, and I get a type error. In other words the space between the opening and closing of the g tag is interpreted as a pcdata. The same goes if I place a newline instead of the space.

I'm surprised as the SVG specification itself uses a lot of spacing in its examples. For instance the following SVG is provided as a valid example in https://www.w3.org/TR/2003/REC-SVG11-20030114/struct.html#GroupsOverview despite having the same spacing between its g tags.

<?xml version="1.0" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" 
  "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg width="4in" height="3in" version="1.1"
     xmlns="http://www.w3.org/2000/svg">
  <desc>Groups can nest
  </desc>
  <g>
     <g>
       <g>
       </g>
     </g>
   </g>
</svg>

I don't know the SVG specification enough to state what is meant to be there: should text with only white spaces be ignored instead of being treated as pcdata?

Drup commented

We had a similar issue in HTML (I though it was in table). Some elements should have a special parsing mode to be white-space insensitive. We introduced it for appropriate HTML elements, so it's a matter of reusing that for the appropriate SVG elements.

Mbodin commented

Thanks! I can see the discussion in #225.
I haven't found a place in the SVG specification listing which elements should be whitespace insensitive and which shouldn't. Would treating any tags not accepting pcdata as whitespace insensitive be too much?

Drup commented

I'm not quite sure. @aantron , you probably have a bit more expertise there, do you have an opinion ?

I don't have an opinion on this. All I can say is that for let%svg it should probably be parsed or adjusted according to the SVG specification, which I am not familiar with. For let%html it should be parsed according to the HTML specification. SVG tags inside an HTML document are not parsed according to the SVG specification, but according to the HTML rules for foreign elements inside HTML, which specifically includes rules for SVG. So on a syntax level, SVG inside HTML is not SVG proper, but looks like SVG.

Drup commented

So on a syntax level, SVG inside HTML is not SVG proper, but looks like SVG.

<insert appropriate swearing/> ๐Ÿคฆ

I think we can probably gloss over that.

@Mbodin I agree with your proposal, to ignore all whitespace for elements that don't accept pcdata. It might be worth glancing at the spec if something is specified in that regard, but otherwise, that will suffice.

Is there a fix for this?

It seems like pcdata should be allowed, especially for examples that use <text>somestuff</text>.

I have been unable to work with types like this which is unfortunate.

Mbodin commented

@ncitron I think that the above-mentioned branch https://github.com/Mbodin/tyxml/tree/whitespace (PR #331) fixes the issue.

pcdata is definitively allowed. I did not (at least intentionally) change the behaviour of any tag already accepting pcdata, so the behaviour of <text> should be unchanged. I however accepted white-space-only pcdata (which are ignored) in all the other tags.