lucidsoftware/xtract

How do I read a node that has a text value and child elements?

Opened this issue · 4 comments

I have an XML element that looks like

<Card>
    4111111111111111
    <Type>VISA</Type>
</Card>

that I'm trying to write an XmlReader for. My case class looks like:

case class Card(
    number: String,
    cardType: String
)

The problem I'm having is extracting the card number. I've played around in the REPL and looked at the source code. My first thought was that I could just read the root element:

scala> val xml = <Card>
     |   4111111111111111
     |   <Type>VISA</Type>
     | </Card>
xml: scala.xml.Elem =
<Card>
	4111111111111111
	<Type>VISA</Type>
</Card>

scala> val path = __
path: com.lucidchart.open.xtract.XPath.type =

scala> path(xml)
res0: scala.xml.NodeSeq =
<Card>
	4111111111111111
	<Type>VISA</Type>
</Card>

scala> val reader = path.read[String]
reader: com.lucidchart.open.xtract.XmlReader[String] = com.lucidchart.open.xtract.XmlReader$$anon$1@52f47f0c

scala> reader.read(xml)
res6: com.lucidchart.open.xtract.ParseResult[String] =
ParseSuccess(
	4111111111111111
	VISA
)

This reads everything under the root node, though. My next thought was that maybe I could loop through the child nodes:

scala> path.children(xml)
res8: scala.xml.NodeSeq = NodeSeq(<Type>VISA</Type>)

but that doesn't return the text node. My last thought was to check whether anything changes when <Card> isn't the root element:

scala> val xml = <Root>
     |   <Card>
     |           4111111111111111
     |           <Type>VISA</Type>
     |   </Card>
     | </Root>
xml: scala.xml.Elem =
<Root>
	<Card>
		4111111111111111
		<Type>VISA</Type>
	</Card>
</Root>

scala> val path = (__ \ "Card")
path: com.lucidchart.open.xtract.XPath = /Card

scala> path(xml)
res10: scala.xml.NodeSeq =
NodeSeq(<Card>
		4111111111111111
		<Type>VISA</Type>
	</Card>)

scala> path.read[String].read(xml)
res11: com.lucidchart.open.xtract.ParseResult[String] =
ParseSuccess(
		4111111111111111
		VISA
	)

So that gives the same behavior. Under the hood, stringReader appears to use the text function on NodeSeq:

  /**
   * [[XmlReader]] matches the text of a single node.
   */
  implicit val stringReader: XmlReader[String] = XmlReader { xml =>
    getNode(xml).map(_.text)
  }

The behavior appears to come from NodeSeq.text itself:

scala> val xml: NodeSeq = <Card>
     |   4111111111111111
     |   <Type>VISA</Type>
     | </Card>
xml: scala.xml.NodeSeq =
<Card>
  4111111111111111
  <Type>VISA</Type>
</Card>

scala> xml.text
res16: String =
"
  4111111111111111
  VISA
"

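For comparison, here is a minimal sketch of a workaround that uses only scala-xml and bypasses xtract entirely. It assumes scala-xml is on the classpath; the object name and the `directText` and `parsed` values are made up for illustration. Unlike `text`, `child` lists the direct children with the text nodes kept separate, so the number can be picked out on its own:

```scala
import scala.xml.{Elem, Text, XML}

object DirectTextDemo {
  // Same shape as the case class in the question.
  case class Card(number: String, cardType: String)

  val card: Elem = XML.loadString(
    """<Card>
      |  4111111111111111
      |  <Type>VISA</Type>
      |</Card>""".stripMargin)

  // `child` holds only the direct children (Text nodes included),
  // whereas `text` flattens every descendant into one string.
  val directText: List[String] =
    card.child.collect { case t: Text => t.text.trim }.filter(_.nonEmpty).toList

  // Hypothetical workaround: treat the first non-blank direct text node as the number.
  val parsed: Option[Card] =
    directText.headOption.map(number => Card(number, (card \ "Type").text.trim))

  def main(args: Array[String]): Unit = println(parsed)
}
```
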
Possibly related to #24?

I think the root problem is the way stringReader handles nodes that have both a text node and child elements, which I think is tied to how NodeSeq.text works. I think #24 is thwarting my efforts to work around the problem, so in that way they're related.

But what should the behaviour be? Should it use the concatenation of all text nodes that are direct children (but not deeper descendants), use just the first text node, fail if there are element node children, or use a concatenation of all descendant text nodes (as is currently done)?
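
To make the four options concrete, here is a rough sketch of each as a plain function over a scala-xml `Node`. None of these names exist in xtract; they are hypothetical and only illustrate how the candidates differ. Trimming whitespace is an extra assumption of the sketch.

```scala
import scala.xml.{Elem, Node, Text}

object TextStrategies {
  // Trimmed, non-blank text of the Text nodes that are direct children of `node`.
  private def directTexts(node: Node): List[String] =
    node.child.collect { case t: Text => t.text.trim }.filter(_.nonEmpty).toList

  // Option 1: concatenate the direct child text nodes only.
  def concatDirect(node: Node): String = directTexts(node).mkString

  // Option 2: use just the first direct child text node, if there is one.
  def firstDirect(node: Node): Option[String] = directTexts(node).headOption

  // Option 3: refuse to read a string from a node that has element children.
  def failOnElements(node: Node): Either[String, String] =
    if (node.child.exists(_.isInstanceOf[Elem]))
      Left(s"<${node.label}> has child elements")
    else
      Right(node.text.trim)

  // Option 4: concatenate every descendant text node (what NodeSeq.text does).
  def allDescendants(node: Node): String = node.text
}
```

For the `<Card>` example, options 1 and 2 would give just the number, option 3 would fail, and option 4 would give the number and `VISA` run together with the surrounding whitespace.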

Changing the default from the current behaviour is probably a breaking change, although I'm not sure whether any existing usages rely on the current behaviour in the presence of child elements. I'm not necessarily opposed to changing the default way of parsing text. To be honest, though, failing if there are child elements makes the most sense to me as the default. I pointed at #24 because it is a more general solution, though perhaps not as convenient in your case.

I could probably add something to the XPath and/or XmlReader API to extract the first text node or all direct child text nodes, although I'm not sure what good names for those methods would be. Would that work for you?

You know, I may have forgotten that an XML element could have multiple text nodes. I think you're making a lot of sense, then. Trying to read a node with children as a string is probably undefined behaviour. I think a solution that returns either the direct child text nodes or all child nodes makes sense, and then the user can do a collectFirst or mkString or whatever based on their use case.
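
As a sketch of what that could look like from the caller's side, assume a hypothetical helper `directChildText` (not an existing xtract method) that returns the raw text of each direct child text node. The caller then chooses between collectFirst and mkString:

```scala
import scala.xml.{Node, Text, XML}

object DirectChildText {
  // Hypothetical helper (not part of xtract): raw text of each direct child Text node.
  def directChildText(node: Node): List[String] =
    node.child.collect { case t: Text => t.text }.toList

  // An element with two text nodes separated by a child element.
  val p: Node = XML.loadString("<p>Hello <b>big</b> world</p>")

  // Caller's choice 1: the first non-blank text node.
  val first: Option[String] =
    directChildText(p).collectFirst { case s if s.trim.nonEmpty => s.trim }

  // Caller's choice 2: all direct text nodes, trimmed and joined with a space.
  val joined: String =
    directChildText(p).map(_.trim).filter(_.nonEmpty).mkString(" ")
}
```

Here `first` skips the text inside `<b>`, and `joined` keeps both outer text nodes while still ignoring the nested element.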