gnieh/fs2-data

`fs2.data.xml.XmlException: character 'ʿ' cannot start a NCName`

armanbilge opened this issue · 6 comments

Via http4s/http4s-scala-xml#25 (comment).

//> using scala "3.1.2"
//> using lib "org.gnieh::fs2-data-xml-scala::1.4.1"

import cats.effect.*
import fs2.*
import scala.xml.*

val xml = """<Ẵ줐샃뗧饜孫 悊頃ふ퉞="ꨍ邭䋒↏᲎ừ" 듸괎:ʿक턻뽜="촏"/>"""

object App extends IOApp.Simple {

  def run = for
    _ <- IO(XML.loadString(xml)) *> IO.println("scala-xml works")
    _ <- Stream.emit(xml).covary[IO].through(fs2.data.xml.events()).compile.drain *> IO.println("fs2-data works")
  yield ()

}

Output:

scala-xml works
fs2.data.xml.XmlException: character 'ʿ' cannot start a NCName
        at fs2.data.xml.internals.EventParser$.fail$1$$anonfun$1(EventParser.scala:40)
        at fs2.Pull$$anon$2.cont(Pull.scala:183)
        at fs2.Pull$BindBind.cont(Pull.scala:701)
        at fs2.Pull$ContP.apply(Pull.scala:649)
        at fs2.Pull$ContP.apply$(Pull.scala:648)
        at fs2.Pull$Bind.apply(Pull.scala:657)
        at fs2.Pull$Bind.apply(Pull.scala:657)
        at fs2.Pull$.go$1$$anonfun$1(Pull.scala:1207)
        at fs2.Pull$.interruptGuard$1$$anonfun$1(Pull.scala:933)
        at get @ fs2.internal.Scope.openScope(Scope.scala:281)
        at flatMap @ fs2.Compiler$Target.flatMap(Compiler.scala:162)
        at flatMap @ fs2.Pull$.goCloseScope$1$$anonfun$1$$anonfun$3(Pull.scala:1187)
        at update @ fs2.internal.Scope.releaseChildScope(Scope.scala:227)
        at flatMap @ fs2.Compiler$Target.flatMap(Compiler.scala:162)
        at flatMap @ fs2.Compiler$Target.flatMap(Compiler.scala:162)
        at flatMap @ fs2.Compiler$Target.flatMap(Compiler.scala:162)
        at flatMap @ fs2.Compiler$Target.flatMap(Compiler.scala:162)
        at flatMap @ fs2.Compiler$Target.flatMap(Compiler.scala:162)
        at flatMap @ fs2.Compiler$Target.flatMap(Compiler.scala:162)
        at modify @ fs2.internal.Scope.close(Scope.scala:262)
        at flatMap @ fs2.Compiler$Target.flatMap(Compiler.scala:162)
        at flatMap @ fs2.Pull$.goCloseScope$1$$anonfun$1(Pull.scala:1188)
        at handleErrorWith @ fs2.Compiler$Target.handleErrorWith(Compiler.scala:160)
        at flatMap @ fs2.Pull$.goCloseScope$1(Pull.scala:1195)
        at get @ fs2.internal.Scope.openScope(Scope.scala:281)

Adding ScalaCheck-based tests, as proposed in scala/scala-xml#606, would help catch discrepancies like this in fs2-data itself.

I fear this is a limitation of the current character enumeration method. I need to dig deeper.

After investigating further, I understand what the problem is. The fs2-data XML parser follows the XML namespaces rules, which restrict the set of characters allowed in element and attribute identifiers.

> The character classes defined here can be derived from the Unicode 2.0 character database as follows:
>
> Name start characters must have one of the categories Ll, Lu, Lo, Lt, Nl.
>
> Name characters other than Name-start characters must have one of the categories Mc, Me, Mn, Lm, or Nd.
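
To make the two quoted rules concrete, here is a rough sketch that applies them with the JVM's `Character.getType`. It is not the fs2-data implementation: the object and method names are made up for illustration, only the two rules quoted above are modelled (none of the further exceptions the specification may list), and the JVM uses its own, newer Unicode tables rather than Unicode 2.0.

```scala
// Hypothetical helper, not part of fs2-data: models only the two quoted rules.
object NameStartByCategory {

  // Categories allowed at the start of a name: Ll, Lu, Lo, Lt, Nl.
  private val startCategories: Set[Int] = Set(
    Character.LOWERCASE_LETTER,
    Character.UPPERCASE_LETTER,
    Character.OTHER_LETTER,
    Character.TITLECASE_LETTER,
    Character.LETTER_NUMBER
  ).map(_.toInt)

  // Additional categories allowed after the first character: Mc, Me, Mn, Lm, Nd.
  private val otherNameCategories: Set[Int] = Set(
    Character.COMBINING_SPACING_MARK,
    Character.ENCLOSING_MARK,
    Character.NON_SPACING_MARK,
    Character.MODIFIER_LETTER,
    Character.DECIMAL_DIGIT_NUMBER
  ).map(_.toInt)

  // True if the character may start a name under the quoted rule.
  def isStart(c: Char): Boolean =
    startCategories(Character.getType(c))

  // True if the character may appear anywhere in a name under the quoted rules.
  def isNamePart(c: Char): Boolean =
    isStart(c) || otherNameCategories(Character.getType(c))
}
```

Under these assumptions, `'\u02BF'` (the `ʿ` from the error message) has category Lm, so `isNamePart` accepts it but `isStart` does not, which matches the "cannot start a NCName" failure.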

I might make this configurable through an option (NCName parsing or not).

Would that be acceptable to you?

It looks like I was referring to an obsolete definition of names, so I actually need to change it…
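
For reference, a minimal sketch of a range-based check, assuming the current definition meant here is the `NameStartChar` production of XML 1.0 (Fifth Edition) with `:` excluded for NCNames. The object and method names are hypothetical, and this is not necessarily how fs2-data ends up implementing it.

```scala
// Hypothetical helper, not part of fs2-data.
// Assumption: ranges follow the XML 1.0 Fifth Edition NameStartChar
// production, minus ':' because an NCName cannot contain a colon.
object NameStartByRange {

  // Inclusive code point ranges.
  private val ranges: List[(Int, Int)] = List(
    'A'.toInt -> 'Z'.toInt,
    '_'.toInt -> '_'.toInt,
    'a'.toInt -> 'z'.toInt,
    0xC0 -> 0xD6,
    0xD8 -> 0xF6,
    0xF8 -> 0x2FF,
    0x370 -> 0x37D,
    0x37F -> 0x1FFF,
    0x200C -> 0x200D,
    0x2070 -> 0x218F,
    0x2C00 -> 0x2FEF,
    0x3001 -> 0xD7FF,
    0xF900 -> 0xFDCF,
    0xFDF0 -> 0xFFFD,
    0x10000 -> 0xEFFFF
  )

  // True if the code point may start an NCName under the assumed production.
  def isNCNameStart(codePoint: Int): Boolean =
    ranges.exists { case (lo, hi) => lo <= codePoint && codePoint <= hi }
}
```

With these ranges, U+02BF (`ʿ`) falls inside `0xF8 -> 0x2FF` and is accepted as a start character, while `:` and digits are still rejected.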

Glad you figured it out. I have no clue about this stuff, just reporting the discrepancy I discovered. Appreciate your work!!

Btw, Ross ended up publishing ScalaCheck instances for scala-xml:
https://github.com/typelevel/scalacheck-xml