/rnv

:tropical_fish: Relax NG Compact Syntax validator by David Tolpin; official upstream maintenance repository

Primary LanguageCOtherNOASSERTION

RNV -- Relax NG Compact Syntax Validator in C

Version 1.7 

   Table of Contents

   News since 1.6
   New since 1.5 
   Aknowledgements
   Package Contents 
   Installation 
   Invocation 
   Limitations 
   Applications

        ARX 
        RVP 

   User-Defined Datatype Libraries

        Datatype Library Plug-in
        Scheme Datatypes

   New versions 

   Abstract

   RNV is an implementation of Relax NG Compact Syntax,
   http://relaxng.org/compact-20021121.html. It is written in ANSI C,
   the command-line utility uses Expat,
   http://www.jclark.com/xml/expat.html. It is distributed under BSD
   license, see license.txt for details.

   RNV is a part of an on-going work, and the current code can have bugs
   and shortcomings; however, it validates documents against a number of
   grammars. I use it.

News since 1.6

   The format for error messages is similar to that of Jing (file name,
   line and column are colon-separated). Entities and DTD processing is
   moved out of RNV, use XX, available from the same download location,
   to expand entities.

New since 1.5

   Better reporting: required and permitted content is reported
   separately; it helps debug grammars. Several bugfixes; I relied on an
   acquired test suite and published schemata, but have found that I can
   make more bugs than they cover, thus a reworked an extended test suite
   is now used for testing. The code has also been cleaned up and
   simplified in places during porting to Plan9.

Aknowledgements

   I would like to thank those who have helped me develop RNV.

   Dave Pawson has been the first user of the program.

   Alexander Peshkov helps me with testing and I have been able to
   correct very well hidden errors with his help.

   Sebastian Rahtz encouraged me to continue working on RNV since the
   first release, and has helped me to improve it on more than one
   occasion.

Package Contents

Note

   I have put rnv.exe and arx.exe, Win32 executables statically linked
   with a current version of Expat from
   http://expat.sourceforge.net/, into a separate distribution
   archive (with name ending in -win32bin). It contains only the program
   binaries and should be available from the same location as the source
   distribution.

   The package consists of:
     * the license, license.txt;
     * the source code, *.[ch];
     * the source code map, src.txt;
     * Makefile.bsd for BSD make;
     * Makefile.gnu for GNU Make;
     * Makefile.bcc for Win32 and Borland C/C++ Compiler;
     * tools/xck, a simple shell script I am using to validate documents;
     * tools/*.rnc, sample Relax NG grammars;
     * scm/*.scm, program modules in Scheme, for Scheme Datatypes
       Library;
     * the log of changes, changes.txt;
     * this file, readme.txt.
     * Other scripts, samples and plug-ins appear in tools/ eventually.

Installation

   On Unix-like systems, run make -f Makefile.gnu or make -f
   Makefile.bsd, depending on which flavour of make you have;
   Makefile.bsd should probably work on SysV, but, unfortunately, I have
   no place to check for the last couple of years. If you are using Expat
   1.2, define EXPAT_H as xmlparse.h instead of expat.h).

   On Windows, use rnv.exe. To recompile from the sources, use
   Makefile.bcc with Borland C/C++ Compiler, or create a makefile or
   project for your environment.

Invocation

   The command-line syntax is

        rnv {-q|-p|-c|-s|-v|-h} grammar.rnc {document1.xml}

   If no documents are specified, RNV attempts to read the XML document
   from the standard input. The options are:

   -q
          names of files being processed are not printed; in error
          messages, expected elements and attributes are not listed;

   -n <num>
          sets the maximum number of reported expected elements and
          attributes, -q sets this to 0 and can be overriden;

   -p
          copies the input to the output;

   -c
          if the only argument is a grammar, checks the grammar and
          exits;

   -s
          uses less memory and runs slower;

   -v
          prints version number;

   -h
          displays usage summary and exits.

Limitations

     * RNV assumes that the encoding of the syntax file is UTF-8.
     * Support for XML Schema Part 2: Datatypes is partial.
          + ordering for duration is not implemented;
          + only local parts of QName values are checked for equality,
            ENTITY values are only checked for lexical validity.
     * The schema parser does not check that all restrictions are obeyed,
       in particular, restrictions 7.3 and 7.4 are not checked.
     * RNV for Win32 platforms is a Unix program compiled on Win32. It
       expects file paths to be written with normal slashes; if a schema
       is in a different directory and includes or refers external files,
       then the schema's path must be written in the Unix way for the
       relative paths to work. For example, under Windows, rnv that uses
       ..\schema\docbook.rnc to validate userguide.dbx should be invoked
       as

      rnv.exe ../schema/docbook.rnc userguide.dbx

Applications

   The distribution includes several utilities built upon RNV; they are
   listed and described in the following sections.

ARX

   ARX is a tool to automatically determine the type of a document from
   its name and contents. It is inspired by James Clark's schema location
   approach for nXML,
   http://groups.yahoo.com/group/emacs-nxml-mode/message/259, and is
   a development of the idea described in
   http://relaxng.org/pipermail/relaxng-user/2003-December/000214.htm
   l.

   ARX is a command-line utility. The invocation syntax is

        arx {-n|-v|-h} document.xml  arx.conf {arx.conf}

   ARX either prints a string corresponding to the document's type or
   nothing if the type cannot be determined. The options are:

   -n
          turns off prepending base path of the configuration file to the
          result, even if it looks like a relative path (useful when the
          configuration file and the grammars are in separate
          directories, or for association with something that is not a
          file);

   -v
          prints current version;

   -h
          displays usage summary and exits.

   The configuration file must conform to the following grammar:

      arx = grammars route*
      grammars = "grammars"  "{" type2string+ "}"
      type2string =  type "=" literal
      type = nmtoken
      route = match|nomatch|valid|invalid
      match = "=~" regexp "=>" type
      nomatch = "!~" regexp "=>" type
      valid = "valid" "{" rng "}" "=>" type
      invalid = "!valid" "{" rng "}" "=>" type

      literal=string in '"', '"' inside must be prepended by '\'
      regexp=string in '/', '/' inside must be prepended by '\'
      rng=Relax NG Compact Syntax

      Comments start with # and continue till the end of line.

   Rules are processed sequentially, the first matching rule determines
   the file's type. Relax NG templates are matched against file contents,
   regular expressions are applied to file names. The sample below
   associates documents with grammars for XSLT, DocBook or XSL FO.

      grammars {
        docbook="docbook.rnc"
        xslt="xslt.rnc"
        xslfo="fo.rnc"
      }

      valid {
        start = element (book|article|chapter|reference) {any}
        any = (element * {any}|attribute * {text}|text)*
      } => docbook

      !valid {
        default namespace xsl = "http://www.w3.org/1999/XSL/Transform"
        start = element *-xsl:* {not-xsl}
        not-xsl = (element *-xsl:* {not-xsl}|attribute * {text}|text)*
      } => xslt

      =~/.*\.xsl/ => xslt
      =~/.*\.fo/ => xslfo

   ARX can also be used to link documents to any type of information or
   processing.

RVP

   RVP is abbreviation for Relax NG Validation Pipe. It reads validation
   primitives from the standard input and reports result to the standard
   output; it's main purpose is to ease embedding of a Relax NG validator
   into various languages and environment. An application would launch
   RVP as a parallel process and use a simple protocol to perform
   validation. The protocol, in BNF, is:

     query ::= (
           quit
         | start
         | start-tag-open
         | attribute
         | start-tag-close
         | text
         | end-tag) z.
       quit ::= "quit".
       start ::= "start" [gramno].
       start-tag-open ::= "start-tag-open" patno name.
       attribute ::= "attribute" patno name value.
       start-tag-close :: = "start-tag-close" patno name.
       text ::= ("text"|"mixed") patno text.
       end-tag ::= "end-tag" patno name.
     response ::= (ok | er | error) z.
       ok ::= "ok" patno.
       er ::= "er" patno erno.
       error ::= "error" patno erno error.
     z ::= "\0" .

     * RVP assumes that the last colon in a name separates the local part
       from the namespace URI (it is what one gets if specifies `:' as
       namespace separator to Expat).
     * Error codes can be grabbed from rvp sources by grep _ER_ *.h and
       OR-ing them with corresponding masks from erbit.h. Additionally,
       error 0 is the protocol format error.
     * Either er or error responses are returned, not both; -q chooses
       between concise and verbose forms (invocation syntax described
       later).
     * start passes the index of a grammar (first grammar in the list of
       command-line arguments has number 0); if the number is omitted, 0
       is assumed.
     * quit is not opposite of start; instead, it quits RVP.

   The command-line syntax is:

        rvp {-q|-s|-v|-h} {schema.rnc}

   The options are:

   -q
          returns only error numbers, suppresses messages;

   -s
          takes less memory and runs slower;

   -v
          prints current version;

   -h
          displays usage summary and exits.

   To assist embedding RVP, samples in Perl (tools/rvp.pl) and Python
   (tools/rvp.py) are provided. The scripts use Expat wrappers for each
   of the languages to parse documents; they take a Relax NG grammar (in
   the compact syntax) as the command line argument and read the XML from
   the standard input. For example, the following commands validate
   rnv.dbx against docbook.rnc:

      perl rvp.pl docbook.rnc < rnv.dbx
      python rvp.py docbook.rnc < rnv.dbx

   The scripts are kept simple and unobscured to illustrate the
   technique, rather than being designed as general-purpose modules.
   Programmers using Perl, Python, Ruby and other languages are
   encouraged to implement and share reusable RVP-based components for
   their languages of choice.

User-Defined Datatype Libraries

   Relax NG relies on XML Schema Datatypes to check validity of data in
   an XML document. The specification allows the implementation to
   support other datatype libraries, a library is required to provide two
   services, datatypeAllows and datatypeEqual.

   A powerful and popular technique is the use of string regular
   expressions to restrict values of attributes and character data.
   However, XML Schema regular expressions must be written as single
   strings, without any parameterization; they often grow to several
   dozens of characters in length and are very hard to read or debug.

   A solution for these problem would be to allow the user to define
   custom datatypes and to specify them in a high-level programming
   language. The user can then either use regular expressions as such,
   employ lex for lexical analysis, or any other technique which is best
   suited for each particular case (for example XSL FO datatypes would
   benefit from a custom datatype library). With many datatype libraries
   eventually implemented, it is likely that a clearer picture of the
   right language for validation of data will eventually emerge.

   RNV provides two different ways to implement this solution; I believe
   that they correspond to different tastes and traditions. In both
   cases, a high-level language can be used to implement a datatype
   library, the language is not related to the implementation language of
   RNV, and RNV need not be recompiled to add a new datatype library.

Datatype Library Plug-in

   A datatype plug-in is an executable. RNV invokes it as either
  program allows type key value ... data

   or
  program equal type data1 data2

   program is the executable's, name, the rest is the command line; key
   and value pairs are datatype parameters and can be repeated. The
   program is executed for each datatype in library
   http://davidashen.net/relaxng/pluggable-datatypes; if the exit status
   is 0 for success, non-zero for failure.

   Both RNV and RVP can use pluggable datatypes, and must be compiled
   with DXL_EXC set to 1 (make DXL_EXC=1) to support them, in which case
   they accept an additional command-line option -d with the name of the
   plugin as the argument. An implementation of XML Schema datatypes as a
   plugin (in C) is included in the distribution, see xsdck.c. For
   example,
    rnv -d xsdck xslt-dxl.rnc $HOME/work/docbook/xsl/*/*.xsl

   will validate all DocBook XSL stylesheets on my workstation against a
   grammar for XSLT 1.0 modified to use RNV Pluggable Datatypes Library
   instead of XML Schema Datatypes.

Scheme Datatypes

   Another way to add custom datatypes to RNV is to use the built-in
   Scheme interpeter (SCM,
   http://www.swiss.ai.mit.edu/~jaffer/SCM.html) to implement the
   library in Scheme, a dialect of Lisp. This solution is more flexible
   and robust than the previous one, but requires knowledge of a
   particular programming language (or at least desire to learn it, and
   the result is definitely worth the effort).

   To support it, SCM must be installed on the computer, and RNV or RVP
   must be compiled with DSL_SCM set to 1 (make DSL_SCM=1), in which case
   they accept an additional option -e with the name of a scheme program
   as an argument. The datatype library is bound to
   http://davidashen.net/relaxng/scheme-datatypes; a sample
   implementation is in scm/dsl.scm. For example,
    rnv -e scm/dsl.scm xslt-dsl.rnc $HOME/work/docbook/xsl/*/*.xsl

   check the stylesheets against an XSLT 1.0 grammar modified to use an
   RNV Scheme Datatypes Library implemented in scm/dsl.scm.

   A Datatype Library in Scheme must provide two functions in top-level
   environment:
(dsl-equal? string string string)

   and
(dsl-allows? string '((string . string)*) string)

   To assist development of datatype libraries, a Scheme implementation
   of XML Schema Regular Expressions is included in the distribution as
   scm/rx.scm. The Regular Expression library is not just a way to
   re-implement the built-in datatypes. Owing to flexibility of the
   language it is much easier to write and debug regular expressions in
   Scheme, even if they are to be used with built-in XML Schema Datatypes
   in the end. For example, a regular expression for e-mail address, with
   insignificant simplifications, is:
    pattern=
      "(\(([^\(\)\\]|\\.)*\) )?"
    ~ "([a-zA-Z0-9!#$%&'*+\-/=?\^_`{|}~]+"
    ~ "(\.[a-zA-Z0-9!#$%&'*+\-/=?\^_`{|}~]+)*"
    ~ """|"([^"\\]|\\.)*")"""
    ~ "@"
    ~ "([a-zA-Z0-9!#$%&'*+\-/=?\^_`{|}~]+"
    ~ "(\.[a-zA-Z0-9!#$%&'*+\-/=?\^_`{|}~]+)*"
    ~ "|\[([^\[\]\\]|\\.)*\])"
    ~ "( \(([^\(\)\\]|\\.)*\))?"

   which, even split into four lines, is ugly-looking and hard to read.
   Meanwhile, it consists of a few repeating subexpressions, which could
   easily be factored out, but the syntax does not have the means for
   that.

   Using Scheme interpreter, it is as simple as
(define addr-spec-regex
  (let* (
      (atom "[a-zA-Z0-9!#$%&'*+\\-/=?\\^_`{|}~]+")
      (person "\"([^"\\\\]|\\\\.)\"")
      (location "\\[([^\\[\\]\\\\]|\\\\.)*\\]")
      (domain (string-append atom "(\\." atom ")*")))
    (string-append
      "(" domain "|" person ")"
      "@"
      "(" domain "|" location ")")))

   This code is much simpler to read and debug, and then the parts can be
   joined and added to the grammar for production use. Furthermore, it is
   easy to implement the parsing of structured regular expressions
   embedded into parameters of datatypes in Relax NG itself. dsl.scm, the
   sample datatype library, can handle parameter s-pattern with regular
   expressions split into named parts, and the example above becomes:
    s-pattern="""
      comment = "\(([^\(\)\\]|\\.)*\)"
      atom = "[a-zA-Z0-9!#$%&'*+\-/=?\^_`{|}~]+"
      atoms = atom "(\." atom ")*"
      person = "\"([^\"\\]|\\.)*\""
      location = "\[([^\[\]\\]|\\.)*\]"
      local-part = "(" atom "|" person ")"
      domain = "(" atoms "|" location ")"
      start = "(" comment " )?" local-part "@" domain "( " comment ")?"
    """

   addr-spec-dsl.rnc is included in the distribution.

New versions

   Visit http://davidashen.net/ for news and downloads.