Slaw

Slaw is a lightweight library for generating Akoma Ntoso 3.0 Act XML from plain text documents. It is used to power Indigo and uses grammars developed for the legal tradition in South Africa, although other traditions are supported.

Slaw allows you to:

  1. parse plain text and transform it into an Akoma Ntoso Act XML document
  2. unparse Akoma Ntoso XML into a plain-text format suitable for re-parsing

Slaw is lightweight because it wraps around a Nokogiri XML representation of the parsed document. It provides some support methods for manipulating these documents, but anything advanced must manipulate the XML directly.

Installation

Add this line to your application's Gemfile:

gem 'slaw'

And then execute:

$ bundle

Or install it with:

$ gem install slaw

The simplest way to use Slaw is via the command line:

$ slaw parse myfile.text --grammar za

Overview

Slaw generates Acts in the Akoma Ntoso 3.0 XML standard for legislative documents. It first parses plain text using a grammar and then generates XML from the resulting syntax tree.

Most by-laws in South Africa are available as PDF documents. You will therefore need to extract the text from the PDF first, using a tool like pdftotext. PDFs can produce oddities (such as oddly wrapped lines) and Slaw has a number of rules-of-thumb for correcting these. These rules are based on South African by-laws and may not be suitable for all regions.

The grammar is expressed as a Treetop grammar and has been developed specifically for the format of South African acts and by-laws. Grammars for other regions could be developed depending on the complexity of a region's formats.

The grammar cannot catch some subtleties of an act or by-law -- such as nested list numbering -- so Slaw performs some post-processing on the XML produced by the parser. In particular, it nests lists correctly.
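
As an illustration of the kind of post-processing described above, here is a minimal sketch of nesting a flat run of list items by numbering style. It is not Slaw's implementation: it works on plain Ruby hashes rather than XML, and the names number_style and nest_items are hypothetical. It also assumes that "(i)", "(v)" and "(x)" are always roman numerals, an ambiguity a real implementation would have to resolve.

```ruby
# Classify a list number by its numbering style.
# Assumption: "(i)", "(v)" and "(x)" are always treated as roman numerals.
def number_style(num)
  case num
  when /\A\([ivx]+\)\z/ then :roman
  when /\A\([a-z]+\)\z/ then :alpha
  when /\A\([A-Z]+\)\z/ then :upper_alpha
  when /\A\(\d+\)\z/ then :numeric
  else :other
  end
end

# Turn a flat array of [num, text] pairs into nested item hashes.
def nest_items(flat)
  root = { style: nil, items: [] }
  stack = [root] # the innermost open list is last

  flat.each do |num, text|
    style = number_style(num)
    item = { num: num, text: text, children: [] }

    if stack.last[:style].nil?
      # The first item decides the style of the top-level list.
      stack.last[:style] = style
    elsif stack.last[:style] != style
      # Look for an enclosing list that already uses this style.
      idx = stack.rindex { |list| list[:style] == style }
      if idx
        # Close the deeper lists and continue the enclosing one.
        stack = stack[0..idx]
      else
        # Open a new sublist under the most recent item.
        parent = stack.last[:items].last
        stack.push({ style: style, items: parent[:children] })
      end
    end

    stack.last[:items] << item
  end

  root[:items]
end

flat = [['(a)', 'first'], ['(i)', 'sub one'], ['(ii)', 'sub two'], ['(b)', 'second']]
nested = nest_items(flat)
puts nested.map { |i| i[:num] }.inspect
puts nested[0][:children].map { |i| i[:num] }.inspect
```

A change of numbering style either returns to an enclosing list that already uses that style or opens a new sublist under the previous item, which is why the parser alone, seeing one line at a time, cannot decide the nesting.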

Parsing

Slaw uses Treetop to compile a grammar into a backtracking parser. The parser builds a parse tree, the nodes of which know how to serialize themselves in XML format.
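
The idea of nodes that serialize themselves can be sketched without Treetop. The classes below are hypothetical, hand-built stand-ins for the syntax nodes a grammar would produce, and the XML is heavily simplified (no namespace, no eId attributes), so treat it as an illustration of the pattern rather than Slaw's actual output.

```ruby
# Escape the characters that are unsafe in XML text content.
def xml_escape(text)
  text.gsub('&', '&amp;').gsub('<', '&lt;').gsub('>', '&gt;')
end

# Hypothetical leaf node: a paragraph of text.
Paragraph = Struct.new(:text) do
  def to_xml
    "<p>#{xml_escape(text)}</p>"
  end
end

# Hypothetical container node: serializes itself, then delegates to children.
Section = Struct.new(:num, :heading, :paragraphs) do
  def to_xml
    content = paragraphs.map(&:to_xml).join
    "<section><num>#{xml_escape(num)}</num>" \
      "<heading>#{xml_escape(heading)}</heading>" \
      "<content>#{content}</content></section>"
  end
end

section = Section.new('1.', 'Definitions', [Paragraph.new('Fees & charges')])
puts section.to_xml
```

Each node only knows how to write its own element and asks its children to do the same, so the XML structure mirrors the shape of the parse tree.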

Supporting formats from other countries' legal traditions probably requires creating a new grammar and parser.

Adding your own grammar

Slaw can dynamically load your custom Treetop grammars. When called with --grammar xy, Slaw tries to require slaw/grammars/xy/act and instantiate the parser class Slaw::Grammars::XY::ActParser. Slaw always uses the rule act as the root of the parser.

You can create your own grammar by creating a gem that provides these files and classes.
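
The naming convention above can be sketched as a small helper. This is an illustration only: grammar_names is a hypothetical function, not part of Slaw, and it computes the names without loading anything.

```ruby
# Hypothetical helper: maps a --grammar value to the file Slaw would require
# and the parser class it would instantiate, following the convention above.
def grammar_names(grammar)
  {
    require_path: "slaw/grammars/#{grammar}/act",
    parser_class: "Slaw::Grammars::#{grammar.upcase}::ActParser",
    root_rule: :act # Slaw always parses from the act rule
  }
end

names = grammar_names('xy')
puts names[:require_path] # slaw/grammars/xy/act
puts names[:parser_class] # Slaw::Grammars::XY::ActParser
```

A gem providing a grammar for xy would therefore need a file at that require path on the load path, defining that class.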

Contributing

  1. Fork it at http://github.com/longhotsummer/slaw/fork
  2. Install dependencies: bundle install
  3. Create your feature branch: git checkout -b my-new-feature
  4. Write great code!
  5. Run tests: rspec
  6. Commit your changes: git commit -am 'Add some feature'
  7. Push to the branch: git push origin my-new-feature
  8. Create a new Pull Request

Releasing

  1. Update lib/slaw/version.rb
  2. Run rake release

Changelog

13.0.0 (28 June 2022)

  • Generate correct .../!main in FRBR URIs.

12.0.0 (31 January 2022)

  • Use <br/> for newlines in tables, rather than <eol/>, since it's more semantically correct.

11.0.0 (29 October 2021)

  • Prefix eId attributes in attachments with attachment's eId
  • Use crossHeading element for crossheadings

10.7.0 (11 June 2021)

  • Support underlines with __text__

10.6.0 (10 May 2021)

  • Handle sup and sub when extracting from HTML.

10.5.0 (20 April 2021)

  • Handle escaping inlines when unparsing.

10.4.1 (14 April 2021)

  • Handle escaping in inlines, so that forward slashes in link text are unescaped correctly, eg [https:\/\/example.com](https://example.com)

10.4.0 (9 April 2021)

  • Remove dependency on mimemagic. Guess file type based on filename instead.

10.3.1 (11 January 2021)

  • Strip ASCII, Unicode general and Unicode supplemental punctuation from num elements when building eIds

10.2.0 (4 September 2020)

  • support inline superscript ^^text^^
  • support inline subscript _^text^_

10.1.0 (18 June 2020)

  • hcontainer elements have name attributes, to be compliant with AKN 3.0

10.0.0 (12 June 2020)

  • BREAKING: Create XML with AKN 3 namespace (http://docs.oasis-open.org/legaldocml/ns/akn/3.0), AKN2 is no longer supported
  • BREAKING: replace id attributes with eId attributes
  • BREAKING: serialize schedules as attachments to act, not as components as peers of the act
  • BREAKING: anonymous blocks are serialized as hcontainers, not paragraphs
  • BREAKING: crossheading hcontainer IDs correctly use hcontainer
  • Remove unnecessary schemaLocation header in root element

9.2.0 (10 June 2020)

  • Subpart numbers are optional

9.1.0 (15 April 2020)

  • Subsections can have numbers such as 1.1A and 1.1bis

9.0.0 (17 Mar 2020)

  • Support SUBPART

8.0.1 (26 Feb 2020)

  • Fix bug with id prefix on schedules container

8.0.0 (19 Feb 2020)

  • Obey --id-prefix for group nodes
  • Ensure that schedules prefix their children, for those that require it (parts and chapters)

7.0.0 (31 Jan 2020)

  • List ids are now numbered sequentially, rather than by tree position
  • New Slaw::Grammars::Counters helper module

6.2.0 (15 Jan 2020)

  • Better support for ol, ul and li when importing from HTML

6.1.0 (6 Jan 2020)

  • Support Chapters inside Parts

6.0.0 (7 Nov 2019)

  • Give grammars the opportunity to post-process generated XML
  • Move blocklist handling into postprocessing for ZA grammar
  • ZA grammar rewrites schedule aliases to include full text content of headings

5.0.0 (25 Oct 2019)

  • Schedules have a new grammar to make it easier for users to understand headings and subheadings.
  • The way schedule IDs are generated has been simplified.

4.2.0 (7 Sept 2019)

  • BODY is allowed to be empty

4.1.0 (4 June 2019)

  • BODY marks start of body

4.0.0 (29 May 2019)

  • Preserve whitespace for mixed content nodes
  • Don't pretty-print XML, as this can introduce meaningful whitespace

3.4.0 (20 May 2019)

  • Restructure subsections to support generic block elements, starting with an inline block element

3.3.3 (17 May 2019)

  • FIX bug where unparse was returning XML, not text

3.3.2 (15 May 2019)

  • Internal adjustments to make rules easier to override

3.3.1 (15 May 2019)

  • Crossheadings at start of body (ending preface and preamble)

3.3.0 (1 May 2019)

  • Only renest annotated blocklists
  • Table grammar uses additional rules and permits whitespace

3.2.0 (22 April 2019)

  • Permit inline content in chapter, part and section headings

3.1.1 (10 April 2019)

  • FIX don't error when a line is just a backslash

3.1.0 (29 March 2019)

  • Add --ascii flag to %-encode utf-8 strings into US-ASCII for speed. See cjheath/treetop#31

3.0.0 (28 March 2019)

  • Inline bold and italics
  • Support for CROSSHEADING elements using an empty hcontainer until we support AKN 3.0
  • Support for LONGTITLE in PREFACE
  • Remarks and references support nested inline elements
  • BREAKING: clauses rule renamed to inline_elements so as not to clash with real AKN clauses
  • BREAKING: block_paragraphs rule renamed to generic_container and adjusted to be singular to be simpler to understand
  • BREAKING: un-numbered paragraph elements have new ids, that should not clash with numbered paragraphs from other grammars

2.2.0 (18 March 2019)

  • Schedules use hcontainer, not article
  • Schedules allow rich content in title and heading

2.1.0 (18 March 2019)

  • Make subclassing preface statements easier

2.0.0 (15 March 2019)

  • Remove support for PDFs. Do text extraction from PDFs outside of this library.
  • Support dynamically loading grammars from other gems.
  • Don't change ALL CAPS headings to Sentence Case.

1.0.4 (5 February 2019)

1.0.3 (26 September 2018)

  • FIX bug in all grammars that dropped less-than symbols < from input text.

1.0.2 (2 June 2018)

  • FIX bug in ZA grammar when parsing dotted numbered subsections ending with a newline

1.0.1

  • Improved support for other legal traditions / grammars.
  • Add Polish legal tradition grammar.
  • Slaw no longer does too much introspection of a parsed document, since that can be so tradition-dependent.
  • Move reformatting out of Slaw since it's tradition-dependent.
  • Remove definition linking, Slaw no longer supports it.
  • Remove unused code for interacting with the internals of acts.

0.17.2

  • Match defined terms in 'definition' section.
  • Updated nokogiri dependency to 1.8.2

0.17.0

  • Support links and images inside tables, by parsing tables natively.

0.16.0

  • Support --crop for PDFs. Requires poppler pdftotext, not xpdf.

0.15.2

  • Update nokogiri to ~> 1.8.1

0.15.1

  • Ignore non-AKN compatible table attributes

0.15.0

  • Support tables in many non-PDF documents (eg. Word documents) by converting to HTML and then to Akoma Ntoso

0.14.2

  • Convert non-breaking space (\xA0) to space

0.14.1

  • Support links in remarks

0.14.0

  • Support inline image tags, using Markdown syntax: ![alt text](image url)
  • Smarter un-break lines

0.13.0

  • FIX allow Schedule, Part and other headings at the start of blocklist and subsections
  • FIX replace empty CONTENT elements with empty P tags so XML validates
  • Better handling of empty subsections and blocklist items

0.12.0

  • Support links/references using Markdown-like [text](href) syntax.
  • FIX allow remarks in blocklist items

0.11.0

  • Support newlines in table cells as EOL (or BR in HTML)
  • FIX unparsing of remarks, introduced in 0.10.0

0.10.1

  • Ensure backslash escaping handles listIntroductions and partial words correctly

0.10.0

  • New command unparse FILE which transforms an Akoma Ntoso XML document into plain text, suitable for re-parsing
  • Support escaping special words with a backslash

0.9.0

  • This release makes reasonably significant changes to generated XML, particularly for sections without explicit subsections.
  • Blocklists with (aa) following (z) are using the same numbering format.
  • Change how blockList listIntroduction elements are created to be more generic
  • Support for sections that dive straight into lists without subsections
  • Simplify grammar
  • Fix elements with potentially duplicate ids

0.8.3

  • During cleanup, break lines on section titles that don't have a space after the number, eg: "New section title 4.(1) The content..."

0.8.2

  • Schedules can be empty (#10)

0.8.1

  • Schedules can have both a title and a heading, permitting schedules titled "First Schedule" and not just "Schedule 1"

0.8.0

  • FEATURE: parse command only reformats input for PDFs or when --reformat is given
  • FIX: don't error on defn tags without link to defined term

0.7.4

  • use refersTo to identify blocks containing term definitions, rather than setting an (invalid) ID

0.7.3

  • add link-definitions command to find and extract defined terms and link them to their definitions
  • exit with non-zero exit code on failure (see rails/thor#244)

0.7.2

  • add --section-number-position argument to slaw command
  • grammar supports empty chapters and parts

0.7.1

  • major changes to grammar to permit chapters, parts, sections etc. in schedules