/ruby-ll

An LL(1) parser generator for Ruby.

Primary LanguageRubyMozilla Public License 2.0MPL-2.0

ruby-ll

ruby-ll is a high performance LL(1) table based parser generator for Ruby. The parser driver is written in C/Java to ensure good runtime performance, the compiler is written entirely in Ruby.

ruby-ll was written to serve as a fast and easy to use alternative to Racc for the various parsers used in Oga. However, ruby-ll isn't limited to just Oga, you can use it to write a parser for any language that can be represented using an LL(1) grammar.

ruby-ll is self-hosting, this allows one to use ruby-ll to modify its own parser. Self-hosting was achieved by bootstrapping the parser using a Racc parser that outputs the same AST as the ruby-ll parser. The Racc parser remains in the repository for historical purposes and in case it's ever needed again, it can be found in bootstrap/parser.y.

For more information on LL parsing, see https://en.wikipedia.org/wiki/LL_parser.

Features

  • Support for detecting first/first and first/follow conflicts.
  • clang-like error/warning messages to ease debugging parsers.
  • High performance and a low memory footprint.
  • Support for the *, + and ? operators.

Requirements

Ruby Required
MRI >= 2.6.0
JRuby >= 9.0
Rubinius Not supported
Maglev Not supported
Topaz Not supported
mruby Not supported

For MRI you'll need a C90 compatible compiler such as clang or gcc. For JRuby you don't need any compilers to be installed as the .jar is packaged with the Gem itself.

When hacking on ruby-ll you'll also need to have the following installed:

  • Ragel 6 for building the grammar lexer
  • javac for building the JRuby extension

Installation

ruby-ll can be installed from RubyGems:

gem install ruby-ll

Usage

The CLI takes a grammar input file (see below for the exact syntax) with the extension .rll and turns it into a corresponding Ruby file. For example:

ruby-ll lib/my-gem/parser.rll

This would result in the parser being written to lib/my-gem/parser.rb. If you want to customize the output path you can do so using the -o / --output options:

ruby-ll lib/my-gem/parser.rll -o lib/my-gem/my-parser.rb

By default ruby-ll adds various require calls to ensure you can load the parser without having to load all of ruby-ll (e.g. the compiler code). If you want to disable this behaviour you can use the --no-requires option when processing a grammar:

ruby-ll lib/my-gem/parser.rll --no-requires

Once generated you can use the parser class like any other parser. To start parsing simply call the parse method:

parser = MyGem::Parser.new

parser.parse

The return value of this method is whatever the root rule (= the first rule defined) returned.

Parser Input

For a parser to work it must receive its input from a separate lexer. To pass input to the parser you must define the method each_token in an %inner block. This method should yield an Array containing two values:

  1. The token type as a Symbol (e.g. :T_STRING)
  2. The token value, this can be any type of value

The last Array yielded by this method should be [-1, -1] to signal the end of the input. For example:

def each_token
  yield [:T_STRING, 'foo']
  yield [:T_STRING, 'bar']
  yield [-1, -1]
end

Error Handling

Parser errors are handled by LL::Driver#parser_error. By default this method raises an instance of LL::ParserError with a message depending on the current parser context and input. If you want to customize this behaviour simply overwrite the method (e.g. in an %inner block).

Grammar Syntax

The syntax of a ruby-ll grammar file is fairly simple and consists out of directives, rules, comments and code blocks.

Directives can be seen as configuration options, for example to set the name of the parser class. Rules are, well, the parsing rules. Code blocks can be used to associate Ruby code with either a branch of a rule or a certain section of the parser (the header or its inner body).

Directives and rules must be terminated using a semicolon, this is not needed for %inner / %header blocks.

For a full example, see ruby-ll's own parser located at lib/ll/parser.rll.

Comments

Comments start with a hash (#) sign and continue until the end of the line, just like Ruby. Example:

# Some say comments are a code smell.

%name

The %name directive is used to set the full name/namespace of the parser class. The name consists out of a single identifier or multiple identifiers separated by :: (just like Ruby). Some examples:

%name A;
%name A::B;
%name A::B::C;

The last identifier is used as the actual class name. This class will be nested inside a module for every other segment leading up to the last one. For example, this:

%name A;

Gets turned into this:

class A < LL::Driver

end

While this:

%name A::B::C;

Gets turned into this:

module A
module B
class C < LL::Driver

end
end
end

By nesting the parser class in modules any constants in the scope can be referred to without requiring the use of a full namespace. For example, the constant A::B::X can just be referred to as X in the above example.

Multiple calls to this directive will result in previous values being overwritten.

%terminals

The %terminals directive is used to list one or more terminals of the grammar. Each terminal is an identifier separated by a space. For example:

%terminals A B C;

This would define 3 terminals: A, B and C. While there's no specific requirement as to how you name your terminals it's common practise to capitalize them and prefix them with T_, like so:

%terminals T_A T_B T_C;

Multiple calls to this directive will result in the terminals being appended to the existing list.

%inner

The %inner directive can be used to specify a code block that should be placed inside the parser's body, just after the section containing all parsing tables. This directive should be used for adding custom methods and such to the parser. For example:

%inner
{
  def initialize(input)
    @input = input
  end
}

This would result in the following:

class A < LL::Driver
  def initialize(input)
    @input = input
  end
end

Curly braces can either be placed on the same line as the %inner directive or on a new line, it's up to you.

Unlike regular directives this directive should not be terminated using a semicolon.

%header

The %header directive is similar to the %inner directive in that it can be used to add a code block to the parser. The code of this directive is placed just before the class definition of the parser. This directive can be used to add documentation to the parser class. For example:

%header
{
  # Hello world
}

This would result in the following:

# Hello world
class A < LL::Driver
end

Rules

Rules consist out of a name followed by an equals sign (=) followed by 1 or more branches. Each branch is separated using a pipe (|). A branch can consist out of 1 or many steps, or an epsilon. Branches can be followed by a code block starting with { and ending with }. A rule must be terminated using a semicolon.

An epsilon is represented as a single underscore (_) and is used to denote a wildcard/nothingness.

A simple example:

%terminals A;

numbers = A | B;

Here the rule numbers is defined and has two branches. If we wanted a rule that would match terminal A or nothing we'd use the following:

%terminals A;

numbers = A | _;

Code blocks can also be added:

numbers
  = A { 'A' }
  | B { 'B' }
  ;

When the terminal A would be processed the returned value would be 'A', for terminal B the returned value would be 'B'.

Code blocks have access to an array called val which contains the values of every step of a branch. For example:

numbers = A B { val };

Here val would return [A, B]. Since val is just an Array you can also return specific elements from it:

numbers = A B { val[0] };

Values returned by code blocks are passed to whatever other rule called it. This allows code blocks to be used for building ASTs and the likes.

If no explicit code block is defined then ruby-ll will generate one for you. If a branch consists out of only a single step (e.g. A = B;) then only the first value is returned, otherwise all values are returned.

This means that in the following example the output will be whatever value C contains:

A = B { p val[0] };
B = C;

However, here the output would be [C, D] as the B rule's branch contains multiple steps:

A = B { p val[0] };
B = C D;

To summarize (# => denotes the return value):

A = B;   # => B
A = B C; # => [B, C]

You can override this behaviour simply by defining your own code block.

ruby-ll parsers recurse into rules before unwinding, this means that the inner-most rule is processed first.

Branches of a rule can also refer to other rules:

numbers    = A other_rule;
other_rule = B;

The value for other_rule in the numbers rule would be whatever the other_rule below it returns.

The grammar compiler adds errors whenever it encounters a rule with the same name as a terminal, as such the following is invalid:

%terminals A B;

A = B;

It's also an error to re-define an existing rule.

Operators

Grammars can use two operators to define a sequence of terminals/non-terminals: the star (*) and plus (+) operators. There's also the ? (question) operator which can be used to indicate something as being optional.

The star operator indicates that something should occur 0 or more times. Here the "B" identifier could occur 0 times, once, twice or many more times:

A = B*;

The plus operator indicates that something should occur at least once followed by any number of more occurrences. For example, this grammar states that "B" should occur at least once but can also occur, say, 10 times:

A = B+;

The question operator can be used as an alternative to the following pattern:

# "A" or "A C"
A = B A_follow;

A_follow = C | _;

Using this operator you can simply write the following:

A = B C?;

Operators can be applied either to a single terminal/rule or a series of terminals/rules grouped together using parenthesis. For example, both are perfectly valid:

A = B+;
A = (B C)+;

When calling an operator on a single terminal/rule the corresponding entry in the val array is simply set to the terminal/rule value. For example:

A = B+ { p val[0] };

For input B B B this would output [B, B, B].

However, when grouping multiple terminals/rules using parenthesis every occurrence is wrapped in an Array. For example:

A = (B C)+ { p val[0] };

For input B C B C this would output [[B, C], [B, C]]. To work around this you can simply move the group of identifiers to its own rule and only return whatever you need:

A  = A1+ { p val[0] };
A1 = B C { val[0] }; # only return "B"

For input B C B C this would output [B, B].

Conflicts

LL(1) grammars can have two kinds of conflicts in a rule:

  • first/first
  • first/follow

first/first

A first/first conflict means that multiple branches of a rule start with the same terminal, resulting in the parser being unable to choose what branch to use. For example:

%terminals A B;

rule = A | A B;

This would result in the following output:

example.rll:5:1:error: first/first conflict, multiple branches start with the same terminals
rule = A | A B;
^
example.rll:5:8:error: branch starts with: A
rule = A | A B;
       ^
example.rll:5:12:error: branch starts with: A
rule = A | A B;
           ^

To solve a first/first conflict you'll have to factor out the common left factor. For example:

%name Example;

%terminals A B;

rule        = A rule_follow;
rule_follow = B | _;

Here the rule rule starts with terminal A and can optionally be followed by B, without introducing any first/first conflicts.

first/follow

A first/follow conflict occurs when a branch in a rule starts with an epsilon and is followed by one or more terminals and/or rules. An example of a first/follow conflict:

%name Example;

%terminals A B;

rule       = other_rule B;
other_rule = A | _;

This produces the following errors:

example.rll:5:14:error: first/follow conflict, branch can start with epsilon and is followed by (non) terminals
rule       = other_rule B;
             ^
example.rll:6:18:error: epsilon originates from here
other_rule = A | _;
                 ^

There's no specific procedure to solving such a conflict other than simply removing the starting epsilon.

Performance

One of the goals of ruby-ll is to be faster than existing parser generators, Racc in particular. How much faster ruby-ll will be depends on the use case. For example, for the benchmark benchmark/ll/simple_json_bench.rb the performance gains of ruby-ll over Racc are as following:

Ruby Speed
MRI 2.2 1.75x
Rubinius 2.5.2 3.85x
JRuby 1.7.18 6.44x
JRuby 9000 pre1 7.50x

This benchmark was run on a Thinkpad T520 laptop so it's probably best to run the benchmark yourself to see how it behaves on your platform.

Depending on the complexity of your parser you might end up with different different numbers. The above metrics are simply an indication of the maximum performance gain of ruby-ll compared to Racc.

Thread Safety

Parsers generated by ruby-ll share an internal, mutable state on a per instance basis. As a result of this a single instance of your parser can not be used by multiple threads in parallel. If it wasn't for MRI's C API (specifically due to how rb_block_call works) this wouldn't have been an issue.

To mitigate the above simply create a new instance of your parser every time you need it and have the GC clean it up once you're done. This will introduce a slight allocation overhead but it beats having to deal with race conditions.

License

All source code in this repository is subject to the terms of the Mozilla Public License, version 2.0 unless stated otherwise. A copy of this license can be found the file "LICENSE" or at https://www.mozilla.org/MPL/2.0/.

The following files are licensed under a different license:

  • ext/c/khash.h: MIT license (see source code)
  • ext/c/kvec.h: MIT license (see source code)