/thesis

Lexical Analysis using a Bimachine

Primary LanguageC#

Lexical Analysis using a Bimachine

Lexer Generator API

[Implementation]

Usage

Generate a lexer by providing a set of lexical rules.

var lexer = Lexer.Create(new[]
{
    new Rule("[0-9]+\\.?[0-9]*", "NUM"),
    new Rule("[+*/-]", "OP"),
    new Rule("=", "EQ"),
    new Rule("[ \t\r\n]", "WS")
});

Set the input either as a string

lexer.Input = new InputStream("3.14+1.86=5");

or as a stream.

using var reader = new StreamReader("path/to/file.txt");
lexer.Input = new InputStream(reader);

Perform the tokenization.

foreach (Token token in lexer.GetNextToken())
    Console.WriteLine(token);
3.14+1.86=5
[@0,0:3='3.14',<NUM>]
[@1,4:4='+',<OP>]
[@2,5:8='1.86',<NUM>]
[@3,9:9='=',<EQ>]
[@4,10:10='5',<NUM>]

Lexer.GetNextToken() returns an iterator which yields a sequence of tokens one at a time. The Token class contains the following properties:

public class Token
{
    public int Index { get; set; }
    public (int Start, int End) Position { get; set; }
    public string Text { get; set; }
    public string Type { get; set; }

    public override string ToString() =>
        $"[@{Index},{Position.Start}:{Position.End}='{Text}',<{Type}>]";
}

Import/Export

Depending on the grammar, the lexer construction might take some time. Hence, once built, it can be exported as a binary file and then loaded in a project.

var lexer = Lexer.Create(grammar);
lexer.ExportToFile("lexout");
var lexer = Lexer.LoadFromFile("lexout");

Example Lexers

[Tests]

  • Arithmetic expression
  • JSON
  • Regular expression
  • Tokenizer for the English language

Building Blocks

Regular Expression Parser

[Implementation] [Tests]

Supported syntax

Expression Example
Concatenation ab
Union a|b
Zero-or-more a*
One-or-more a+
Optional a?
Grouping (a|b)*c
Char class [a-z0-9,]
Negative char class [^a-z0-9,]
Match exactly 2 'a's a{2}
Match at least 2 'a's a{2,}
Match between 2 and 4 'a's a{2,4}
Escaping (matches "a*") a\*
Escaping in char classes (match 'a', '-' or 'z') [a\-z]

Finite-state device construction and operations

Text Rewriters based on Regular Relations

[Implementations] [Tests]

  • Optional rewrite transducer
  • Obligatory rewrite transducer
  • Leftmost-longest match rewrite transducer

Misc

Example usages of the APIs can be found in the unit tests' project.

GraphViz

The finite state devices Fsa, Dfsa and Fst can be exported as GraphViz.

var fst = new RegExp("a+").Automaton
  .Product(new RegExp("b").Automaton)
  .ToRealTime()
  .Transducer
  .PseudoMinimal();

Console.WriteLine(fst.ToGraphViz());
digraph G {
    rankdir=LR; size="8,5"
    node [shape=circle]
    -1 [label= "", shape=none,height=.0,width=.0];
    -1 -> 0;
    0 -> 1 [label="a/b"]
    0 -> 0 [label="a/ε"]
    1 -> 1 [label="a/ε"]
    1 [shape = doublecircle]
}

References