A tiny library for writing tiny parsers in C#. The core idea is building parsers from simple rules.
For example, parsing a basic quoted string:
```csharp
using static PageOfBob.Parsing.Rules;
using static PageOfBob.Parsing.StringRules;

// Define the two special characters: " and \
var QUOTE = Char('"');
var ESCAPE = Char('\\');

// Define classes of characters: those that require an escape, and those that do not.
var charThatRequiresEscape = Any(QUOTE, ESCAPE);
var charThatDoesNotRequireEscape = charThatRequiresEscape.Not();

// Define what an escaped character looks like
// (a slash followed by something that requires an escape).
var escapedChar = ESCAPE
    .ThenKeep(charThatRequiresEscape);

// Define the characters in the "middle" of the quoted string
// (either escaped characters or characters that do not require an escape).
// Accept zero or more of those characters,
// then join all those characters into a string.
var center = Any(charThatDoesNotRequireEscape, escapedChar).Many().JoinText();

// Define the final rule: a quote, the middle of the string, then a closing quote.
var quotedStringRule = QUOTE
    .ThenKeep(center)
    .ThenIgnore(QUOTE);

return quotedStringRule;
```
And here is how you would use this rule:
```csharp
var input = Sources.CharSource("\"this is a test\"");
var result = quotedStringRule(input);
var finalValue = result.Match(
    fail => {
        Console.Error.WriteLine($"Failed: {fail.Message}");
        return null;
    },
    success => {
        Console.Out.WriteLine($"Parsed String: {success.Value}");
        return success.Value;
    });
```
Obligatory Disclaimer: This parsing library is built to make writing small parsers directly in C# simple. It is not optimized for speed or memory usage. It works well for parsing relatively simple objects in memory. For very large token streams/object graphs, or where performance is critical, other solutions are probably more appropriate.
Parsers are built from simple rules. A parser rule is a method that takes a token source (`ISource<TToken>`) and returns a result (`IResult<TToken, TOutput>`). In general, rules should be pure functions.

A token source can either be empty (`Empty<TToken>`) or non-empty (`Source<TToken>`). A non-empty source contains a single token (`TToken Token { get; }`), a method to get the next token from the stream (`ISource<TToken> Next()`), and the current position (`long Position { get; }`) in the stream (for some streams this may not be applicable). An empty source contains only the position (`long Position { get; }`), which should be equal to the length of the stream, if the stream has a length.

A result (`IResult<TToken, TOutput>`) can either be a failure (`Failure<TToken, TValue>`), which contains only an error message (`string Message { get; }`), or a success (`Success<TToken, TValue>`), which contains a parsed value (`TValue Value { get; }`) and the next token source from the stream (`ISource<TToken> Next { get; }`).
Rules are then combined to form complex parsers built from simple blocks. Here is a naïve example of a simple rule that accepts any even digit:
```csharp
var cast = input as Source<char>;
if (cast == null)
    return Result.Fail<char>("Input is empty");

var c = cast.Token;
if (c != '2' && c != '4' && c != '6' && c != '8')
    return Result.Fail<char>("Expected an even digit");

return Result.Success(c, cast.Next());
```
This example has room for improvement. Casting the input is not necessary, because an extension method lets you treat a source (`ISource<TToken>`) like a discriminated union. This could be made shorter as:
```csharp
return input.Match(
    // Fail if the source is empty.
    () => Result.Fail<char>("Input is empty"),
    // Source is not empty; check its token against known-good tokens.
    source =>
    {
        if (source.Token != '2' && source.Token != '4' && source.Token != '6' && source.Token != '8')
            return Result.Fail<char>("Expected an even digit");
        return Result.Success(source.Token, source.Next());
    });
```
But even this can be made shorter using the built-in rules. This rule can be written a couple of ways:

```csharp
// Check using Rules.Match and a delegate.
var evenDigit = Rules.Match<char>(x => x == '2' || x == '4' || x == '6' || x == '8');

// Check using Any to test a rule for each possible char.
var evenDigit = Rules.Any(StringRules.Char('2'), StringRules.Char('4'), StringRules.Char('6'), StringRules.Char('8'));
```
Note that in this example, `Rules.Match` and `StringRules.Char` both already include an empty-source check as part of their logic, so that check can be omitted. These can be made slightly shorter by `using static` the `Rules` and `StringRules` classes:

```csharp
using static PageOfBob.Parsing.Rules;
using static PageOfBob.Parsing.StringRules;

var evenDigit = Match<char>(x => x == '2' || x == '4' || x == '6' || x == '8');
// OR:
var evenDigit = Any(Char('2'), Char('4'), Char('6'), Char('8'));
```
An `ISource<T>` is an interface to a source of tokens that parsing rules will understand. Below are the basic Source implementations provided:

- `CharSource` - Accepts a string and returns `char` tokens. If working with an entire string in memory, this is the one you want to use.
- `ListSource<T>` - Accepts an `IList<T>` and returns `T` tokens.
- `EnumerableSource<T>` - Accepts an `IEnumerable<T>` and returns `T` tokens. Note: in order to support backtracking, this source can potentially keep a reference to every token in the stream. If possible, prefer a more specific source (such as `ListSource<T>` or `CharSource`). This source is most appropriate when used with the `Tokenize` extension method (since `Tokenize` discards processed input as it yields tokens).
- `StreamSource<byte>` - Untested - In theory, this works with a seekable stream and only buffers about 4K into memory at a time, but this is completely untested and may have terrible performance characteristics, weird side effects, or unfathomable bugs. You have been warned.
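As a quick sketch of constructing sources (only `Sources.CharSource` appears in the examples in this README; the constructor-style creation of `ListSource<T>` below is an assumption based on the class names above — check the library for the exact factory):

```csharp
// A source over an in-memory string (as shown in the quoted-string example).
ISource<char> fromString = Sources.CharSource("hello");

// A source over a list of tokens; hypothetical construction.
ISource<int> fromList = new ListSource<int>(new List<int> { 1, 2, 3 });
```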
Below is a list of built-in rules. Probably the fastest way to familiarize yourself with these rules and how to combine them would be to look through the unit tests.
- `Rule<T, T> Match<T>(Func<T, bool> match, string message = null)` - Basis for most single-token parsing rules. Succeeds for a non-empty source when `match` returns true; otherwise fails.
- `Rule<T, T> Any<T>()` - Always succeeds with a non-empty source and consumes/returns one token.
- `Rule<T, T> Match<T>(T value, string message = null)` - Succeeds if the source is non-empty and the token matches `value` exactly.
- `Rule<T, K[]> Sequence<T, K>(IEnumerable<Rule<T, K>> rules, string message = null)` - Each rule in the `rules` sequence must succeed.
- `Rule<T, O> Map<T, I, O>(this Rule<T, I> rule, Func<I, O> map)` - If `rule` succeeds, the value from the `rule` result is transformed by the `map` function.
- `Rule<T, K> Any<T, K>(params Rule<T, K>[] rules)` - Returns the first successful result from `rules`; otherwise fails.
- `Rule<T, O> Then<T, L, R, O>(this Rule<T, L> left, Rule<T, R> right, Func<L, R, O> map)` - Attempts the `left` rule, then, if that succeeds, the `right` rule. If both rules succeed, the results from each are passed to `map` to create the final result. Otherwise, fails.
- `ThenIgnore<T, L, R>(this Rule<T, L> left, Rule<T, R> right)` - If both rules succeed, returns the result of the `left` rule (ignores the value from the `right` rule).
- `Rule<T, R> ThenKeep<T, L, R>(this Rule<T, L> left, Rule<T, R> right)` - If both rules succeed, returns the result of the `right` rule (keeps the value from the `right` rule).
- `Rule<T, K> ThenSet<T, R, K>(this Rule<T, K> left, Rule<T, R> right, Action<K, R> action)` - If both rules succeed, calls `action` to mutate the result of the `left` rule using the value from the `right` rule. Useful for building complex objects, even if it's not exactly a pure function.
- `Always<T, K>(Func<K> action)` - Always succeeds with the value of `action` without consuming any input.
- `Rule<T, K[]> Many<T, K>(this Rule<T, K> rule, bool required = false, string message = null)` - Applies `rule` repeatedly until it fails and returns all resulting values as an array. If `required` is true, `rule` must match at least once. If `required` is false, this will succeed even if `rule` never succeeds, returning an empty array as the value.
- `Rule<T, T> Not<T>(this Rule<T, T> rule, string message = null)` - Only works on token matching; returns any token that does not succeed against `rule`.
- `Optional<T, K>(this Rule<T, K> rule, K defaultValue)` - Makes `rule` optional. If `rule` does not succeed, a success value of `defaultValue` is returned and no input is consumed.
- `WithMessage<T, K>(this Rule<T, K> rule, string message)` - Allows one to override the error message of any rule.
- `ThenEnd<T, K>(this Rule<T, K> rule)` - Succeeds only if `rule` succeeds and the input is empty afterwards.
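As an illustration of how these combine, here is a sketch of a small signed-integer parser built from the rules above (it also borrows `IsDigit` and `JoinText` from the `StringRules` section below; the exact composition is an assumption, not library-provided code):

```csharp
using static PageOfBob.Parsing.Rules;
using static PageOfBob.Parsing.StringRules;

// One or more digits, joined into a string, then mapped to an int.
var number = IsDigit
    .Many(required: true)
    .JoinText()
    .Map(int.Parse);

// An optional leading minus sign, defaulting to '+' when absent.
var sign = Char('-').Optional('+');

// Combine sign and digits into a signed integer.
var signedNumber = sign.Then(number, (s, n) => s == '-' ? -n : n);
```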
These rules are in the `StringRules` static class:

- `Rule<char, char> Char(char c)` - Matches a specific character (case-sensitive).
- `Rule<char, char> IChar(char c)` - Matches a specific character (case-insensitive).
- `Rule<char, char> IsLetter` - Matches any letter.
- `Rule<char, char> IsDigit` - Matches any digit.
- `Rule<char, string> Text(string text, string message = null)` - Matches a string of text (case-sensitive).
- `Rule<char, string> IText(string text, string message = null)` - Matches a string of text (case-insensitive).
- `Rule<char, string> JoinText(this Rule<char, char[]> rule)` - Maps a char array result into a string.
- `Rule<char, string> ManyAsString(this Rule<char, char> rule, bool required = false, string message = null)` - The same as `Many().JoinText()`, but uses a `StringBuilder` rather than an `IList<char>` and `string.Join`.
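For example, a sketch of a case-insensitive boolean parser built from these rules (the composition is illustrative, not part of the library):

```csharp
using static PageOfBob.Parsing.Rules;
using static PageOfBob.Parsing.StringRules;

// Matches "true" or "false" in any casing and maps the match to a bool.
var boolean = Any(
    IText("true").Map(_ => true),
    IText("false").Map(_ => false));

var result = boolean(Sources.CharSource("TRUE"));
```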
Like `ISource<TToken>`, `IResult<TToken, TOutput>` has an extension method that allows one to treat it like a discriminated union. For example:
```csharp
var results = rule(input);
bool wasSuccess = results.Match(
    fail => {
        Console.Error.WriteLine("Oh no! " + fail.Message);
        return false;
    },
    success => {
        Console.Out.WriteLine("Success! " + success.Value.ToString());
        return true;
    });
```
- `IEnumerable<K> Tokenize<T, K>(this Rule<T, K> rule, ISource<T> source, bool throwOnFailure = true)` - Executes `rule` against `source` until it runs out of input, yielding results. If `rule` fails and `throwOnFailure` is true, it will throw a `FormatException`. If `throwOnFailure` is false, it will stop yielding results and ignore any remaining input. This may be useful for building SAX-like lexers that parse fragments without building a complete object graph.
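For example, a sketch of a tiny lexer that splits runs of digits and runs of letters into string tokens (the token rule here is illustrative, built from the rules documented above):

```csharp
using static PageOfBob.Parsing.Rules;
using static PageOfBob.Parsing.StringRules;

// A token is either a run of digits or a run of letters.
var token = Any(
    IsDigit.ManyAsString(required: true),
    IsLetter.ManyAsString(required: true));

// Lazily yields each token ("abc", "123", "def") as it consumes the source.
foreach (var t in token.Tokenize(Sources.CharSource("abc123def")))
    Console.WriteLine(t);
```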