/RegExtract

Clean & simple idiomatic C# RegEx-based line parser that emits strongly typed results.

Primary LanguageC#MIT LicenseMIT

RegExtract

Quick and dirty idiomatic C# line parser that emits typed ValueTuples.

dotnet NuGet Downloads License

Table of Contents

Usage Examples

Date from email header

DateTime date = "Date: Mon, 7 Dec 2020 19:43:24 -0800".Extract<DateTime>(@"Date: (.*)");

List of words

List<string> words = "The quick brown fox jumped over the lazy dogs.".Extract<List<string>>(@"(?:(\w+)\W*)+");

Alternation

var (n,s) = "str".Extract<(int?,string)>(@"(\d+)|(.*)");

Parsing fields

var (n1,y1,e1) = "Hello, earthling, from 2077!".Extract<ParseResult>(@"Hello, (.*), from (?:(\d+)|(.*))!");
var (n2,y2,e2) = "Hello, martian, from earth!".Extract<ParseResult>(@"Hello, (.*), from (?:(\d+)|(.*))!");

record ParseResult(string name, int? year, string loc);

Enums and Flags

var mode = "OpenOrCreate".Extract<FileMode>(@".*");
var flags = "Public,Static".Extract<BindingFlags>(".*");

History

This project came about during day 2 of Advent of Code 2020. The task involved parsing a strings that looked something like:

1-3 a: abcde
1-3 b: cdefg
2-9 c: ccccccccc

From each one, I needed two numbers, a character, and a string. 25 years ago I would have written the following absolutely trivial C code that would take care of parsing it:

int lo, hi; char ch; char pwd[50];

sscanf(line, "%d-%d %c: %s", &lo, &hi, &ch, pwd);

It bothered me that the C to parse this was so much simpler than what I would wirte in C#. For contests, like Advent of Code or Google Code Jam, I usually make extensive use of .Split() a la:

(int lo, int hi, char ch, string pwd) ParseLine(string line)
{
    var splits = line.Split(" ");
    var nums = splits[0].Split("-").Select(int.Parse).ToArray();
    return (nums[0], nums[1], splits[1][0], splits[2]);
}

But that's fiddly to write, and even fiddlier to read. It's good enough for contest conditions, but leaves lots to be desired for sharing solutions and discussing approaches with other competitors.

Undoubtedly, a regular expression has most of the simplicity of that sscanf template from above--you can see what the extra characters are in the template, and you can see what's being extracted:

@"(\d+)-(\d+) (.): (.*)"

Unfortunately .NET Regex Matches leave a lot of fiddling after matching to get to the simple types that you want to use for computation:

(int lo, int hi, char ch, string pwd) ParseLine(string line)
{
    var match = Regex.Match(line, @"(\d+)-(\d+) (.): (.*)");

    return (int.Parse(match.Groups[1].Value),
            int.Parse(match.Groups[2].Value),
            match.Groups[3].Value[0],
            match.Groups[4].Value);
}

So I set out to design the best possible C# syntax that got me somewhere near the expressiveness and simplicity of the sscanf example from the 1960s.

Here's where I settled:

var (lo, hi, ch, pwd) = line.Extract<(int, int, char, string)>(@"(\d+)-(\d+) (.): (.*)");

From there I added support for types with friendly constructors:

var (lo, hi, ch, pwd) = line.Extract<template>(@"(\d+)-(\d+) (.): (.*)");

record template(int lo, int hi, char ch, string pwd);

And support for extracting named groups to properties:

var result = line.Extract<template>(@"(?<lo>\d+)-(?<hi>\d+) (?<ch>.): (?<pwd>.*)");

class template {
    public int lo { get; set; }
    public int hi { get; set; }
    public char ch { get; set; }
    public string pwd { get; set; }
}