SimpleRegex
An experimental regular expression library based on Thompson NFA simulation. The idea was inspired by Russ Cox's article “Regular Expression Matching Can Be Simple And Fast”.
Build
This project is written in Java and built with Gradle; the regular expression parser is generated by Bison. As a result, a JDK, Gradle and Bison are required to build it.
# Using gradle wrapper
./gradlew build # Build
./gradlew jar # Generate jar
# Using gradle
gradle build # Build
gradle jar # Generate jar
Build configurations are located in com.sine_x.regex.BuildConfig:
package com.sine_x.regex;
public class BuildConfig {
    public static final boolean LOG = false; // Logger switch
}
Test
The test driver checks the library against strings generated from a set of regular expressions and against "shuffled" strings. The regular expressions used are located in test/regex.txt.
# Using gradle wrapper
./gradlew driver
# Using gradle
gradle driver
Usage
- Create a pattern using public static Pattern Pattern.compile(String)
- Match a string using public boolean Pattern.match(String)
PS: com.sine_x.regex.RegexException should be handled.
Pattern pattern = Pattern.compile("(Ha)+");
boolean result = pattern.match("HaHaHa");
Features
- Based on Thompson NFA, without backtracking
- DFA cache
- Regular expression parser generated with Bison
- Supports wildcards, character sets, repetition, character classes and Unicode characters
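As an illustration of the first feature, here is a minimal sketch of matching without backtracking by tracking the set of active NFA states. It is not the library's code: ThompsonNfaSketch, State and haPlus() are hypothetical names, the NFA for (Ha)+ is wired by hand instead of being produced by the Bison parser, and whole-string matching is assumed.

```java
import java.util.BitSet;

/** Hypothetical sketch of Thompson-style NFA simulation; not the library's classes. */
final class ThompsonNfaSketch {
    enum Kind { CHAR, SPLIT, MATCH }

    static final class State {
        final Kind kind;
        final char symbol; // used by CHAR states only
        final int out1;    // next state (CHAR) or first branch (SPLIT)
        final int out2;    // second branch (SPLIT only)

        State(Kind kind, char symbol, int out1, int out2) {
            this.kind = kind;
            this.symbol = symbol;
            this.out1 = out1;
            this.out2 = out2;
        }
    }

    private final State[] states;
    private final int start;

    ThompsonNfaSketch(State[] states, int start) {
        this.states = states;
        this.start = start;
    }

    /** Adds state s to the set and follows SPLIT (epsilon) edges from it. */
    private void addState(BitSet set, int s) {
        if (set.get(s)) {
            return; // already present; this also stops epsilon cycles
        }
        set.set(s);
        State st = states[s];
        if (st.kind == Kind.SPLIT) {
            addState(set, st.out1);
            addState(set, st.out2);
        }
    }

    /** Returns true if the whole input is accepted by the NFA. */
    boolean match(String input) {
        BitSet current = new BitSet(states.length);
        addState(current, start);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            BitSet next = new BitSet(states.length);
            // Advance every active state at once instead of trying them one by one.
            for (int s = current.nextSetBit(0); s >= 0; s = current.nextSetBit(s + 1)) {
                State st = states[s];
                if (st.kind == Kind.CHAR && st.symbol == c) {
                    addState(next, st.out1);
                }
            }
            current = next;
        }
        for (int s = current.nextSetBit(0); s >= 0; s = current.nextSetBit(s + 1)) {
            if (states[s].kind == Kind.MATCH) {
                return true;
            }
        }
        return false;
    }

    /** Hand-built NFA for (Ha)+ : 0 -H-> 1 -a-> 2, state 2 splits to 0 or 3, state 3 matches. */
    static ThompsonNfaSketch haPlus() {
        State[] s = {
            new State(Kind.CHAR, 'H', 1, -1),
            new State(Kind.CHAR, 'a', 2, -1),
            new State(Kind.SPLIT, '\0', 0, 3),
            new State(Kind.MATCH, '\0', -1, -1),
        };
        return new ThompsonNfaSketch(s, 0);
    }
}
```

Each input character is processed once against a set of at most states.length states, so for a fixed pattern the matching time grows linearly with the input length.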
Syntax
kinds of single-character expressions | examples |
---|---|
any character, possibly including newline (s=true) | . |
character class | [xyz] |
negated character class | [^xyz] |
Perl character class | \d |
negated Perl character class | \D |
Composites | |
---|---|
xy | x followed by y |
x\|y | x or y (prefer x) |
Repeating | |
---|---|
x* | zero or more x |
x+ | one or more x |
x? | zero or one x |
x{n,m} | n or n+1 or ... or m x |
x{n,} | n or more x |
x{n} | exactly n x |
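The counted forms in the table above can be read as shorthand for the simpler operators. The sketch below rewrites x{n,m}, x{n,} and x{n} using only concatenation, ? and *, which is one common way for an NFA-based engine to handle them. Whether SimpleRegex does this internally is not stated here; RepeatExpansionSketch and expand are hypothetical names, and a negative max is an assumed convention for "no upper bound".

```java
/** Hypothetical helper that expands counted repetition into basic operators. */
final class RepeatExpansionSketch {

    /**
     * Expands atom{min,max} into an equivalent pattern.
     * A negative max means "no upper bound", i.e. atom{min,}.
     */
    static String expand(String atom, int min, int max) {
        if (min < 0 || (max >= 0 && max < min)) {
            throw new IllegalArgumentException("invalid bounds: " + min + "," + max);
        }
        String group = "(" + atom + ")"; // group so multi-character atoms repeat as a unit
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < min; i++) {
            sb.append(group); // the mandatory copies
        }
        if (max < 0) {
            sb.append(group).append('*'); // unbounded tail
        } else {
            for (int i = min; i < max; i++) {
                sb.append(group).append('?'); // each extra copy is optional
            }
        }
        return sb.toString();
    }
}
```

For example, expand("x", 2, 4) yields (x)(x)(x)?(x)? and expand("x", 2, -1) yields (x)(x)(x)*.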
Perl character classes (all ASCII-only) | |
---|---|
\d | digits ([0-9]) |
\D | not digits ([^0-9]) |
\w | word characters ([0-9A-Za-z_]) |
\W | not word characters ([^0-9A-Za-z_]) |
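To make "ASCII-only" concrete, the following sketch spells out the ranges from the table as plain predicates. PerlClassSketch, isDigit and isWord are hypothetical names and not part of the library; the sketch only mirrors the definitions listed above.

```java
/** Hypothetical predicates mirroring the ASCII-only definitions of \d and \w. */
final class PerlClassSketch {

    /** \d : [0-9] only. Unlike Character.isDigit, non-ASCII digits are rejected. */
    static boolean isDigit(int c) {
        return c >= '0' && c <= '9';
    }

    /** \w : [0-9A-Za-z_] only. */
    static boolean isWord(int c) {
        return isDigit(c)
                || (c >= 'A' && c <= 'Z')
                || (c >= 'a' && c <= 'z')
                || c == '_';
    }
}
```

Under these definitions \D and \W are simply the negations, so a character such as the Arabic-Indic digit U+0663 falls under \D even though Character.isDigit accepts it.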