An experimental regular expression library

Primary language: Java. License: Apache License 2.0 (Apache-2.0).

SimpleRegex

An experimental regular expression library based on Thompson NFA. The idea was inspired by Russ Cox's article “Regular Expression Matching Can Be Simple And Fast”.

Build

This project is written in Java and built with Gradle; the regular expression parser is generated by Bison. As a result, a JDK, Gradle, and Bison are required to build it.

# Using gradle wrapper
./gradlew build		# Build
./gradlew jar		# Generate jar
# Using gradle
gradle build		# Build
gradle jar			# Generate jar

Build configuration is located in com.sine_x.regex.BuildConfig:

package com.sine_x.regex;

public class BuildConfig {

    public static final boolean LOG = false;	// Logger switch
}

Test

The library is tested with strings generated from a set of regular expressions and with "shuffled" strings. The regular expressions used are listed in test/regex.txt.

# Using gradle wrapper
./gradlew driver
# Using gradle
gradle driver

Usage

  1. Create a pattern using public static Pattern Pattern.compile(String)
  2. Match a string using public boolean Pattern.match(String)

Note: com.sine_x.regex.RegexException should be handled by the caller.

Pattern pattern = Pattern.compile("(Ha)+");
boolean result = pattern.match("HaHaHa");

Features

  • Based on Thompson NFA simulation, without backtracking
  • DFA cache
  • Regular expression parsing with Bison
  • Supports wildcards, character sets, repetition, character classes, and Unicode characters
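
To illustrate what "Thompson NFA without backtracking" means in practice, here is a minimal, self-contained sketch. It is not taken from SimpleRegex's source: the class NfaSketch, its State type, and the hand-built automaton for (Ha)+ are all hypothetical, and whole-string matching is an assumption. The point is only that the matcher tracks a set of live states and advances all of them once per input character, so the work is bounded by (input length × number of states).

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; these are not SimpleRegex's actual classes.
final class NfaSketch {

    // CHAR consumes the character c and moves to out.
    // SPLIT consumes nothing and forks to out and out1.
    // MATCH is the accepting state.
    static final class State {
        enum Kind { CHAR, SPLIT, MATCH }
        final Kind kind;
        final char c;
        State out, out1;
        State(Kind kind, char c) { this.kind = kind; this.c = c; }
    }

    // Hand-built NFA for (Ha)+ : H -> a -> SPLIT(back to H | MATCH).
    static State buildHaPlus() {
        State match = new State(State.Kind.MATCH, '\0');
        State h = new State(State.Kind.CHAR, 'H');
        State a = new State(State.Kind.CHAR, 'a');
        State split = new State(State.Kind.SPLIT, '\0');
        h.out = a;
        a.out = split;
        split.out = h;      // loop back for another "Ha"
        split.out1 = match; // or stop here
        return h;
    }

    // Adds s to the set and follows SPLIT states, which consume no input.
    static void addState(Set<State> set, State s) {
        Deque<State> stack = new ArrayDeque<>();
        stack.push(s);
        while (!stack.isEmpty()) {
            State cur = stack.pop();
            if (!set.add(cur)) continue; // already in the set
            if (cur.kind == State.Kind.SPLIT) {
                stack.push(cur.out);
                stack.push(cur.out1);
            }
        }
    }

    // Whole-string match: step every live state once per input character.
    static boolean matches(State start, String input) {
        Set<State> current = new HashSet<>();
        addState(current, start);
        for (int i = 0; i < input.length(); i++) {
            char ch = input.charAt(i);
            Set<State> next = new HashSet<>();
            for (State s : current) {
                if (s.kind == State.Kind.CHAR && s.c == ch) addState(next, s.out);
            }
            current = next;
        }
        for (State s : current) {
            if (s.kind == State.Kind.MATCH) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        State nfa = buildHaPlus();
        System.out.println(matches(nfa, "HaHaHa")); // true
        System.out.println(matches(nfa, "HaH"));    // false
    }
}
```

A DFA cache, as listed above, would typically memoize each distinct set of live states together with its per-character successor, so that repeated sets are not recomputed; the sketch omits this for brevity.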

Syntax

Single-character expressions
  .         any character, possibly including newline (s=true)
  [xyz]     character class
  [^xyz]    negated character class
  \d        Perl character class
  \D        negated Perl character class

Composites
  xy        x followed by y
  x|y       x or y (prefer x)

Repeating
  x*        zero or more x
  x+        one or more x
  x?        zero or one x
  x{n,m}    n or n+1 or ... or m x
  x{n,}     n or more x
  x{n}      exactly n x

Perl character classes (all ASCII-only)
  \d        digits ([0-9])
  \D        not digits ([^0-9])
  \w        word characters ([0-9A-Za-z_])
  \W        not word characters ([^0-9A-Za-z_])
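
Because the Perl character classes are ASCII-only, they behave differently from Java's Unicode-aware helpers such as Character.isDigit. The sketch below spells out the four classes as plain predicates. The class AsciiClasses and its method names are hypothetical and are not part of SimpleRegex's API; they only mirror the definitions listed above.

```java
// Hypothetical helper; not part of SimpleRegex's API.
final class AsciiClasses {

    // \d : [0-9]
    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // \w : [0-9A-Za-z_]
    static boolean isWord(char c) {
        return isDigit(c)
                || (c >= 'A' && c <= 'Z')
                || (c >= 'a' && c <= 'z')
                || c == '_';
    }

    // \D and \W are the plain negations of \d and \w.
    static boolean isNotDigit(char c) { return !isDigit(c); }
    static boolean isNotWord(char c) { return !isWord(c); }

    public static void main(String[] args) {
        char arabicIndicThree = '\u0663'; // a Unicode decimal digit outside ASCII
        char eAcute = '\u00e9';           // a Latin letter outside ASCII

        System.out.println(isDigit('7'));                        // true
        System.out.println(isDigit(arabicIndicThree));           // false
        System.out.println(Character.isDigit(arabicIndicThree)); // true
        System.out.println(isWord('_'));                         // true
        System.out.println(isWord(eAcute));                      // false
    }
}
```

Under these definitions a non-ASCII digit or letter falls into \D and \W, even though the library can otherwise match Unicode characters literally.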

Note

Bibliography