Build regular expressions with natural language.
Have you ever dealt with complex regular expressions like the following one?
var ipMatch = /(?:(?:1\d\d|2[0-4]\d|25[0-5]|[1-9]\d|\d)\.){3}(?:1\d\d|2[0-4]\d|25[0-5]|[1-9]\d|\d)\b/;
Using a meaningful variable name can help, writing comments helps even more, but what's always hard to understand is what the regular expression actually does: They're left as some sort of magic trick that it's never updated because their syntax is so obscure that even the authors themselves hardly fell like facing them again. Debugging a regular expression often means rewriting it from scratch.
RE-Build's aim is to change that, converting the process of creating a regular expression to combining nice natural language expressions. The above regex would be composed as
var ipNumber = RE.group(
RE ("1").then.digit.then.digit
.or ("2").then.oneOf.range("0", "4").then.digit
.or ("25").then.oneOf.range("0", "5")
.or .oneOf.range("1", "9").then.digit
.or .digit
),
ipMatch = RE.matching.exactly(3).group( ipNumber.then(".") )
.then(ipNumber).then.wordBoundary.regex;
This approach is definitely more verbose, but also much clearer and less error prone.
Another module for the same purpose is VerbalExpressions, but it doesn't allow to build just any regular expression. RE-Build aims to fill that gap too.
Remember, as a general rule, that RE-Build does not care if your environment doesn't support certain RegExp
features (for example, the sticky
flag or extended Unicode escaping sequences), as the corresponding source code will be generated anyway. Of course, you'll get an error trying to get a RegExp
object out of it.
Via npm
:
npm install re-build
Via bower
:
bower install re-build
The package can be loaded as a CommonJS module (node.js, io.js), as an AMD module (RequireJS, ...) or as a standalone script:
<script src="re-build.min.js"></script>
For a detailed documentation, check the reference sheet. Keep in mind that RE-Build is a tool to help building, understanding and debugging regular expressions, and does not prevent one to create incorrect results.
The core point is the RE
object (or whatever variable name you assigned to it), together with the matching
method:
var RE = require("re-build");
var builder = RE.matching("xyz");
The output is not, however, a regular expression, but a a regular expression builder that can be extended, or used as an extension for other builders. To get the corrisponding regular expression, use the regex
property or the toRegExp()/valueOf()
methods.
var start = RE.matching.theStart.then(builder).toRegExp(); // /^xyz/
var foo = RE.matching(builder).then.oneOrMore.digit.regex; // /xyz\d+/
As you can see, you can put additional matching blocks using the then
word, which is also a function that can take arguments as blocks to add too. The arguments can be strings (which are backslash-escaped), regular expressions or RE-Build'ers, whose source
property is added to the builder unescaped.
The or
word has a similar meaning, but adds an alternative block to the source:
var hex = RE.matching.digit
.or.oneOf.range("A", "F")
.regex; // /\d|[A-F]/
Regular expression builders are immutable objects, meaning that when extending a builder we get a new builder instance:
var bld1 = RE.matching.digit;
var bld2 = bld1.or.oneOf.range("A", "F");
bld1 === bld2; // => false
RE-Build uses specific names to address common regex character classes:
Name | Result | Notes |
---|---|---|
digit |
\d |
from 0 to 9 |
alphaNumeric |
\w |
digits, uppercase and lowercase letters and the underscore |
whiteSpace |
\s |
white space characters |
wordBoundary |
\b |
|
anyChar |
. |
universal matcher |
theStart |
^ |
|
theEnd |
$ |
|
cReturn |
\r |
carriage return |
newLine |
\n |
|
tab |
\t |
|
vTab |
\v |
vertical tab |
formFeed |
\f |
|
null |
\0 |
|
slash |
\/ |
|
backslash |
\\ |
|
backspace |
\b |
can be used in character sets `[...]' only |
The first four names can be negated prefixing them with not
to get the complementary meaning:
not.digit
for\D
;not.alphaNumeric
for\W
;not.whiteSpace
for\S
;not.wordBoundary
for\B
.
Single characters can be defined by escape sequences:
Function | Result | Meaning |
---|---|---|
ascii(n) |
\xhh |
ASCII character corrisponding to n |
codePoint(n) |
\uhhhh / \u{hhhhhh} |
Unicode character corrisponding to n |
control(a) |
\ca |
Control sequence corrisponding to the letter a |
With the exception of wordBoundary
, theStart
and theEnd
, all of the previous words can be used inside character sets (see after).
You can set the flags of the regex prefixing matching
with one or more of the flagging options:
globally
for a global regex;anyCase
for a case-insensitive regex;fullText
for a "multiline" regex (i.e., the dot '.
' matches new line characters too);withUnicode
for a regex with extended Unicode support;stickily
for a "sticky" regex.
Alternatively, you can set the flags with the withFlags
method of the RE
object.
// The following regexes are equivalent: /[a-f]/gi
var foo = RE.globally.anyCase.matching.oneOf.range("a", "f").regex;
var bar = RE.withFlags("gi").matching.oneOf.range("a", "f").regex;
You can't change a regex builder's flags, as builders are immutable, but you can create a copy of a builder with different flags:
var foo = RE.matching.oneOrMore.alphaNumeric; // /\w+/
var bar = RE.globally.matching(foo); // /\w+/g
If you don't need flags set, as a shortened version you can remove the matching
word:
// These are equivalent:
RE.matching("abc").then.digit;
RE("abc").then.digit;
This becomes useful when defining the content of groups, character sets or look-aheads.
Use the group
word to define a non-capturing group, and capture
for a capturing group:
var amount = RE.matching("$").then.capture(
RE.oneOrMore.digit
.then.noneOrOne.group(".", RE.oneOrMore.digit)
).regex;
// /\$(\d+(?:\.\d+)?)/
The group
and capture
words are function, and the resulting groups will embrace everything passed as arguments. Just like then
and or
, arguments can be strings, regular expression or other RE-Build'ers.
Backrefences for capturing groups are obtained using the reference
function, passing the reference number:
var quote = RE.matching.capture( RE.oneOf("'\"") )
.then.anyAmountOf.alphaNumeric
.then.reference(1);
// /(['"])\w*\1/
Character sets ([...]
) are introduced by the word oneOf
. Several characters can be included separated by the word and
. Additionally, one can include a character interval, using the function range
and giving the initial and final character of the interval.
Exclusive character sets can be obtained prefixing oneOf
by the word not
.
var hexColor = RE.matching("#").then.exactly(6)
.oneOf.digit.and.range("a", "f").and.range("A", "F");
// /#[\da-fA-F]{6}/
var hours = RE.oneOf("01").then.digit.or("2").then.oneOf.range("0", "3");
// /[01]\d|2[0-3]/
var quote = RE.matching('"').then.oneOrMore.not.oneOf('"').then('"');
// /"[^"]+"/
Quantifiers can be defined prefixing the quantified block by one of these constructs:
Construct | Result |
---|---|
anyAmountOf |
* |
oneOrMore |
+ |
noneOrOne |
? |
atLeast(n) |
{n,} |
atMost(n) |
{,n} |
exactly(n) |
{n} |
between(n, m) |
{n,m} |
Quantification is smart enough to translate constructs in their most compact form (e.g., .atLeast(1)
becomes +
, .between(0, 1)
becomes ?
and so on).
Lazy quantifiers can be obtained prefixing the word lazily
prior to the quantifier.
var number = RE.oneOrMore.digit; // /\d+/
var hexnumber = RE.exactly(2).oneOf.digit.and.range("a", "f");
// /[\da-f]{2}/
var macAddress = RE.anyCase.matching(hexnumber).then.exactly(5).group(
RE("-").then(hexnumber)
);
// /[\da-f]{2}(?:-[\da-f]{2}){5}/i
var quoteAlt = RE.matching.capture(RE.oneOf("'\""))
.then.lazily.anyAmountOf.anyChar
.then.reference(1);
// /(['"]).*?\1/
Look-aheads are introduced by the function followedBy
(eventually prefixed by not
for negative look-aheads).
var euro = RE.matching.oneOrMore.digit.followedBy("€");
// /\d+(?=€)/
var foo = RE("a").or.not.followedBy("b").then("c");
// /a|(?!b)c/
- Internet Explorer 9+
- Firefox 4+
- Safari 5+
- Chrome
- Opera 11.60+
- node.js
Basically, every Javascript environment that supports Object.defineProperties
should be fine.
The unit tests are built on top of mocha. Once the package is installed, run npm install
from the package's root directory in order to locally install mocha, then npm run test
to execute the tests. Open index.html with a browser to perform the tests on the client side.
If mocha is installed globally, served side tests can be run with just the command mocha
from the package's root directory.
- More natural language alternatives
- Plurals, articles
- CLI tool to translate regexes to and from RE-Build's syntax
- More examples
- Consider IE8 support
MIT @ Massimo Artizzu 2015-2016. See LICENSE.