This is a technique to allow you to have a switch case feature that supported strings.
For example,
int foo(string i) {
switch(i) {
case "cad":
return 7;
case "bed":
return 13;
case "bad":
return 42;
default:
return 0;
}
}This method is O(m) - where m is the longest case string length. The case strings must be constant.
Rather than using an if chain - testing the input against each individual case with strcmp(), we are going to use a DFA that's encoded into a table.
Table
enum STATES: unsigned char {
INIT,
b, ba, be, c, ca, cad, bed, bad,
X,
cad_accept, bed_accept, bad_accept,
};
constexpr STATES table[7][9] = {
{ X, ba, X, X, ca, X, X, X, X}, // a
{ b, X, X, X, X, X, X, X, X}, // b
{ c, X, X, X, X, X, X, X, X}, // c
{ X, X, bad, bed, X, cad, X, X, X}, // d
{ X, be, X, X, X, X, X, X, X}, // e
{ X, X, X, X, X, X,cad_accept,bed_accept,bad_accept}, // NUL
{ X, X, X, X, X, X, X, X, X} // Anything else
};
constexpr int charclass(char ic) {
// Limit the table down to the characters you actually need. With one 'other value (5 in this example)
if (ic == 0) return 5;
if ((ic > 'e') || (ic < 'a')) return 6;
return (ic - 'a');
}This table is compressed, which is beneficial but not necessary. We are using charclass() to narrow the values of the input down to smaller range (0 to 6).
Table Element "State" Values
The values within the table represent a state within the DFA. These are split into three groups;
-
The first set represent states that are partial matches, and will continue to consume the input.
-
The second set/value, X, represents a non matching string.
-
The third set are result states, representing "cad", "bed", and "bad" in this example.
The value zero is reserved for the initial state of the DFA.
Table Rows and Columns
Rows are "characters", and columns is the current state the DFA is in. table[7][9] says we accept 7 unique characters (a,b,c,d,e,NUL,*), and there are a total of 9 partial match states.
One row is a letter in the valid letter range, defined by the charclass. E.g. a = 0, b = 1, c = 2, d = 3, e = 4, NUL = 6, anything else = 6.
Generating The Table
-
Determine the state sets, {INIT} {partial matches} {X} {matches}
-
Set the table height to be the range returned by charclass ( 0 to 6 is [7] )
-
Set the table width to be the size of the INIT + partial matches sets ( INIT, b, ba, be, c, ca, cad, bed, bad is [9] )
-
We start by filling all entries in the table with
X -
Starting at column zero (
INIT), enter any valid states that can be transitioned from this position (which arebandc). -
Now move right one column, this corresponds to state
b(the enum value is the offset into the table), now enter the valid transitions from this state position (which arebaandbe) -
Repeat step 6 until all states are filled.
Processing
index() turns a string into an integer, which corresponds to the accepted STATES table values (or 0,1,2,3 as below).
Starting at column 0 (INIT), we process the input on char at a time, we move left to right along the table.
constexpr int index(const char* s) {
auto pos = INIT;
do { pos = table[charclass(*s++)][pos]; } while(pos < X);
return pos - X; // Lets shift the output to be 0,1,2,3
}If the input s was "bed", pos would be INIT => b => be => bed => bed_accept through the loop.
If the returned state value is higher than X - if we have a non-match or a valid matching state (cad_accept, bed_accept, bad_accept), we exit the loop.
If you are going to use a jump table, you might want to shift the results so the lowest value returned is zero.
Further Thoughts
-
This method supports ranges easily, Complete the table as normal but return the same accept state at the end
-
If your string has the length precomputed, adding a test for the
if (i.length > m)might be beneficial. -
If your strings are not null terminated, you will have to adjust the loop in index() and fixup to the non-matching value