/opal

Primary language: C++. License: Apache License 2.0 (Apache-2.0).

regex

# comments can appear anywhere
# groups are (?:...), captures are ()

# regex for a.b+
START CAPTURE 0
MATCH 1 CHAR A
MATCH 1 WILD
MATCH + CHAR B
END CAPTURE 0

# my dfa definition
s0->s1;char1,char2,char3
# nfa is same, with epsilon

# regex, dots used to simplify parsing
(a.b.c*)|a
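
The explicit-dot form above can be produced mechanically from ordinary syntax. Below is a minimal sketch, assuming single-character literals and only the operators `|`, `*`, `(` and `)`. The name `insertConcatDots` is hypothetical and not a function from this repo. Since `.` is used as the concat operator here, a wildcard would need a different spelling, which this sketch does not address.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from this repo): makes concatenation explicit by
// inserting '.' between adjacent atoms, e.g. "ab*c" becomes "a.b*.c".
// Assumes only single-character literals, '|', '*', '(' and ')'.
std::string insertConcatDots(const std::string& pattern) {
    std::string out;
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        const char cur = pattern[i];
        out += cur;
        if (i + 1 == pattern.size()) break;
        const char next = pattern[i + 1];
        // The left side can end an atom unless it is '(' or '|'.
        const bool leftEndsAtom = cur != '(' && cur != '|';
        // The right side can start an atom unless it is ')', '|' or '*'.
        const bool rightStartsAtom = next != ')' && next != '|' && next != '*';
        if (leftEndsAtom && rightStartsAtom) out += '.';
    }
    return out;
}
```

For example, `insertConcatDots("(abc*)|a")` yields `(a.b.c*)|a`, the pattern shown above.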

Implement efficient regex matching by compiling the pattern to my simple regex format, then to an NFA, then to a DFA, then pruning the DFA and building an FSM in C. Wildcards can be handled with special states in the DFA.
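
The NFA to DFA step of that pipeline is usually done with the powerset (subset) construction. The following is a rough sketch of that step only, under stated assumptions: the `Nfa` and `Dfa` structs, `kEpsilon`, `epsilonClosure` and `toDfa` are hypothetical names invented for illustration, states are plain integers, and wildcards and DFA pruning are left out.

```cpp
#include <cstddef>
#include <map>
#include <queue>
#include <set>
#include <vector>

// Hypothetical layouts (assumptions, not this repo's types).
// States are 0..n-1; the label kEpsilon marks an epsilon transition.
constexpr char kEpsilon = '\0';

struct Nfa {
    int start = 0;
    std::set<int> accepting;
    std::vector<std::multimap<char, int>> edges;  // edges[s]: label -> target
};

struct Dfa {
    int start = 0;
    std::set<int> accepting;
    std::vector<std::map<char, int>> edges;  // edges[s]: label -> target
};

// Grows `states` to include everything reachable via epsilon transitions.
std::set<int> epsilonClosure(const Nfa& nfa, std::set<int> states) {
    std::vector<int> work(states.begin(), states.end());
    while (!work.empty()) {
        const int s = work.back();
        work.pop_back();
        const auto range = nfa.edges[s].equal_range(kEpsilon);
        for (auto it = range.first; it != range.second; ++it) {
            if (states.insert(it->second).second) work.push_back(it->second);
        }
    }
    return states;
}

// Powerset construction: each DFA state stands for a set of NFA states.
Dfa toDfa(const Nfa& nfa) {
    Dfa dfa;
    std::map<std::set<int>, int> ids;  // NFA state set -> DFA state id
    std::queue<std::set<int>> pending;

    // Returns the DFA id for `subset`, creating and queueing it if it is new.
    const auto intern = [&](const std::set<int>& subset) -> int {
        const auto inserted = ids.emplace(subset, static_cast<int>(ids.size()));
        const int id = inserted.first->second;
        if (inserted.second) {
            dfa.edges.emplace_back();
            for (int s : subset) {
                if (nfa.accepting.count(s)) dfa.accepting.insert(id);
            }
            pending.push(subset);
        }
        return id;
    };

    dfa.start = intern(epsilonClosure(nfa, {nfa.start}));
    while (!pending.empty()) {
        const std::set<int> subset = pending.front();
        pending.pop();
        const int from = ids.at(subset);
        // Group the non-epsilon moves out of this subset by label.
        std::map<char, std::set<int>> moves;
        for (int s : subset) {
            for (const auto& edge : nfa.edges[s]) {
                if (edge.first != kEpsilon) moves[edge.first].insert(edge.second);
            }
        }
        for (const auto& move : moves) {
            const int to = intern(epsilonClosure(nfa, move.second));
            dfa.edges[from][move.first] = to;
        }
    }
    return dfa;
}
```

Only subsets reachable from the start state are ever created, so the worst-case exponential blowup shows up only for patterns that actually need it. Merging equivalent states afterwards is a separate pass that this sketch does not attempt.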

features

  • powerset construction
  • fix parser
    • allow basic concat, union, kleene, and grouping
    • add +, ?
    • add () as capturing groups
    • add (?:) to support non-capturing groups
    • add []
    • add .
    • non-greedy matching: +? and *?
    • add [^]
    • add {n}
    • add {min,}
    • add {,max}
    • add {min,max}
    • add ^ and $ to support search vs. match
    • back references, maybe limited to a maximum of 9 or something
    • possessive matching
    • lookahead and lookbehind
    • consider predefined character groups
  • add support to extract groups
  • add a search vs match mode
  • jitted regex
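
One possible way to handle the counted repetitions in the list above ({n}, {min,}, {,max}, {min,max}) is to rewrite them into operators the parser would already support. The sketch below assumes the explicit-dot concat form and uses a hypothetical helper, `expandRepeat`, that is not part of this repo. A negative `max` is an assumed convention for "no upper bound".

```cpp
#include <string>

// Hypothetical helper (not from this repo): rewrites atom{min,max} using only
// concatenation ('.'), '?' and '*'. A negative max means "no upper bound".
// Assumes `atom` is already a single unit, e.g. "a" or "(?:a.b)".
std::string expandRepeat(const std::string& atom, int min, int max) {
    std::string out;
    // Appends one piece, separated from the previous piece by a concat dot.
    const auto append = [&out](const std::string& piece) {
        if (!out.empty()) out += '.';
        out += piece;
    };
    for (int i = 0; i < min; ++i) append(atom);  // required copies
    if (max < 0) {
        append(atom + "*");  // {min,}: open-ended tail
    } else {
        for (int i = min; i < max; ++i) append(atom + "?");  // optional copies
    }
    return out;
}
```

For instance, `expandRepeat("a", 1, 3)` gives `a.a?.a?` and `expandRepeat("a", 2, -1)` gives `a.a.a*`. The rewritten pattern grows linearly with `max`, so large bounds would inflate the NFA and DFA. `{0,0}` yields an empty string, which a real parser would need to treat as the empty match.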

software engineering todo

  • add my suite of makefiles and scripts
  • separate out files
  • ensure better const correctness
  • toC, toNasm, etc. using the visitor pattern
  • add iterator support for states and transitions
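
As a rough illustration of the visitor item above, the sketch below separates walking a DFA from emitting code for it, so that a toC backend and a toNasm backend could share the same traversal. Every name here (`Transition`, `DfaTable`, `DfaVisitor`, `CEmitter`) is hypothetical and not taken from this repo, and the emitted C is only the body of a `switch (state)` that would sit inside a per-character loop.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical types for illustration; the repo's actual classes may differ.
struct Transition { int from; char label; int to; };

struct DfaVisitor {
    virtual ~DfaVisitor() = default;
    virtual void enterState(int id, bool accepting) = 0;
    virtual void visitTransition(const Transition& t) = 0;
    virtual void leaveState(int id) = 0;
};

struct DfaTable {
    std::vector<bool> accepting;  // accepting[s] for each state s
    std::vector<Transition> transitions;

    // Walks states in order, reporting each state and its outgoing edges.
    void accept(DfaVisitor& v) const {
        for (int s = 0; s < static_cast<int>(accepting.size()); ++s) {
            v.enterState(s, accepting[s]);
            for (const Transition& t : transitions) {
                if (t.from == s) v.visitTransition(t);
            }
            v.leaveState(s);
        }
    }
};

// One possible backend: emits the cases of a C `switch (state)`.
// Labels are printed raw, so characters such as ' or \ are not escaped.
class CEmitter : public DfaVisitor {
public:
    void enterState(int id, bool accepting) override {
        out_ << "case " << id << ":" << (accepting ? " /* accepting */" : "") << "\n";
    }
    void visitTransition(const Transition& t) override {
        out_ << "  if (c == '" << t.label << "') { state = " << t.to << "; break; }\n";
    }
    void leaveState(int) override {
        out_ << "  return 0;\n";  // no transition matched: reject
    }
    std::string str() const { return out_.str(); }

private:
    std::ostringstream out_;
};
```

A NASM backend would be another `DfaVisitor` subclass with the same three hooks, leaving `DfaTable::accept` unchanged. The linear scan over `transitions` per state is kept for brevity; iterator support for states and transitions would be the natural replacement.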