Poodle-Lex is a lexical analyzer generator, similar to lex and its variants. It produces source code which scans an input stream and outputs a stream of word units. Features include Unicode support, hierarchical patterns, and a platform for supporting arbitrary programming languages through plugins.

Usage:
    Poodle-Lex RULES_FILE OUTPUT_DIR [Optional arguments]

Required arguments:
    RULES_FILE: A file containing token names and pattern definitions
    OUTPUT_DIR: An existing directory to which the state machine code is written

Optional arguments:
    -e ENCODING, --encoding ENCODING - Encoding of the input file. ENCODING must be an encoding codec name supported by Python.
    -l LANGUAGE, --language LANGUAGE - Output emitter language
    -c CLASS_NAME, --class-name CLASS_NAME - The name of the lexical analyzer class to generate
    -s NAMESPACE, --namespace NAMESPACE - The namespace name of the lexical analyzer class to generate
    -f FILE_NAME, --file-name FILE_NAME - The base name of the generated source files
    -i ALGORITHM, --minimizer ALGORITHM - Algorithm to use for minimizing the DFA. The default is hopcroft; polynomial is also allowed.
    -n DOT_FILE, --print-nfa DOT_FILE - Print a GraphViz (.dot) file with the lexical analyzer's NFA representation
    -d DOT_FILE, --print-dfa DOT_FILE - Print a GraphViz (.dot) file with the lexical analyzer's DFA representation
    -m DOT_FILE, --print-min-dfa DOT_FILE - Print a GraphViz (.dot) file with the lexical analyzer's minimized DFA representation

Commands:
    Poodle-Lex list-languages: Prints a list of languages supported by the --language option
    Poodle-Lex list-minimizers: Prints a list of DFA minimization algorithms supported by the --minimizer option

Rules file:
The rules file is a simple format which describes the lexical analyzer as a set of token definitions, one on each line. The rules file consists of five types of statements:
1) Token rules
2) Variable definitions
3) Token name reservations
4) Comments
5) Section blocks

1) Token rules
Token rules take one of the following forms:
- [COMMANDS] TOKEN_ID: [PATTERN] [SECTION_ACTION]
- Skip: PATTERN [SECTION_ACTION]

The first form is a rule which returns a token. TOKEN_ID is the name of the rule. PATTERN is a regular expression defining a pattern of text which should be matched to the rule, surrounded by either single or double quotes. The surrounding quote character can be escaped by doubling it ("" or '') inside of the pattern. If an "i" appears in front of the opening quote, the pattern is interpreted as case-insensitive for ASCII Latin characters.

The second form is a rule which is skipped and does not return a token. The same syntax applies, but the token ID may be omitted.

PATTERN may consist of multiple strings concatenated by "+". If a line ends with "+", the expression may extend over multiple lines. If a piece of text matches two rules, the rule appearing first is used.
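For illustration, a few token rules might look like the following. This is a hypothetical sketch based on the forms above; the token names and patterns are invented, and section actions (described next) are omitted:

    # Hypothetical token rules; names and patterns are illustrative only
    Skip: "[ \t\r\n]+"                    # Matched, but produces no token
    EndKeyword: i"end"                    # Case-insensitive: matches end, End, END, ...
    Identifier: "[[:alpha:]][[:word:]]*"  # Listed second, so EndKeyword wins for "end"
    QuoteChar: ''''                       # A doubled quote escapes the delimiter
    # A trailing "+" continues the pattern on the next line
    Number: "[0-9]+" +
            "(\.[0-9]+)?"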
SECTION_ACTION is a command which tells the lexical analyzer to enter or exit a different context, referred to as a "section", after matching the rule. It can be one of the following:
- Enter SECTION_NAME
- Switch SECTION_NAME
- Exit Section
- [Enter|Switch] Section [ATTRIBUTES] ... Exit Section

The first form instructs the lexical analyzer to enter an existing section, where SECTION_NAME is an identifier matching the section. For anonymous sections, the rule name is the section name. The second form instructs the lexical analyzer to switch to a different section without remembering the current section. The third form instructs the lexical analyzer to exit the current section. The fourth form defines an anonymous section and instructs the lexical analyzer to enter or switch to it.

Sections in other areas of the hierarchy can be referred to using scoped section names. A scoped name is a set of names separated by periods. The current section is searched for a section named as the first name in the set; the found section is then searched for the second name, and so forth. If the name begins with a period, the root of the rules file is searched.

The lexical analyzer remembers the order in which each section is entered. When a section is exited, the lexical analyzer returns to the section it was in previously. When a section is switched, the current section changes, but the movement is not recorded.

COMMANDS is an optional, comma-separated list of words describing actions taken regarding the token. Commands are case-insensitive, and the valid commands are:
- Skip: The pattern will be matched, but no token will be produced
- Capture: The text of the match will be returned along with the token
- Reserve: The ID will be reserved, and optionally a pattern will be stored
- Import: Use the same rule name, and optionally the same pattern, as an existing rule

2) Variable definitions
Variable definitions take the following form:

    Let VARIABLE_NAME = PATTERN

Variables can be substituted within a pattern, but are not themselves rules. VARIABLE_NAME is the name of the variable to define. PATTERN is a pattern in the same format as in token rules.

3) Token name reservations
Token name reservations take the following form:

    Reserve TOKEN_ID[: PATTERN]

This adds TOKEN_ID to the list of token IDs. It can be used to reserve token IDs, or to create IDs that can be emitted by a preprocessor. A pattern can optionally be stored with the token ID. If the reserved token is later imported by a rule, the rule's pattern only needs to be specified if one was not stored with the reservation.

4) Comments
Any text on a line after "#" is considered a comment.

5) Section blocks
Finally, different contexts, referred to as sections or section blocks, can be defined. These are useful if, at certain points, the lexical analyzer should use a different set of rules. A section can be defined in two ways: as a standalone section, or anonymously as part of a rule. See the section on token rules for how to define a section as part of a rule. Standalone sections take the following form:

    Section SECTION_NAME [ATTRIBUTES]
        CONTENT
    End Section

SECTION_NAME is the name of the section. CONTENT is a set of rules, variable definitions, comments, or sub-sections that exist within the context of the section. Sections can contain other sections for organizational purposes, though this only affects behavior if attributes are defined for the section. ATTRIBUTES is a comma-separated list of attributes which can be applied to the section. Each attribute defines the behavior of the lexical analyzer if a rule is not matched on the first character of a token. The following attributes can be applied:
- Inherits: If the first character does not fit a rule in this section, attempt to match the rule in the section containing this one
- Exits: If the first character does not fit a rule in this section, exit the section and return to the section the lexical analyzer was in previously
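The fragment below ties these statement types together into a small rules file. It is a hypothetical sketch: the token names, patterns, and section name are invented for illustration, and the indentation inside the section is purely cosmetic:

    # Hypothetical rules file fragment combining the statement types above
    Let Digits = "[0-9]+"

    Reserve Macro                        # Reserved so a preprocessor can emit it
    Skip: "[ \t\r\n]+"
    Capture Number: "{Digits}"           # "Capture" returns the matched text
    Capture Identifier: "[[:alpha:]][[:word:]]*"
    StringStart: "'" Enter String        # Enter the section below after matching

    Section String
        Capture StringText: "[^'\n]+"
        StringEnd: "'" Exit Section      # Return to the previous section
    End Section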
Language plug-ins:
As packaged, Poodle-Lex supports the following language arguments:
- freebasic: FreeBASIC language emitter with full Unicode support
- cpp: C++ emitter which consumes UTF-8 input

The default language is freebasic. More languages can be supported by installing plugins.

Regular expression rules:
Poodle-Lex patterns use a regular expression syntax similar to the IEEE POSIX ERE standard. A pattern matches its characters literally, and the following special characters can be used to define more complex patterns:
".": Matches any Unicode codepoint
"[...]": Matches any one of the characters inside the brackets
"[^...]": Matches any Unicode codepoint except the characters inside the brackets
"{m, n}": Matches the previous item if it repeats a minimum of m times and a maximum of n times
"*": Matches the previous item if it appears zero or more times in a row
"+": Matches the previous item if it appears one or more times in a row
"?": Matches the previous item if it appears zero or one times
"(...)": Matches the pattern within the parentheses
"{...}": Matches a pattern defined by a variable, identified by the text between the curly braces
"\": Matches a special character or the next character literally
"|": Matches either the previous item or the next item

Variable names within a pattern follow the same scoping rules as section names. A scoped name is a set of names separated by periods. If there are multiple names, all names but the last refer to sections within the hierarchy: the current section is searched for the first section name, the matched section is searched for the next, and so forth. The last matched section is then searched for the variable name. If the scoped name begins with a period, the root of the document is searched.

The following standard special characters are NOT accepted:
"$": In the POSIX ERE standard, matches text at the end of a line or input
"^": In the POSIX ERE standard, matches text at the start of a line or input

These symbols are not supported because they require additional context. They are not allowed except when escaped, and they are reserved for future use.

The following escape sequences are supported:
"\t": Matches a tab (\x09) character
"\r": Matches a carriage return (\x0d) character
"\n": Matches a line feed (\x0a) character
"\x**": Matches a codepoint between 0 and 255, with "*" being any hexadecimal digit
"\u****": Matches a codepoint between 0 and 65535, with "*" being any hexadecimal digit
"\U******": Matches any Unicode codepoint, with "*" being any hexadecimal digit

The following named character classes are also supported within a character class:
"[:alnum:]" - Alphanumeric characters ([A-Za-z0-9])
"[:word:]" - Alphanumeric characters plus "_" ([A-Za-z0-9_])
"[:alpha:]" - Alphabetical characters ([A-Za-z])
"[:blank:]" - Space and tab ([ \t])
"[:cntrl:]" - ASCII control characters ([\x01-\x1F])
"[:digit:]" - Numerical digits ([0-9])
"[:graph:]" - Visible characters ([\x21-\x7F])
"[:lower:]" - Lowercase characters ([a-z])
"[:print:]" - Printable characters ([\x20-\x7F])
"[:punct:]" - Punctuation ([\]\[\!\"\#\$\%\&\'\(\)\*\+\,\.\/\:\;\<\=\>\?\@\\\^\_\`\{\|\}\~\-])
"[:space:]" - Whitespace characters ([ \t\r\n\v\f])
"[:upper:]" - Uppercase characters ([A-Z])
"[:xdigit:]" - Hexadecimal digits ([A-Fa-f0-9])
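As an illustration of this syntax, the hypothetical variable definitions below combine several of these constructs. The variable names are invented for this example and do not come from any shipped rules file:

    # Hypothetical variable definitions exercising the pattern syntax
    Let Sign     = "(\+|\-)?"                # Grouping, alternation, escaped "+" and "-"
    Let Integer  = "{Sign}[0-9]+"            # "{...}" substitutes the Sign variable
    Let Exponent = "[eE][0-9]{1, 6}"         # "{m, n}" bounds the repetition
    Let Word     = "[[:alpha:]][[:word:]]*"  # Named classes inside a character class
    Let Line     = "[^\n]*"                  # Anything except a line feed (\x0a)
    Let Bracket  = "\x5B"                    # Hexadecimal codepoint escape for "["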
Output:
When run, the program generates source code which provides an interface. The interface takes in a stream of text and produces "tokens" in the form of a structure. The structure contains an enumerated identifier and, optionally, a string containing the text which was grouped into the token.

Example:
The following generates, compiles, and executes a simple FreeBASIC test application on Linux and Windows. Python 2.7 and FreeBASIC >= 0.90.1 should be installed prior to running.

Windows:
1) Add the Poodle-Lex installation directory to your PATH environment variable
2) Copy any .rules files in the "Example" folder from the installation directory to any folder
3) Navigate to that folder in the command prompt
4) Enter the following, replacing RULES_FILE with the name of the rules file being used:

    mkdir Output
    Poodle-Lex RULES_FILE Output
    cd Output\Demo
    make_demo
    Demo

Linux:
1) Sync the source
2) Install Python 2.7
3) Install blist (Ubuntu: sudo apt-get install python-blist)
4) Navigate to the source directory
5) Enter the following, replacing RULES_FILE with the name of the rules file being used:

    mkdir Output
    python __main__.py RULES_FILE Output
    cd Output/Demo
    ./make_demo.sh
    ./Demo

Credits:
Parker Michaels - Primary developer
AGS - C lexer example, countless bug reports and tests

Thank you to everyone who helped test or provided feedback on this tool, including members of the FreeBasic.net forum such as TFS.