Elzawawy/compiler-frontend

[LEXGEN-1] Build Language Parser

Elzawawy opened this issue · 0 comments

The Language Parser Class

  • In other words, it is a file reader, that parses lexems specification.
  • Takes an input file with format specified in requirements document.
  • Emits an output with (Table of Inputs - List of Regular Expressions).

Token

It is a pair consisting of a token name (class) and an optional attribute value. The token names are the input symbols that the parser processes.

Pattern

It is a description of the form that the lexemes of a token may take.

Lexeme

It is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.

Some notes:

  1. One token for each keyword. The pattern for a keyword is the same as the keyword itself.
  2. Tokens for the operators, either individually or in classes.
  3. One token representing all identifiers.
  4. One or more tokens representing constants, such as numbers and literal strings.
  5. Tokens for each punctuation symbol, such as left and right parentheses,comma, and semicolon.

Alphabet

It is any finite set of symbols. The set {0,1} is the binary alphabet.

String

A string over an alphabet is a finite sequence of symbols drawn from that alphabet. The empty string, denoted e, is the string of length zero.

Language

It is any countable set of strings over some fixed alphabet.
In lexical analysis, the most important operations on languages are Union, Concatenation, Kleene Closure, and Positive Closure

RE (Regular Expressions)

  • A convenient notation for specifying lexeme patterns.
  • While they cannot express all possible patterns, they are very effective in specifying those types of patterns that we actually need for tokens.
  • Regular expressions notation has come into common use for describing all the languages that can be built from language operators applied to the symbols of some alphabet.
  • Each regular expression r denotes a language L(r), which is also defined recursively from the languages denoted by r's sub-expressions.
  • Regular expressions often contain unnecessary pairs of parentheses. We may drop certain pairs of parentheses if we adopt the conventions that:
  • The unary operator * has highest precedence and is left associative.
  • Concatenation has second highest precedence and is left associative.
  • Union | has lowest precedence and is left associative.
  • We may replace the regular expression(a)|((b)*(c)) by a|b*c.
  • A language that can be defined by a regular expression is called a regular set. If two regular expressions r and s denote the same regular set, we say they are equivalent and write r = s.

RD (Regular Definitions)

  • We may wish to give names to certain regular expressions and use those names in subsequent expressions, as if the names were themselves symbols.