/effigy

Small language that compiles to Python37 bytecode

Primary LanguageJavaScriptGNU General Public License v3.0GPL-3.0

Table of Contents

  1. Effigy
    1. How to play with it
      1. Currently Supported Types of Values
      2. Language Features
      3. Very useful things missing
    2. How does it work
      1. Parser Generator for Parsing Expression Grammars (PEG)
    3. Host Language
    4. Resources
      1. On Parsing & Parsing Expression Grammars
      2. On the Python Compiler & Bytecode Format

Effigy

This is an experiment on building a small language compiler on top of a home brewed parsing expression grammar implementation.

The language implemented in this project, effigy, currently compiles down to a subset of the Python 3.7 bytecode format. More specifically, the Effigy compiler produces .pyc files.

Effigy's runtime is the Python 3.7 Virtual Machine. The difference is just how the bytecode gets generated. Most idioms like declaring literals, calling functions, assigning variables etc have the exact same semantics as in regular Python code.

Effigy differs from Python on the use of functions for control flow a little more often and the absence of classes (might be added later).

How to play with it

Effigy is currently a teeny little JavaScript program. You can install it with npm i efgc. After that, you can type your effigy programs in a file and then run efgc yourfile.efg. That will generate a .pyc file in the same directory as the source file that can be ran with Python (currently only 3.7).

Here's what's available and some of what's not:

Currently Supported Types of Values

  • integers
  • strings (double quotes only. Single quotes currently yield syntax error)
  • lists
  • functions (named and anonymous)

Language Features

  • Arithmetic Operators
  • Logic Operators
  • Comparison Operators
  • Flow Control (if/else/while/for)
  • Exceptions (single catch block for now)
  • Imports

Very useful things missing

  • Slice notation
  • Variadic arguments
  • Named/Default parameters
  • Floating points

How does it work

As mentioned in the introduction, Effigy is an experiment. So it probably won't be a good example of how to write the next industry standard compiler, but it should give insights about what compilers do and at least one way of doing it.

The current version of the efgc compiler is broken down into three main pieces: 1) PEG parser-generator, 2) bytecode translator, 3) assembler. Let's look at them separately.

Parser Generator for Parsing Expression Grammars (PEG)

The PEG is the most basic component of this compiler. It's what the compiler uses to 1) Parse the program text into a parse tree and 2) to transform the parse tree into bytecode.

PEGs provide very similar functionality compared to Context Free Grammars. The most relevant difference is 1. being deterministic 2. allowing infinite lookahead via predicates. This allows PEGs to provide functionality for both syntactical and semantic matching. To read beyond this vague definition, I suggest reading the article that introduced the concept.

The API for parsing text currently looks like this:

> const g = peg.pegc('Digit <- [0-9]+');  // Compile Grammar
> g.match('123')                          // Match some input
['Digit', ['1', '2', '3']]

There's also an API for matching data structures (lists):

> peg.pegc('List <- { "a" { "b" } }').matchl(["a", ["b"]])
['L', ['a', ['b']]]

In very practical terms, this home grown PEG implementation is being used in the parser and the translator pieces. And besides the grammar language, this PEG also provides semantic actions exposed via the JavaScript API (not in the grammar language). Allowing the user to declare traversals for the output trees captured from successful matching. E.g.:

> const join = x => Array.isArray(x) ? x.join('') : x; // Helper for joining lists of strings together
> const g = peg.pegc('Digit <- [0-9]+') // Compile Grammar
> const r = g.bind({ Digit: ({ visit }) => parseInt(join(visit()), 10) }); // Bind semantic actions
> r('123')
123

It is worth mentioning that bindl() is also available for binding semantic actions to a generator that will process data structures (lists) instead of text.

The semantic actions are modular. They're not executed until the whole match is finished successfully. That way, the user of the PEG engine doesn't ever have to think about the backtracking that happens behind the scenes.

This PEG implementation has no dependencies besides the host language used to write the file peg.js.

Sadly there are a few valuable things that I didn't get to implement yet that would considerably increase the quality of the PEG implementation:

  • Error Reporting. Although parser generators sometimes get bad fame for their error reporting, there is some modern literature on how to allow pretty good error reporting. The best this PEG does is to report accurately the farther failure position heuristics that tell how far on the input the current grammar was able to match before the error happened. Link for the aforementioned modern literature. Current error reporting on list matching is awful to say the least. It literally only tells you that it didn't match a list.

  • Arity of PEG operators. The operator OneOrMore (+) returns an item if it matches one and a list if it matches many. And the list is flattened. The ZeroOrMore (*) operator behaves similarly to (+) but can also return nothing. Which is represented with null. These are a bit confusing but I'm not really sure if I found all the answers to design something better yet.

  • Left recursion. There's a branch for supporting that. It currently misses mutual left recursion support so it's not merged yet. The implementation leverages bounded left recursion.

Host Language

Although the first target of the little compiler is a subset of Python, JavaScript was chosen as the host language for a few reasons:

  1. I didn't want to do it in Python because it'd be very tempting to use one of its modules for parsing, scope analysis or code generation. I wanted to implement all the pieces of the compiler to be able to reason how far I could leverage the PEG to do those tasks.

  2. Python and JavaScript have very similar semantics for closures but present slight differences in how side-effect (assignment) of values declared in enclosed scopes work. Java Script separates assignment from declaration, Python provides the nonlocal keyword.

    I wanted something right in the middle for Effigy: Assignment is coupled to declaring a variable, but provides the keyword let to mark names to be saved as closures so assignments in deeper scopes will know its not a new value.

  3. It doesn't really matter. The goal is to rewrite Effigy with Effigy.

Resources

On Parsing & Parsing Expression Grammars

On the Python Compiler & Bytecode Format