Tools for parsing two-dimensional programming languages.
Suppose we want to parse a diagram representing a path, with >, v, <, and ^ each being a single step:

>v >>
 v ^
 >>>^
One way of tokenizing this is to interpret each of these steps as a token, with a value representing its direction.
from parse_2d import Diagram, TinyTokenizer, tokenize

diagram = Diagram.from_string(">v >>\n v ^\n >>>^")

tokenizers = [
    TinyTokenizer(">", 0),
    TinyTokenizer("v", 1),
    TinyTokenizer("<", 2),
    TinyTokenizer("^", 3),
]

for token in tokenize(diagram, tokenizers):
    print(token)
Each Token has a region and a value. The region is the area it covers in the original diagram, while the value can be any Python object representing what you've tokenized.
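As a minimal sketch of inspecting those fields (assuming, as the description above suggests, that they are exposed as region and value attributes):

for token in tokenize(diagram, tokenizers):
    # token.value is the direction code given to the TinyTokenizer (0-3 above);
    # token.region is the area of the original diagram that the token covers.
    print(token.value, token.region)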
Alternatively, you can extract the path as a single token, using the WireTokenizer, or as a directed path, by subclassing WireTokenizer.
A more complete sample, which parses the Circuit Diagram language, is also provided to demonstrate the use of these tools.
A Diagram is an infinite two-dimensional grid of "symbols", with a distinguished "whitespace" symbol. Diagrams may be instantiated with a list of lists and the whitespace symbol, or by the from_string method.
>>> diagram = Diagram([[1, 2], [3]], 0)
>>> diagram[(0, 1)]
3
>>> diagram[(1, 1)]
0
>>> diagram[(-30, 17)]
0
>>> diagram = Diagram.from_string("ab\nc")
>>> diagram[(0, 1)]
'c'
>>> diagram[(1, 1)]
' '
A Region is an area on a diagram. Custom Regions may be made by inheriting from Region. The following Regions are provided by default:

A Region consisting of a single point. Has the location property to provide that point.

A rectangular Region, aligned with the axes, consisting of the points bounded by top_left and bottom_right, including the top and left edges, and excluding the bottom and right edges (analogously to range).

A Region consisting of a collection of disparate points. Has the contents property to provide that frozenset of points.
A Token consists of a region covered, and a value that the token represents.
A Tokenizer is an object for extracting tokens from diagrams. Custom Tokenizer classes may be made by inheriting from Tokenizer, and overriding the starts_on and extract_token methods. See the Tokenizer docstring for more details.
Tokenizer for tokens represented by a single symbol. Extracts a token of value token_value for every occurrence of symbol in the diagram.
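For example, since the value can be any Python object, the numeric direction codes used earlier could just as well be strings:

tokenizers = [
    TinyTokenizer(">", "east"),
    TinyTokenizer("v", "south"),
    TinyTokenizer("<", "west"),
    TinyTokenizer("^", "north"),
]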
Tokenizer for tokens represented by a fixed template of symbols.
The template
is either a mapping of relative locations to symbols, or a Diagram
.
Extracts a token of value token_value
for every non-overlapping translation of the template found in the parent Diagram.
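As an illustrative sketch (the class name TemplateTokenizer and the (template, token_value) argument order are assumptions here, not shown in the text above), a template matching a > immediately followed by a v could be given as a mapping of relative locations to symbols:

# Hypothetical usage: match the two-symbol pattern ">v" anywhere in a diagram.
turn_tokenizer = TemplateTokenizer({(0, 0): ">", (1, 0): "v"}, "turn")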
Tokenizer for wire tokens, represented by a path through a diagram. A wire consists of multiple symbol "segments", each of which has a fixed collection of directions it can connect to. segment_connections is a mapping from segment symbols to a collection of that segment's available connections. Extracts a wire token representing the available connections to that wire.

This class assumes that segments connect all possible incoming directions to all possible outgoing directions. Child classes may override this behavior by overriding the connections method. See the WireTokenizer docstring for more details.
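As a rough sketch of the shape of segment_connections for a simple wire language of -, |, and + segments (the plain strings below are illustrative stand-ins; the library's actual direction values are not shown in this excerpt):

segment_connections = {
    "-": {"left", "right"},                 # horizontal segment
    "|": {"up", "down"},                    # vertical segment
    "+": {"up", "down", "left", "right"},   # junction connecting all four
}
wire_tokenizer = WireTokenizer(segment_connections)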
Tokenizer for tokens represented by a box of edge symbols. edge_tokens is a mapping from a side of the box to the collection of symbols that may be used for that edge. contents_tokenizer is a function used to determine the value of the extracted token; it is passed the entire box (including the edge) as its only parameter.
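A rough sketch of the shape these arguments take (the side keys and the helper below are illustrative only; the excerpt does not show how sides are actually identified):

def box_value(box_diagram):
    # Receives the whole box, edges included, and returns the token's value.
    return "box"

edge_tokens = {
    "top": {"-"}, "bottom": {"-"},   # symbols allowed on horizontal edges
    "left": {"|"}, "right": {"|"},   # symbols allowed on vertical edges
}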
Yields the non-overlapping tokens found in the diagram by the list of tokenizers.
Install and update using pip:
pip install parse_2d