skvadrik/re2c

How to support Python-style indentation?


Are there any examples of handling indentation?

For example:
test.py

if a:
    if b:
      if c:
         if d:
             pass
         pass
      pass
    pass

$ python3 -m tokenize -e ./test.py
0,0-0,0:            ENCODING       'utf-8'
1,0-1,2:            NAME           'if'
1,3-1,4:            NAME           'a'
1,4-1,5:            COLON          ':'
1,5-1,6:            NEWLINE        '\n'
2,0-2,4:            INDENT         '    '
2,4-2,6:            NAME           'if'
2,7-2,8:            NAME           'b'
2,8-2,9:            COLON          ':'
2,9-2,10:           NEWLINE        '\n'
3,0-3,6:            INDENT         '      '
3,6-3,8:            NAME           'if'
3,9-3,10:           NAME           'c'
3,10-3,11:          COLON          ':'
3,11-3,12:          NEWLINE        '\n'
4,0-4,9:            INDENT         '         '
4,9-4,11:           NAME           'if'
4,12-4,13:          NAME           'd'
4,13-4,14:          COLON          ':'
4,14-4,15:          NEWLINE        '\n'
5,0-5,13:           INDENT         '             '
5,13-5,17:          NAME           'pass'
5,17-5,18:          NEWLINE        '\n'
6,9-6,9:            DEDENT         ''
6,9-6,13:           NAME           'pass'
6,13-6,14:          NEWLINE        '\n'
7,6-7,6:            DEDENT         ''
7,6-7,10:           NAME           'pass'
7,10-7,11:          NEWLINE        '\n'
8,4-8,4:            DEDENT         ''
8,4-8,8:            NAME           'pass'
8,8-8,9:            NEWLINE        '\n'
9,0-9,1:            NL             '\n'
10,0-10,0:          DEDENT         ''
10,0-10,0:          ENDMARKER      ''

There is no automatic indentation or location handling. You can have a rule with tags surrounding indentation, like this:

    @x space* @y something { indent = (y - x) / 4; ... }

The way many parsers handle things like this is to keep track of the indentation levels in some state variable and to issue synthetic indent and unindent tokens whenever whitespace at the start of line is encountered that does not conform to the previous indentation level.
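To make the state-variable idea concrete, here is a minimal sketch in Python rather than re2c, so it can be run directly. The function name `indent_tokens` and the `LINE` token kind are made up for illustration. It assumes indentation consists of spaces only and that blank and comment-only lines are skipped; a real Python tokenizer also has to deal with tabs, line continuations, and brackets.

```python
def indent_tokens(lines):
    """Yield (kind, text) pairs with synthetic INDENT/DEDENT tokens.

    Hypothetical helper, not part of re2c. Assumes space-only indentation.
    """
    stack = [0]  # indentation widths of the currently open blocks
    for line in lines:
        stripped = line.lstrip(" ")
        # Blank and comment-only lines do not change the indentation level.
        if not stripped.strip() or stripped.startswith("#"):
            continue
        width = len(line) - len(stripped)
        if width > stack[-1]:
            # Deeper than the current block: open a new one.
            stack.append(width)
            yield ("INDENT", line[:width])
        else:
            # Shallower: close blocks until a matching level is found.
            while width < stack[-1]:
                stack.pop()
                yield ("DEDENT", "")
            if width != stack[-1]:
                raise IndentationError(
                    "unindent does not match any outer indentation level"
                )
        yield ("LINE", stripped.rstrip("\n"))
    # End of input: close every block that is still open.
    while len(stack) > 1:
        stack.pop()
        yield ("DEDENT", "")
```

Comparing widths against a stack instead of dividing by a fixed step means irregular indentation such as 4, 6, 9, 13 columns is accepted, as long as every dedent returns to a width that is already on the stack.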

Thanks all. I found a solution using an indent stack and tags. It works, but there are many corner cases. Once I can fully tokenize Python 3.10, I will update this issue.

My current solution: https://github.com/lijunchen/pyser/blob/ead8f46a2847905d4757ed194c604d0ca493c2f0/src/tokenizer.re2c
Indent stack: https://matt.might.net/articles/standalone-lexers-with-lex/