How to support Python style indentation?

Question

Closed this issue a year ago · 3 comments

Are there any examples of handling indentation?

For example:
test.py

if a:
    if b:
      if c:
         if d:
             pass
         pass
      pass
    pass

$ python3 -m tokenize -e ./test.py
0,0-0,0:            ENCODING       'utf-8'
1,0-1,2:            NAME           'if'
1,3-1,4:            NAME           'a'
1,4-1,5:            COLON          ':'
1,5-1,6:            NEWLINE        '\n'
2,0-2,4:            INDENT         '    '
2,4-2,6:            NAME           'if'
2,7-2,8:            NAME           'b'
2,8-2,9:            COLON          ':'
2,9-2,10:           NEWLINE        '\n'
3,0-3,6:            INDENT         '      '
3,6-3,8:            NAME           'if'
3,9-3,10:           NAME           'c'
3,10-3,11:          COLON          ':'
3,11-3,12:          NEWLINE        '\n'
4,0-4,9:            INDENT         '         '
4,9-4,11:           NAME           'if'
4,12-4,13:          NAME           'd'
4,13-4,14:          COLON          ':'
4,14-4,15:          NEWLINE        '\n'
5,0-5,13:           INDENT         '             '
5,13-5,17:          NAME           'pass'
5,17-5,18:          NEWLINE        '\n'
6,9-6,9:            DEDENT         ''
6,9-6,13:           NAME           'pass'
6,13-6,14:          NEWLINE        '\n'
7,6-7,6:            DEDENT         ''
7,6-7,10:           NAME           'pass'
7,10-7,11:          NEWLINE        '\n'
8,4-8,4:            DEDENT         ''
8,4-8,8:            NAME           'pass'
8,8-8,9:            NEWLINE        '\n'
9,0-9,1:            NL             '\n'
10,0-10,0:          DEDENT         ''
10,0-10,0:          ENDMARKER      ''

Answer 1 · 2022-03-05T09:30:02.000Z

There is no automatic indentation or location handling. You can have a rule with tags surrounding indentation, like this:

    @x space* @y something { indent = (y - x) / 4; ... }
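
For readers who do not use tags, here is a rough Python analogue of the arithmetic in that rule. It is only a sketch: the regex group stands in for the span between the `@x` and `@y` tags, and `indent_level` and its `width` parameter are hypothetical names, not part of any library.

```python
import re

# Group 1 plays the role of the span between tags @x and @y:
# the run of spaces at the start of the line.
LEADING_SPACES = re.compile(r"( *)")

def indent_level(line, width=4):
    """Return the indentation level, assuming a fixed indent width."""
    m = LEADING_SPACES.match(line)
    x, y = m.start(1), m.end(1)   # positions of the two "tags"
    return (y - x) // width       # same arithmetic as (y - x) / 4
```

Note that this only fits sources with a fixed indent width. In the test.py from the question, the 4-space and 6-space lines would both come out as level 1.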

Answer 2 · 2022-04-04T15:53:31.000Z

Many parsers handle this by keeping the current indentation levels in a state variable and emitting synthetic indent and unindent tokens whenever the whitespace at the start of a line differs from the previous indentation level.
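
A minimal sketch of that idea in Python, under some simplifying assumptions: indentation uses spaces only, blank lines are skipped, and each non-blank line is passed through as a single `LINE` token instead of being tokenized further. The function name `indent_tokens` and the token names are made up for illustration.

```python
def indent_tokens(source):
    """Yield (kind, text) pairs with synthetic INDENT/DEDENT tokens."""
    stack = [0]  # indentation widths of the currently open blocks
    for line in source.splitlines():
        stripped = line.lstrip(" ")
        if not stripped:
            continue  # blank lines do not affect indentation
        width = len(line) - len(stripped)
        if width > stack[-1]:
            # Deeper than the enclosing block: open a new level.
            stack.append(width)
            yield ("INDENT", line[:width])
        else:
            # Shallower: close levels until we are back at a known width.
            while width < stack[-1]:
                stack.pop()
                yield ("DEDENT", "")
            if width != stack[-1]:
                raise IndentationError(
                    "unindent does not match any outer indentation level")
        yield ("LINE", stripped)
    # At end of input, close every block that is still open.
    while len(stack) > 1:
        stack.pop()
        yield ("DEDENT", "")
```

Because the stack stores widths rather than levels, irregular indentation such as the 4, 6, 9 and 13 spaces in the question's test.py works: the sketch yields four INDENT tokens and four DEDENT tokens, the last one at end of input. Tabs, comment-only lines, line continuations and open brackets are not handled here.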

Answer 3 · 2022-04-05T04:03:18.000Z

Thanks all. I found a solution using an indent stack and tags. It works, but there are many corner cases. Once I can fully tokenize Python 3.10, I will update this issue.

My current solution: https://github.com/lijunchen/pyser/blob/ead8f46a2847905d4757ed194c604d0ca493c2f0/src/tokenizer.re2c
Indent stack: https://matt.might.net/articles/standalone-lexers-with-lex/