How to support Python style indentation?
Are there any examples of handling indentation?
For example:
test.py
if a:
    if b:
      if c:
         if d:
             pass
         pass
      pass
    pass
$ python3 -m tokenize -e ./test.py
0,0-0,0: ENCODING 'utf-8'
1,0-1,2: NAME 'if'
1,3-1,4: NAME 'a'
1,4-1,5: COLON ':'
1,5-1,6: NEWLINE '\n'
2,0-2,4: INDENT '    '
2,4-2,6: NAME 'if'
2,7-2,8: NAME 'b'
2,8-2,9: COLON ':'
2,9-2,10: NEWLINE '\n'
3,0-3,6: INDENT '      '
3,6-3,8: NAME 'if'
3,9-3,10: NAME 'c'
3,10-3,11: COLON ':'
3,11-3,12: NEWLINE '\n'
4,0-4,9: INDENT '         '
4,9-4,11: NAME 'if'
4,12-4,13: NAME 'd'
4,13-4,14: COLON ':'
4,14-4,15: NEWLINE '\n'
5,0-5,13: INDENT '             '
5,13-5,17: NAME 'pass'
5,17-5,18: NEWLINE '\n'
6,9-6,9: DEDENT ''
6,9-6,13: NAME 'pass'
6,13-6,14: NEWLINE '\n'
7,6-7,6: DEDENT ''
7,6-7,10: NAME 'pass'
7,10-7,11: NEWLINE '\n'
8,4-8,4: DEDENT ''
8,4-8,8: NAME 'pass'
8,8-8,9: NEWLINE '\n'
9,0-9,1: NL '\n'
10,0-10,0: DEDENT ''
10,0-10,0: ENDMARKER ''
There is no automatic indentation or location handling. You can have a rule with tags surrounding indentation, like this:
@x space* @y something { indent = (y - x) / 4; ... }
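To illustrate the idea outside the lexer, here is a rough Python analogue. It is only a sketch: the span of a regex group stands in for the `x` and `y` tags, and the names `LEADING`, `indent_level` and `unit` are made up for this example. Like the rule above, it assumes a fixed indentation unit of 4 spaces, which would not hold for irregular indentation such as the `test.py` in the question.

```python
import re

# Group 1 covers the leading spaces; its start and end offsets play the
# role of the x and y tags in the rule above.
LEADING = re.compile(r"( *)(\S.*)")


def indent_level(line, unit=4):
    """Return the indentation level of a line, or None for a blank line."""
    m = LEADING.match(line)
    if m is None:
        return None  # empty or whitespace-only line
    x, y = m.span(1)  # offsets just before and just after the spaces
    return (y - x) // unit
```

For example, `indent_level("        x")` gives 2 and `indent_level("pass")` gives 0.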
Many parsers handle this by keeping track of the indentation levels in a state variable and emitting synthetic indent and unindent tokens whenever the whitespace at the start of a line does not match the previous indentation level.
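A minimal sketch of that approach in Python, assuming spaces-only indentation and ignoring comments, tabs and line continuation inside brackets (all of which a real Python tokenizer has to deal with). The function name `indent_tokens` and the `LINE` token kind are invented for this illustration; a real lexer would tokenize the rest of each line instead of passing it through.

```python
def indent_tokens(source):
    """Yield (kind, text) pairs with synthetic INDENT/DEDENT tokens."""
    stack = [0]  # indentation widths of the currently open blocks
    for line in source.splitlines():
        stripped = line.lstrip(" ")
        if not stripped:
            continue  # blank lines do not change the indentation level
        width = len(line) - len(stripped)
        if width > stack[-1]:
            # Deeper than the current block: open a new level.
            stack.append(width)
            yield ("INDENT", line[:width])
        else:
            # Shallower: close levels until the width matches one on the stack.
            while width < stack[-1]:
                stack.pop()
                yield ("DEDENT", "")
            if width != stack[-1]:
                raise IndentationError(
                    f"dedent to column {width} matches no outer level"
                )
        yield ("LINE", stripped)
    # End of input closes every block that is still open.
    while len(stack) > 1:
        stack.pop()
        yield ("DEDENT", "")
```

Run on the `test.py` from the question, this yields four `INDENT` and four `DEDENT` tokens, the same counts as in the `python3 -m tokenize` output. Because the stack stores widths rather than levels, irregular indentation (4, 6, 9, 13 columns) needs no special handling.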
Thanks all. I found a solution using an indent stack and tags. It works, but there are many corner cases. Once I can fully tokenize Python 3.10, I will update this issue.
My current solution: https://github.com/lijunchen/pyser/blob/ead8f46a2847905d4757ed194c604d0ca493c2f0/src/tokenizer.re2c
Indent stack: https://matt.might.net/articles/standalone-lexers-with-lex/