danipen/TextMateSharp

How to correctly parse the text in the specified range?

pengsongkun741 opened this issue · 8 comments

I am trying to use this tool to tokenize JSON with its grammar.
When I parse the complete JSON text, I get the correct result, but when I parse a single line on its own, the result is wrong.
I guess the StackElement parameter in IGrammar.TokenizeLine(string lineText, StackElement prevState) is what carries context from one line to the next.

How should I handle this case?
Thanks a lot!

Hi @pengsongkun741.

The call to IGrammar.TokenizeLine(string lineText, StackElement prevState) should accept null for the prevState parameter in those cases where you want to parse a single line.

The line you are parsing should be well-formed from a lexer point of view.

If not, you should include some prev/next lines to build a well-formed text chunk, doing the following to maintain a parsing context:

StackElement ruleStack = null;

foreach (string line in linesToParse)
{
    ITokenizeLineResult result = grammar.TokenizeLine(line, ruleStack);
    ruleStack = result.RuleStack;

    // process the result.GetTokens() for the line you're interested in
}

Demo.zip

Hello, I reproduced this scenario with a simple demo.
It tokenizes the same JSON file twice, once starting from the first line and once starting from the second line, and the results for name are different.
The result when starting from the second line is wrong. Is my usage wrong?

Thanks for your reply!

Hi @pengsongkun741,

The parser behaves correctly.

When you remove the first line, the JSON is no longer well-formed. That means the parser isn't able to correctly identify some JSON-specific rules.

As you can see, VSCode behaves in the same way (it uses the same implementation). The left file is not well formed, so the parser is not able to identify the support.type.property-name.json scope.

image

Sorry, my usage must have been wrong.
I found that grammar.TokenizeLine() is very fast, so I can tokenize all of the text in one pass and then pick out the tokens I need, instead of tokenizing only part of the text.

My earlier concern was that when the text is very large, parsing all of it puts a heavy load on the parser.
So we generally parse only the visible area or the changed text, which is one way to keep the editor responsive.

Thanks again for your answers. In addition, I would like to ask when a NuGet package will be released for this project, which would bring the TextMate parser into the .NET ecosystem. It would be very valuable for us IDE developers!

It takes about 300-400 ms to fully parse a 10,000-line JSON file. Is this cost unavoidable? I don't notice any performance issues when opening large JSON files in VSCode.
Is there a way to get the correct result without parsing the entire content?

If you're using this lib for an IDE, I think the way to go is to write your own smart tokenization support on top of it.

AFAIK VSCode is pretty well optimized.

I think it parses the viewport first (the visible editor lines), and then it parses the whole document in the background, and then updates the viewport again.

When the file is edited, I think it walks backwards to find a good starting stack to use as the starting point for parsing (I'm speaking from memory).
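
The idea of resuming from a known stack can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not VSCode's or TextMateSharp's actual implementation: LineStateCache, TokenizeRange and Invalidate are hypothetical names, and the tokenizer is passed in as a delegate so the sketch does not depend on any particular library version. It caches the state reached at the end of each line, resumes from the nearest cached line when a range is requested, and simply drops the cached states from an edited line onward.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: caches the tokenizer state reached at the end of each line.
// TState stands in for the rule stack (StackElement in this thread);
// TTokens stands in for whatever the tokenizer returns for one line.
public sealed class LineStateCache<TState, TTokens> where TState : class
{
    private readonly Func<string, TState, (TTokens Tokens, TState EndState)> _tokenizeLine;
    private readonly List<TState> _endStates = new List<TState>();

    public LineStateCache(Func<string, TState, (TTokens Tokens, TState EndState)> tokenizeLine)
    {
        _tokenizeLine = tokenizeLine;
    }

    // Number of leading lines whose end state is currently cached.
    public int ValidLineCount => _endStates.Count;

    // Call when line `lineIndex` was edited: states from that line onward are stale.
    public void Invalidate(int lineIndex)
    {
        if (lineIndex < _endStates.Count)
            _endStates.RemoveRange(lineIndex, _endStates.Count - lineIndex);
    }

    // Returns tokens for lines first..last (inclusive, 0-based). Earlier lines are
    // tokenized only if no cached state exists for them yet.
    public List<TTokens> TokenizeRange(IReadOnlyList<string> lines, int first, int last)
    {
        var result = new List<TTokens>();

        // Resume from the closest cached line at or before `first`.
        int start = Math.Min(first, _endStates.Count);
        TState state = start == 0 ? null : _endStates[start - 1];

        for (int i = start; i <= last; i++)
        {
            var (tokens, endState) = _tokenizeLine(lines[i], state);
            state = endState;

            if (i < _endStates.Count) _endStates[i] = endState;
            else _endStates.Add(endState);

            if (i >= first) result.Add(tokens);
        }
        return result;
    }
}
```

With the API shown earlier in this thread, the delegate would presumably be something like (line, stack) => { var r = grammar.TokenizeLine(line, stack); return (r.GetTokens(), r.RuleStack); }. Asking for the viewport range first and the rest of the document later in the background would reuse the same cache. A more refined version could stop re-tokenizing after an edit as soon as a line's new end state equals the previously cached one, rather than discarding everything below the edit.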

This is how VSCode uses the textmate parser, and performs the tokenization:

https://github.com/microsoft/vscode/blob/94c9ea46838a9a619aeafb7e8afd1170c967bb55/src/vs/editor/common/model/textModel.ts#L191

https://github.com/microsoft/vscode/blob/94c9ea46838a9a619aeafb7e8afd1170c967bb55/src/vs/editor/common/model/textModelTokens.ts#L369

About building a NuGet package, this is something we'll do in the short to mid term. Just out of curiosity, which IDE are you building? ;-)

(1) Thank you for the pointers. I will look into how to implement this optimization later.

(2) We are developing a cloud-native IDE based on the .NET platform. The official version has not been released yet.