How to force "longest match" in lexer (like (F)lex)

Question

randomouscrap98 opened this issue 3 years ago · 4 comments

I have a simple lexer defined at https://github.com/randomouscrap98/contentapi/blob/master/contentapi/Search/Parser/QueryToken.cs

I have keywords like IN, AND, OR, etc, which I of course don't want to be usable as identifiers (here called FIELD).

In (f)lex, I know that the longest match is the one that is returned, so an identifier called, for instance, notequal will not match the keyword not. But with the latest release, 2.8, it seems that whichever matching rule is defined first always wins, regardless of match length (i.e. the choice is by rule order, not by longest match). As a result, the simple string notequal produces two tokens: NOT=not and FIELD=equal.
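To make the difference concrete, here is a minimal sketch of the two strategies. It is plain Python with the standard re module, not CSLY or (f)lex code. The rule table and the tokenize helper are hypothetical and only mirror the NOT and FIELD names used above.

```python
import re

# Hypothetical rule table in priority order; names mirror the thread, not CSLY's API.
RULES = [
    ("NOT", re.compile(r"not")),
    ("FIELD", re.compile(r"[a-z]+")),
]

def tokenize(text, longest):
    """Return (rule, lexeme) pairs using first-match or longest-match selection."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        candidates = []
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m and m.end() > pos:
                candidates.append((name, m.group()))
                if not longest:
                    break  # first-match: stop at the first rule that succeeds
        if not candidates:
            raise ValueError(f"no rule matches at position {pos}")
        # max() keeps the first maximal element, so ties go to the earlier rule.
        name, lexeme = max(candidates, key=lambda c: len(c[1]))
        tokens.append((name, lexeme))
        pos += len(lexeme)
    return tokens

first_match = tokenize("notequal", longest=False)
longest_match = tokenize("notequal", longest=True)
keyword_alone = tokenize("not", longest=True)
```

With first-match selection, notequal splits into NOT and FIELD, as described above. With longest-match selection it stays a single FIELD, while the bare word not still comes out as NOT because ties go to the rule defined first.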

I can't seem to find a way to change this behavior, or design rules that will work for this. If I move the FIELD definition earlier, then not becomes a field rather than the keyword not.

Am I missing some kind of setting, is this the intended behavior, or is there a way to design this "properly" with this library? For context, this is how it fails: I have a built parser, and to check arbitrary fields for correctness outside of parsing, I sometimes call parser.Lexer.Tokenize(input) (for example at: https://github.com/randomouscrap98/contentapi/blob/master/contentapi/Search/Parser/SearchQueryParser.cs#L35).

If it helps, all of the code dealing with sly is located in the https://github.com/randomouscrap98/contentapi/tree/master/contentapi/Search/Parser directory, so you should be able to pull the code out and create a project without any issues.

Answer 1 · 2021-11-27T08:47:36.000Z

Hello @randomouscrap98 ,

Indeed, the CSLY regex lexer evaluates lexemes in the order they are defined and returns the first one that matches.
In your case you are trapped either way:

if you define the NOT keyword first, it will match the start of notequal, so notequal is never lexed as a single identifier
if you define some kind of identifier first, not will be lexed as an identifier and not as the NOT keyword

I would advise using the generic lexer instead, which natively gives the behavior you expect.

Answer 2 · 2021-11-27T09:13:42.000Z

I will investigate whether this could be a regex lexer parameter.

Answer 3 · 2021-11-27T18:14:40.000Z

Thank you for the quick reply!

There's the word boundary metacharacter \b that I was going to use, but I wasn't sure if the regex is being passed straight to System.Text.RegularExpressions.Regex or if it's transformed in a way that would cause issues. I should try it; I was just really focused on other things last night. I'll get back to you.

Answer 4 · 2021-11-27T18:32:16.000Z

OK, the \b after each keyword worked for my particular use-case. It won't work for everyone's use case of course, as some tokenizers might rely on the longest match for non-word characters. Your investigation for regex settings or otherwise could prove very useful.
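For anyone reading later, here is a rough sketch of the workaround and of its limit. It is plain Python with the standard re module, not CSLY code. The rule table, the tokenize_first_match helper, and the EQ and EQEQ operator rules are hypothetical and added only for illustration.

```python
import re

# Hypothetical first-match rules. The trailing \b stops the keyword from
# matching a prefix of a longer word. EQ and EQEQ are illustrative only.
RULES = [
    ("NOT", re.compile(r"not\b")),
    ("FIELD", re.compile(r"[a-z]+")),
    ("EQ", re.compile(r"=")),
    ("EQEQ", re.compile(r"==")),
]

def tokenize_first_match(text):
    """Return (rule, lexeme) pairs, always taking the first rule that matches."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m and m.end() > pos:
                tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            raise ValueError(f"no rule matches at position {pos}")
    return tokens

identifier = tokenize_first_match("notequal")
keyword = tokenize_first_match("not equal")
operators = tokenize_first_match("a==b")
```

Here notequal stays a single FIELD and not equal gives NOT followed by FIELD. The operator case shows the limit: because EQ is listed before EQEQ, the input a==b yields two EQ tokens, and \b cannot help since = is not a word character. In a first-match lexer that case has to be handled by listing the longer operator first.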

I think there should be a note somewhere in the wiki (unless there already is one) explaining that the regex lexer does not behave like (f)lex: instead of the longest match, it returns whichever matching rule is defined first. I really appreciate the clarification that the order is definitely "first match", as that allowed me to fix my definitions. Thank you!