Regex for variable definition fails on strings that include a semicolon or closing bracket
kdorsel opened this issue · 6 comments
The initialization portion is lost if there's a semicolon inside the string.
My initial idea was a negative lookahead in the Tokenize method:
(?:;(?!.*;))
This negative lookahead would also fail if there's a semicolon in the comment portion of the line.
But that gives me even weirder results, with the assignment operator now showing up.
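To make the behavior of that lookahead concrete, here is a minimal Python sketch. It only exercises the `;(?!.*;)` fragment quoted above, not the project's full regex, and the two sample declarations (`sText`, `nCount`) are made up for illustration.

```python
import re

# Fragment from the thread: a semicolon that has no later semicolon on the same line.
LAST_SEMICOLON = re.compile(r";(?!.*;)")


def last_semicolon_index(line: str) -> int:
    """Return the index of the semicolon the fragment selects, or -1 if none."""
    match = LAST_SEMICOLON.search(line)
    return match.start() if match else -1


# Hypothetical sample: the semicolon inside the string is skipped,
# so the statement terminator is selected.
in_string = "sText : STRING := 'a;b';"

# Hypothetical sample: the comment contains a later semicolon,
# so the fragment selects that one instead of the statement terminator.
in_comment = "nCount : INT := 5; // reset; see docs"

print(last_semicolon_index(in_string), last_semicolon_index(in_comment))
```

The fragment always picks the last semicolon on the line, which is why a semicolon in a trailing comment defeats it.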
Thanks for reporting! I had already noticed strange behavior when there is a closing bracket in the string, `')'`. You can see the failing ones and a few working ones here: https://regex101.com/r/HN9oLJ/3
I tried the following to fix it, but each attempt fails on different declarations:
- Replace `\w+\(.*\)` with `\w+\([^)]\)`. Then `WriteContents: STRING(1) := ')';` passes, but `fbSample : FB_Sample(nId_Init := 11, fIn_Init := 33.44) ;` fails: https://regex101.com/r/HN9oLJ/6
- Replace `\w+\(.*\)` with `\w+\(.?\)`. Then `WriteContents: STRING(1) := ')';` passes, but `fbSample : FB_Sample(nId_Init := 11, fIn_Init := 33.44) ;` also fails: https://regex101.com/r/HN9oLJ/7
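The two attempts above can be reproduced in isolation. The sketch below is Python and tests only the three bracket fragments against the two declarations from the thread; the rest of the real pattern is not shown here, so this illustrates why the fragments behave differently rather than reproducing the project's results.

```python
import re

# The two declarations quoted in the thread.
DECLARATIONS = {
    "string": "WriteContents: STRING(1) := ')';",
    "fb": "fbSample : FB_Sample(nId_Init := 11, fIn_Init := 33.44) ;",
}

# The original fragment and the two attempted replacements.
FRAGMENTS = {
    "greedy": r"\w+\(.*\)",          # runs to the last ')' on the line
    "one_non_paren": r"\w+\([^)]\)",  # exactly one character between the brackets
    "optional_char": r"\w+\(.?\)",    # zero or one character between the brackets
}


def matched_text(fragment_name: str, declaration_name: str):
    """Return the text matched by the named fragment, or None if nothing matches."""
    match = re.search(FRAGMENTS[fragment_name], DECLARATIONS[declaration_name])
    return match.group(0) if match else None


for fragment_name in FRAGMENTS:
    for declaration_name in DECLARATIONS:
        print(fragment_name, declaration_name, matched_text(fragment_name, declaration_name))
```

Both replacements allow at most one character between the brackets, so they handle `STRING(1)` but cannot match the multi-character argument list of `FB_Sample(...)`. The greedy original has the opposite problem: it swallows everything up to the `)` inside the string literal.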
I also tried your negative lookahead, but that doesn't lead to the desired result either: https://regex101.com/r/HN9oLJ/4
It's a tough one.
Ok, this was an interesting one!
This seems to work: add a negative lookbehind/lookahead for the quotes, `(?<!['"])` and `(?!["'])`, and also use the negative lookahead for the semicolon, `(?:;(?!.*;))`. This will still fail if there's a semicolon in the comment.
I also removed the `(?s)` single-line modifier. I'm not too sure what its purpose is, but if it's needed, the negative lookahead just needs to be modified so that the dot doesn't match newlines.
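As a rough illustration of the quote lookarounds, the Python sketch below wraps the closing bracket of the `\w+\(.*\)` fragment in the two assertions. Exactly where the lookarounds sit in the full pattern is an assumption here (the linked regex101 snippet is the authoritative version), and the third sample declaration is hypothetical.

```python
import re

# Assumed placement: the closing bracket must not be directly preceded
# or followed by a quote character.
BRACKETS_OUTSIDE_QUOTES = re.compile(r"""\w+\(.*(?<!['"])\)(?!['"])""")


def bracket_match(line: str):
    """Return the text matched by the fragment, or None if nothing matches."""
    match = BRACKETS_OUTSIDE_QUOTES.search(line)
    return match.group(0) if match else None


# Declarations from the thread.
string_case = "WriteContents: STRING(1) := ')';"
fb_case = "fbSample : FB_Sample(nId_Init := 11, fIn_Init := 33.44) ;"

# Hypothetical declaration: the bracket inside the string is not adjacent to a quote.
longer_string_case = "sText : STRING(3) := 'a)b';"

print(bracket_match(string_case))
print(bracket_match(fb_case))
print(bracket_match(longer_string_case))
```

In this sketch the lookarounds only help when the bracket sits directly next to a quote. A string such as `'a)b'` still trips the greedy match, which is one of the edge cases that a regex-only fix has to keep chasing.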
https://regex101.com/r/HN9oLJ/8
This is my attempt to handle the semicolon in the comment, but it still needs some work...
https://regex101.com/r/HN9oLJ/9
Any other edge cases will surely pop up as needed 😆
Yeah, it's a tricky one 😁. That's why the tests are so convenient. It often happens that a small change to the regex pattern has very large consequences for other variable declarations.
What I usually do is:
- Add a failing test and run it.
- Use regex101.com and/or https://www.debuggex.com/ to find the right pattern with a few examples
- Run the tests again. If they fail, go back to step two. Else 🥳
I think this might be easier to do in two steps. The first step would be to check for a string-like initialization, delimited by either `'` or `"`. If one is found, remove it and apply the current regex to parse out the rest. The string check would have to make sure that FB initializations with string values don't get caught.
The major issue I see right now is supporting those special characters, `)` and `;`, inside a string.
https://regex101.com/r/EzvPSi/2
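One way the two-step idea could look is sketched below in Python. It is a variation on the proposal, not the project's implementation: instead of removing the string, it swaps every string literal for a placeholder, parses the masked line, and puts the literals back. The `DECLARATION` pattern is a simplified stand-in for the current regex, and the `__STRn__` placeholder, the function names, and the assumption that strings contain no escaped quotes are all hypothetical. Comments are not handled.

```python
import re

# Assumption: a string literal is a quoted run without an embedded quote of the same kind.
STRING_LITERAL = re.compile(r"'[^']*'|\"[^\"]*\"")
PLACEHOLDER = re.compile(r"__STR(\d+)__")

# Simplified stand-in for the declaration regex: name, type with optional
# bracketed arguments, optional initialization, terminating semicolon.
DECLARATION = re.compile(
    r"^\s*(?P<name>\w+)\s*:\s*(?P<type>\w+(?:\([^;]*\))?)"
    r"\s*(?::=\s*(?P<init>[^;]*?))?\s*;"
)


def mask_strings(line: str):
    """Step 1: replace each string literal with a placeholder and remember it."""
    literals = []

    def stash(match):
        literals.append(match.group(0))
        return f"__STR{len(literals) - 1}__"

    return STRING_LITERAL.sub(stash, line), literals


def parse_declaration(line: str):
    """Step 2: parse the masked line, then restore the literals in each group."""
    masked, literals = mask_strings(line)
    match = DECLARATION.match(masked)
    if match is None:
        return None

    def restore(text):
        if text is None:
            return None
        return PLACEHOLDER.sub(lambda m: literals[int(m.group(1))], text)

    return {key: restore(value) for key, value in match.groupdict().items()}


print(parse_declaration("WriteContents: STRING(1) := ')';"))
print(parse_declaration("fbSample : FB_Sample(nId_Init := 11, fIn_Init := 33.44) ;"))
```

Because the placeholder contains neither `)` nor `;`, the declaration pattern never sees the special characters inside a string, and string arguments in an FB initialization survive the round trip because every literal is restored afterwards.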
That seems like a good solution. Otherwise the regex will become very complex.
Ok, I'll look at the two step solution and open up a PR.