Bugs with end of file detection
Closed this issue · 1 comments
I found some strange behavior with #'\\Z', and I wonder if it is a bug.
((insta/parser
"Paragraph = NonBlankLine+ BlankLine+
BlankLine = #'[ \\t]'* EOL
NonBlankLine = #'\\S'+ EOL
EOL = (#'\\n' | EOF)
EOF = #'\\Z'")
"abc\ndef\n")
;; The "end of file" is matched before "\n" in the parsed result.
=>
[:Paragraph
[:NonBlankLine "a" "b" "c" [:EOL "\n"]]
[:NonBlankLine "d" "e" "f" [:EOL [:EOF ""]]] ; <-- here
[:BlankLine [:EOL "\n"]]] ; <-- and here
This other approach, which uses negative lookahead, does put the "\n"
in the right place in the result, but there is another problem: the BlankLine
is missing from the result. That may be a bug in instaparse.
((insta/parser
"Paragraph = NonBlankLine+ BlankLine+
BlankLine = #'[ \\t]'* EOL
NonBlankLine = #'\\S'+ EOL
EOL = (#'\\n' | EOF)
EOF = !#'.'")
"abc\ndef\n")
=>
[:Paragraph [:NonBlankLine "a" "b" "c" [:EOL "\n"]]
[:NonBlankLine "d" "e" "f" [:EOL "\n"]]]
;; There is no longer a BlankLine in the result, but the parser says it matches.
I am using version 1.4.9 of instaparse.
In general, I've never used #"\Z". I don't offhand see how it would be useful, since instaparse is always going to try to match against the whole string anyway.
But if that's what you want to do, I think you want to make the Z a lower-case z.
https://stackoverflow.com/questions/2707870/whats-the-difference-between-z-and-z-in-a-regular-expression-and-when-and-how
The upper-case \Z matches both immediately before a final newline character and at the very end of the input, whereas the lower-case \z matches only at the very end. That's Java (and therefore Clojure) behavior:
> (re-seq #"\Z" "\n")
("" "")
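To see where those two matches land on the input from the question, here is a small sketch using plain Java regex interop, with no instaparse involved. The helper name match-positions is made up for illustration. It suggests why the EOF rule could succeed at index 7, right after "def", leaving the trailing "\n" to be consumed by BlankLine.

```clojure
;; Hypothetical helper (not part of instaparse or clojure.core):
;; collect the start index of every match of `re` found in `s`.
(defn match-positions [re s]
  (let [^java.util.regex.Matcher m (re-matcher re s)]
    (loop [acc []]
      (if (.find m)
        (recur (conj acc (.start m)))
        acc))))

;; Same input as in the question: 8 characters, final "\n" at index 7.
(def input "abc\ndef\n")

(match-positions #"\Z" input) ;=> [7 8]  before the final newline, and at the end
(match-positions #"\z" input) ;=> [8]    only at the very end
```

With the lower-case \z there is only one place in the string where the pattern can succeed, so an EOF rule built on it cannot be satisfied before the last newline.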
Some other options available to you are:
- Don't put the EOF inside of EOL. Instead, do
Paragraph = NonBlankLine+ BlankLine+ EOF
- Stick your own unique character at the end of the string before parsing, and match against that as your EOF.
As for the negative lookahead example, there's nothing in your grammar that forces the input to end with an EOF, so the parse it produced is perfectly valid. Also, watch out: in Java/Clojure, by default the . in #"." does not match newline characters.
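A quick REPL-style check of that last point, again using only plain Clojure regexes. The (?s) flag is Java's DOTALL mode. Using it inside the lookahead is a suggestion of mine, not something tested against instaparse here.

```clojure
;; By default, "." does not match a line terminator such as "\n".
(re-find #"." "\n")      ;=> nil

;; With the DOTALL flag (?s), "." matches any character, newline included.
(re-find #"(?s)." "\n")  ;=> "\n"

;; Ordinary characters match either way.
(re-find #"." "a\n")     ;=> "a"
```

So a negative lookahead on #'.' also succeeds right before a newline, not just at the end of the input. A lookahead on #'(?s).' would presumably succeed only when nothing at all is left.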