Engelberg/instaparse

Bugs with end of file detection


I found some strange behavior with #'\\Z' and wonder whether it is a bug.

((insta/parser
   "Paragraph = NonBlankLine+ BlankLine+
    BlankLine = #'[ \\t]'* EOL
    NonBlankLine = #'\\S'+ EOL
    EOL = (#'\\n' | EOF)
    EOF = #'\\Z'")
 "abc\ndef\n")

;; The "end of file" is matched before "\n" in the parsed result.
=> 
[:Paragraph
 [:NonBlankLine "a" "b" "c" [:EOL "\n"]]
 [:NonBlankLine "d" "e" "f" [:EOL [:EOF ""]]] ; <-- here
 [:BlankLine [:EOL "\n"]]]                    ; <-- and here

This other approach, which uses negative lookahead, does put the "\n" in the right place in the result, but there is another problem: the BlankLine is missing from the result. That may be a bug in instaparse.

((insta/parser
   "Paragraph = NonBlankLine+ BlankLine+
    BlankLine = #'[ \\t]'* EOL
    NonBlankLine = #'\\S'+ EOL
    EOL = (#'\\n' | EOF)
    EOF = !#'.'")
 "abc\ndef\n")

=>
[:Paragraph [:NonBlankLine "a" "b" "c" [:EOL "\n"]]
            [:NonBlankLine "d" "e" "f" [:EOL "\n"]]]
;; There is no BlankLine in the result anymore, but the parser says it matches.

I am using version 1.4.9 of instaparse.

In general, I've never used #"\Z". I don't offhand see how it would be useful, since instaparse is always going to try to match against the whole string anyway.
But if that's what you want to do, I think you want to make the Z a lower-case z.
https://stackoverflow.com/questions/2707870/whats-the-difference-between-z-and-z-in-a-regular-expression-and-when-and-how

The upper-case one matches both before and after the final newline character. That's Java (and therefore Clojure) behavior:

> (re-seq #"\Z" "\n")
("" "")
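
To see the difference, the same REPL check can be run against the original input with both anchors. This is plain Clojure regex behavior and does not involve instaparse:

```clojure
;; \Z matches just before a final line terminator and again at the very end.
(re-seq #"\Z" "abc\ndef\n")
;; => ("" "")

;; \z matches only at the very end of the input.
(re-seq #"\z" "abc\ndef\n")
;; => ("")
```

Since \z has only one position where it can match, an EOF rule built on it should not be able to fire before the trailing "\n".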

Some other options available to you are:

  • Don't put the EOF inside of EOL. Instead, do Paragraph = NonBlankLine+ BlankLine+ EOF
  • Stick your own unique character at the end of the string before parsing, and match against that as your EOF.
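
A minimal sketch of the first option, assuming the rest of the grammar stays as in the original report. The name paragraph-parser is only for illustration, and the lower-case \z is swapped in for EOF. Note that once EOL is just a newline, BlankLine+ needs a real blank line in the input, so "abc\ndef\n" on its own no longer parses:

```clojure
(require '[instaparse.core :as insta])

;; EOL is now only a newline; EOF appears once, at the end of Paragraph.
(def paragraph-parser
  (insta/parser
   "Paragraph = NonBlankLine+ BlankLine+ EOF
    BlankLine = #'[ \\t]'* EOL
    NonBlankLine = #'\\S'+ EOL
    EOL = #'\\n'
    EOF = #'\\z'"))

;; Input with an explicit blank line after the text.
(paragraph-parser "abc\ndef\n\n")

;; No blank line at the end, so BlankLine+ cannot match; this is a failure.
(insta/failure? (paragraph-parser "abc\ndef\n"))
```

Because instaparse tries to consume the whole string anyway, the trailing EOF here mostly serves as documentation of intent.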

As for the negative lookahead example, there's nothing in your grammar to force that it must end with an EOF, so the parse it produced is perfectly valid. Also, watch out: in Java/Clojure, the default behavior of #"." is that the . does not match newline characters, so !#'.' also succeeds right before a "\n", not only at the end of the string.
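
The newline caveat is easy to check at the REPL. The (?s) flag and the [\s\S] class below are standard java.util.regex features; whether either of them belongs inside the lookahead depends on your grammar, so treat this as a sketch rather than a recommended fix:

```clojure
;; By default `.` does not match line terminators.
(re-find #"." "\n")
;; => nil

;; DOTALL mode, enabled with the (?s) flag, lets `.` match them.
(re-find #"(?s)." "\n")
;; => "\n"

;; An explicit "any character" class behaves the same way without a flag.
(re-find #"[\s\S]" "\n")
;; => "\n"
```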