Long tokens with internal newlines are missed by Scanner
Closed this issue · 4 comments
```ruby
str = "+#{'0' * 10000}\n+"
s = EBNF::LL1::Scanner.new(StringIO.new(str))
s.scan(/\+.*\+/m)
# => nil
```
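The failure can be reproduced without the gem. Here is a minimal sketch in plain Ruby (`strscan`/`stringio` only) that simulates the Scanner's bounded buffer; the 100-character window is an arbitrary stand-in for the high-water mark, not the gem's actual default:

```ruby
require 'strscan'
require 'stringio'

# Simulate a scanner that only ever buffers HIGH_WATER characters: a token
# longer than the window can never match, because its closing '+' is never
# in the buffer at the same time as its opening '+'.
HIGH_WATER = 100

input  = StringIO.new("+#{'0' * 10_000}\n+")
window = input.read(HIGH_WATER)              # only the first chunk is visible
token  = StringScanner.new(window).scan(/\+.*\+/m)
token  # => nil
```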
Increasing both `:low_water` and `:high_water` to a size larger than the token fixes this:
```ruby
s = EBNF::LL1::Scanner.new(StringIO.new(str), :low_water => 20000, :high_water => 20000)
s.scan(/\+.*\+/m)
```
When increasing only `:high_water`, there's a possibility that the previous scan/feed will leave us with a `#rest` containing only part of the next token.
I'm not sure of a way to fix this, beyond giving up on low/high water scanning altogether; it seems like this would be a problem for any language with arbitrarily long multiline terminals.
It should be possible to see that the terminal has not been matched and continue loading until it is. But increasing the limits through options passed from the Turtle parser would be a short-term workaround.
> It should be possible to see that the terminal has not been matched and continue loading until it is.
This was an initial thought of mine, as well, and you might be right. The problem I hit was that Lexer scans for each terminal successively, so a naive implementation would end up loading the whole input stream the first time any terminal is not matched.
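For illustration, here is a naive refill loop (a hypothetical sketch, not the gem's API): on a miss it reads another chunk and retries, which means a terminal that never occurs drags the entire stream into memory before we can give up.

```ruby
require 'strscan'
require 'stringio'

# Hypothetical "feed until match" strategy: on a miss, read another chunk
# and retry. If the terminal simply never occurs, every chunk is loaded
# before we can give up.
def scan_with_refill(io, regexp, chunk = 64)
  buffer = +''
  loop do
    token = StringScanner.new(buffer).scan(regexp)
    return token if token
    more = io.read(chunk)
    return nil if more.nil?      # EOF: the whole stream is now in memory
    buffer << more
  end
end

long_token = scan_with_refill(StringIO.new("+#{'0' * 10_000}\n+"), /\+.*\+/m)
miss_io    = StringIO.new('no plus here')
missed     = scan_with_refill(miss_io, /\+.*\+/m)
miss_io.eof?   # true: the miss consumed the entire input
```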
Trying to scan all the terminals, and then feeding more of the input, fails on cases like `"""[long string]"""`, where the initial `""` matches an empty string in Turtle and is chomped off the input before we feed it more.
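That premature chomp can be shown with `StringScanner` alone; `short_string` below is a simplified stand-in for Turtle's `STRING_LITERAL_QUOTE` terminal (escapes ignored):

```ruby
require 'strscan'

# Simplified stand-in for Turtle's STRING_LITERAL_QUOTE (no escapes).
short_string = /"[^"]*"/

s   = StringScanner.new('"""a long string"""')
tok = s.scan(short_string)   # matches '""' -- an empty short string
s.rest                       # => '"a long string"""' -- already chomped
```

Once the leading `""` is consumed, feeding more input cannot restore the long-string match.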
Ideas?
We could add a terminal definition option for the long strings to cause them to keep loading. This still leaves the potential for loading the rest of the file if the literal is poorly formed, but any processor would have the same problem.
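A sketch of how such an option might behave (all names hypothetical; `LONG_STRING` and `UNTERMINATED` are simplified versions of Turtle's long-string grammar, ignoring escapes): refill only while the buffer still looks like the prefix of a long string, so a well-formed literal loads no more than itself, while an unclosed one still reads to EOF.

```ruby
require 'strscan'
require 'stringio'

# Simplified long-string terminal (real Turtle also allows escapes), plus
# a hypothetical companion pattern that recognizes an *unterminated* long
# string filling the whole buffer.
LONG_STRING  = /"""(?:[^"]|"(?!""))*"""/m
UNTERMINATED = /\A"""(?:[^"]|"(?!""))*\z/m

# Hypothetical keep-loading rule: refill only while the buffer still looks
# like the prefix of a long string, so a well-formed literal loads no more
# than itself; an unterminated one still drags in the rest of the stream.
def scan_long_string(io, chunk = 64)
  buffer = +''
  loop do
    token = StringScanner.new(buffer).scan(LONG_STRING)
    return token if token
    break unless buffer.empty? || buffer.match?(UNTERMINATED)
    more = io.read(chunk)
    break if more.nil?           # EOF: the literal was never closed
    buffer << more
  end
  nil
end

scan_long_string(StringIO.new(%("""#{'x' * 500}""" rest)))
```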