Long tokens with internal newlines are missed by Scanner
Closed this issue · 4 comments
```ruby
str = "+#{'0' * 10000}\n+"
s = EBNF::LL1::Scanner.new(StringIO.new(str))
s.scan(/\+.*\+/m)
# => nil
```
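The failure can be reproduced without the gem. Here is a minimal sketch in plain Ruby (`strscan`/`stringio` only) that simulates the Scanner's bounded buffer; the 100-character window is an arbitrary stand-in for the high-water mark, not the gem's actual default:

```ruby
require 'strscan'
require 'stringio'

# Simulate a scanner that only ever buffers HIGH_WATER characters: a token
# longer than the window can never match, because its closing '+' is never
# in the buffer at the same time as its opening '+'.
HIGH_WATER = 100

input  = StringIO.new("+#{'0' * 10_000}\n+")
window = input.read(HIGH_WATER)              # only the first chunk is visible
token  = StringScanner.new(window).scan(/\+.*\+/m)
token  # => nil
```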
Increasing both `:low_water` and `:high_water` to a size larger than the token fixes this:
```ruby
s = EBNF::LL1::Scanner.new(StringIO.new(str), :low_water => 20000, :high_water => 20000)
s.scan(/\+.*\+/m)
```
When increasing only `:high_water`, there's a possibility that the previous scan/feed will leave us with a `#rest` containing only part of the next token.
I'm not sure of a way to fix this, beyond giving up on low/high water scanning altogether; it seems like this would be a problem for any language with arbitrarily long multiline terminals.
It should be possible to see that the terminal has not been matched and continue loading until it is. But increasing the limits through options passed from the Turtle parser would be a short-term workaround.
> It should be possible to see that the terminal has not been matched and continue loading until it is.
This was an initial thought of mine, as well, and you might be right. The problem I hit was that Lexer scans for each terminal successively, so a naive implementation would end up loading the whole input stream the first time any terminal is not matched.
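For illustration, here is a naive refill loop (a hypothetical sketch, not the gem's API): on a miss it reads another chunk and retries, which means a terminal that never occurs drags the entire stream into memory before we can give up.

```ruby
require 'strscan'
require 'stringio'

# Hypothetical "feed until match" strategy: on a miss, read another chunk
# and retry. If the terminal simply never occurs, every chunk is loaded
# before we can give up.
def scan_with_refill(io, regexp, chunk = 64)
  buffer = +''
  loop do
    token = StringScanner.new(buffer).scan(regexp)
    return token if token
    more = io.read(chunk)
    return nil if more.nil?      # EOF: the whole stream is now in memory
    buffer << more
  end
end

long_token = scan_with_refill(StringIO.new("+#{'0' * 10_000}\n+"), /\+.*\+/m)
miss_io    = StringIO.new('no plus here')
missed     = scan_with_refill(miss_io, /\+.*\+/m)
miss_io.eof?   # true: the miss consumed the entire input
```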
Trying to scan all the terminals, and then feeding more of the input, fails on cases like `"""[long string]"""`, where the initial `""` matches an empty string in Turtle and is chomped off the input before we feed it more.
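That premature chomp can be shown with `StringScanner` alone; `short_string` below is a simplified stand-in for Turtle's `STRING_LITERAL_QUOTE` terminal (escapes ignored):

```ruby
require 'strscan'

# Simplified stand-in for Turtle's STRING_LITERAL_QUOTE (no escapes).
short_string = /"[^"]*"/

s   = StringScanner.new('"""a long string"""')
tok = s.scan(short_string)   # matches '""' -- an empty short string
s.rest                       # => '"a long string"""' -- already chomped
```

Once the leading `""` is consumed, feeding more input cannot restore the long-string match.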
Ideas?
We could add a terminal definition option for the long strings to cause them to keep loading. This still leaves the potential for loading the rest of the file if the literal is poorly formed, but any processor would have the same problem.
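A sketch of how such an option might behave (all names hypothetical; `LONG_STRING` and `UNTERMINATED` are simplified versions of Turtle's long-string grammar, ignoring escapes): refill only while the buffer still looks like the prefix of a long string, so a well-formed literal loads no more than itself, while an unclosed one still reads to EOF.

```ruby
require 'strscan'
require 'stringio'

# Simplified long-string terminal (real Turtle also allows escapes), plus
# a hypothetical companion pattern that recognizes an *unterminated* long
# string filling the whole buffer.
LONG_STRING  = /"""(?:[^"]|"(?!""))*"""/m
UNTERMINATED = /\A"""(?:[^"]|"(?!""))*\z/m

# Hypothetical keep-loading rule: refill only while the buffer still looks
# like the prefix of a long string, so a well-formed literal loads no more
# than itself; an unterminated one still drags in the rest of the stream.
def scan_long_string(io, chunk = 64)
  buffer = +''
  loop do
    token = StringScanner.new(buffer).scan(LONG_STRING)
    return token if token
    break unless buffer.empty? || buffer.match?(UNTERMINATED)
    more = io.read(chunk)
    break if more.nil?           # EOF: the literal was never closed
    buffer << more
  end
  nil
end

scan_long_string(StringIO.new(%("""#{'x' * 500}""" rest)))
```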