Unclosed string literal not properly handled by StdLexical
martingd opened this issue · 6 comments
In `scala.util.parsing.combinator.lexical.StdLexical`, the handling of unclosed / unterminated string literals does not seem to work as expected.
The token parser declared at the top of the code looks like this and is supposed to handle unclosed string literals by returning the value `ErrorToken(unclosed string literal)`:
```scala
def token: Parser[Token] =
  ( identChar ~ rep( identChar | digit ) ^^ { case first ~ rest => processIdent(first :: rest mkString "") }
  | digit ~ rep( digit ) ^^ { case first ~ rest => NumericLit(first :: rest mkString "") }
  | '\'' ~ rep( chrExcept('\'', '\n', EofCh) ) ~ '\'' ^^ { case '\'' ~ chars ~ '\'' => StringLit(chars mkString "") }
  | '\"' ~ rep( chrExcept('\"', '\n', EofCh) ) ~ '\"' ^^ { case '\"' ~ chars ~ '\"' => StringLit(chars mkString "") }
  | EofCh ^^^ EOF
  | '\'' ~> failure("unclosed string literal")
  | '\"' ~> failure("unclosed string literal")
  | delim
  | failure("illegal character")
  )
```
Here is a simple setup for exercising `StdLexical`:
```scala
import scala.util.parsing.combinator.lexical.StdLexical
import scala.util.parsing.input.Reader

object Lexer extends App {
  def lex(input: String) = {
    val lexer = new StdLexical
    var scanner: Reader[lexer.Token] = new lexer.Scanner(input)
    while (!scanner.atEnd) {
      println(scanner.first)
      scanner = scanner.rest
    }
  }
}
```
Now, calling that with legal input works; here it recognises an identifier and a string literal:

```
> lex(""" hello "world" """)
identifier hello
"world"
```
Passing an illegal character also works as expected:

```
> lex(""" hello € "world" """)
identifier hello
ErrorToken(illegal character)
"world"
```
However, the rule for an unterminated double-quoted (and single-quoted) string does not seem to work, and the lexer produces an `ErrorToken(end of input)` instead of the expected `ErrorToken(unclosed string literal)`:

```
> lex(""" hello € "unterminated """)
identifier hello
ErrorToken(illegal character)
ErrorToken(end of input)
```
I guessed the problem was that the rules for unterminated strings use the `failure` parser, which allows backtracking; but backtracking should merely have sent us to the last alternative, `failure("illegal character")`. Incidentally, neither inserting a cut (`~!`) nor using `err` instead of `failure` fixes the issue.
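To make the backtracking point concrete, here is a minimal, self-contained toy model (our own sketch, not the library's internals) of the distinction between a backtrackable failure and a committed error, which is why `|` retries later alternatives after `failure` but stops after `err`:

```scala
// Toy model of scala-parser-combinators' Failure vs Error semantics.
sealed trait Result
case class Good(value: String, rest: String) extends Result
case class Fail(msg: String) extends Result  // backtrackable, like failure()
case class Fatal(msg: String) extends Result // committed, like err()

case class P(run: String => Result) {
  // Alternation: retry the right-hand side only on a backtrackable Fail.
  def |(other: => P): P = P { in =>
    run(in) match {
      case _: Fail => other.run(in)
      case r       => r // Good or Fatal short-circuits
    }
  }
}

def lit(s: String): P = P { in =>
  if (in.startsWith(s)) Good(s, in.drop(s.length)) else Fail(s"expected $s")
}
def failure(msg: String): P = P(_ => Fail(msg))
def err(msg: String): P = P(_ => Fatal(msg))

val withFailure = lit("a") | failure("no a") | lit("b")
val withErr     = lit("a") | err("no a")     | lit("b")

println(withFailure.run("b")) // Good(b,) – the later alternative is still tried
println(withErr.run("b"))     // Fatal(no a) – alternatives after err are skipped
```

This matches the observation above: swapping `failure` for `err` changes which error wins, but cannot by itself explain why the unclosed-string alternative is never reached in the first place.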
EDIT
Parsing input with a lone quote character at the very end returns the expected token:
```
> lex(""" hello € """")
identifier hello
ErrorToken(illegal character)
ErrorToken(unclosed string literal)
```
But adding any character after the opening quote causes `ErrorToken(end of input)` to be emitted.
Have been debugging this with @JakobLyngPetersen and seem to have found the cause. Will make a pull request.
@SethTisue do you use milestones for the project and if yes, which should be assigned to this issue (something like 2.0.1)?
We're not using milestones currently, we're just merging pull requests as they come in and doing releases when someone asks.
@SethTisue we now have a fix for unterminated string literals.
Note about EofCh
First of all, it seems `scala.util.parsing.combinator.lexical.Scanners.Scanner` will never read past the end of its `in: Reader[Char]`, because it stops at `in.atEnd`; so the `token` and `whitespace` parsers in `scala.util.parsing.combinator.lexical.StdLexical` will never encounter the `EofCh` character. However, we have left the mentions of `EofCh` in the `token` and `whitespace` parsers to avoid breaking something we might not fully understand.
Fix for Unclosed Strings
To fix the unclosed strings, instead of having these token rules…
```scala
class StdLexical extends Lexical with StdTokens {
  // see `token` in `Scanners`
  def token: Parser[Token] =
    ...
    | '\'' ~ rep( chrExcept('\'', '\n', EofCh) ) ~ '\'' ^^ { case '\'' ~ chars ~ '\'' => StringLit(chars mkString "") }
    | '\"' ~ rep( chrExcept('\"', '\n', EofCh) ) ~ '\"' ^^ { case '\"' ~ chars ~ '\"' => StringLit(chars mkString "") }
    | EofCh ^^^ EOF
    | '\'' ~> failure("unclosed string literal")
    | '\"' ~> failure("unclosed string literal")
    ...
    )
  ...
}
```
…we now have:
```scala
class StdLexical extends Lexical with StdTokens {
  // see `token` in `Scanners`
  def token: Parser[Token] =
    ...
    | '\'' ~> rep( chrExcept('\'', '\n') ) >> { chars => stringEnd('\'', chars) }
    | '\"' ~> rep( chrExcept('\"', '\n') ) >> { chars => stringEnd('\"', chars) }
    | EofCh ^^^ EOF
    ...
    )
  ...

  /** Parses the final quote of a string literal or fails if it is unterminated. */
  protected def stringEnd(quoteChar: Char, chars: List[Char]): Parser[Token] = {
    { elem(quoteChar) ^^^ StringLit(chars mkString "") } | err("unclosed string literal")
  }
  ...
}
```
This works, breaks no existing tests, makes our new tests happy, and `StdLexical` now returns `ErrorToken(unclosed string literal)` for each unterminated string instead of returning multiple `ErrorToken(illegal character)` tokens.
Fix for Unclosed Multi-Line Comments
We have tried to implement a similar fix for unclosed multi-line comments by changing this:

```scala
def whitespace: Parser[Any] = rep[Any](
    whitespaceChar
  | '/' ~ '*' ~ comment
  | '/' ~ '/' ~ rep( chrExcept(EofCh, '\n') )
  | '/' ~ '*' ~ failure("unclosed comment")
  )
```

to this:

```scala
def whitespace: Parser[Any] = rep[Any](
    whitespaceChar
  | '/' ~ '*' ~ comment
  | '/' ~ '/' ~ rep( chrExcept(EofCh, '\n') )
  | '/' ~ '*' ~ rep( elem("", _ => true) ) ~> err("unclosed comment")
  )
```
This works, makes our new tests happy, and `StdLexical` now returns a single `ErrorToken(unclosed comment)` when an unclosed comment is encountered (consuming the remaining input).
However, this breaks the test `scala.util.parsing.combinator.gh56.test1`, which for the input:

```
/* an unclosed comment
of multiple lines
just to check longString/lineContents
```

expects:

```
[1.1] failure: identifier expected

/* an unclosed comment
^
```

but now gets:

```
[4.7] failure: identifier expected

^
```
Basically, the position of the error is reported at the end of the comment instead of at the beginning, probably because the whitespace rule previously did not return an `ErrorToken` for the unclosed comment, but instead multiple `ErrorToken`s flagging the opening `/` and `*` characters as illegal.
Test `scala.util.parsing.combinator.gh56.test2` fails in a similar fashion.
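The shift in reported position can be illustrated with a small, self-contained sketch (a toy offset-to-position conversion of our own, not the library's `Position` implementation; the exact column the library reports depends on what follows the comment). The old rule failed at the first offending character, while the new rule consumes everything before `err` fires, so the error's offset is the end of the input:

```scala
// Toy offset -> (line, column) conversion, 1-based like the
// [line.column] form in the failure messages above.
def lineCol(input: String, offset: Int): (Int, Int) = {
  val before = input.take(offset)
  val line = before.count(_ == '\n') + 1
  val col  = offset - (before.lastIndexOf('\n') + 1) + 1
  (line, col)
}

val input = "/* an unclosed comment\nof multiple lines\njust to check longString/lineContents\n"

// Old behaviour: the opening '/' is flagged immediately, at offset 0.
println(lineCol(input, 0))            // (1,1) – reported as [1.1]
// New behaviour: rep(elem("", _ => true)) consumes to EOF before err fires.
println(lineCol(input, input.length)) // (4,1) – a position on the line after the comment
```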
The question is whether the behaviour under the changed comment handling in `def whitespace` is actually correct. We believe so, and think the test assertions should be updated accordingly.
What to do?
We could leave out the comment fix and put it in a separate issue and only make a PR for the string issue for now. What do you think?
@SethTisue I have created a separate issue for unclosed-comment handling in `StdLexical` and also crafted a solution for that. The fixes for #397 (this issue), #399 (unclosed comment handling), and #400 (changes to `.gitignore`) are all ready, but I have not yet signed the Scala CLA, so I have not yet opened pull requests.