scala/scala-parser-combinators

Unclosed string literal not properly handled by StdLexical

martingd opened this issue · 6 comments

In scala.util.parsing.combinator.lexical.StdLexical, handling of unclosed/unterminated string literals does not work as expected.

The token parser declared at the top of the code looks like this; it is supposed to handle unclosed string literals by returning ErrorToken(unclosed string literal):

def token: Parser[Token] =
  ( identChar ~ rep( identChar | digit )              ^^ { case first ~ rest => processIdent(first :: rest mkString "") }
  | digit ~ rep( digit )                              ^^ { case first ~ rest => NumericLit(first :: rest mkString "") }
  | '\'' ~ rep( chrExcept('\'', '\n', EofCh) ) ~ '\'' ^^ { case '\'' ~ chars ~ '\'' => StringLit(chars mkString "") }
  | '\"' ~ rep( chrExcept('\"', '\n', EofCh) ) ~ '\"' ^^ { case '\"' ~ chars ~ '\"' => StringLit(chars mkString "") }
  | EofCh                                             ^^^ EOF
  | '\'' ~> failure("unclosed string literal")
  | '\"' ~> failure("unclosed string literal")
  | delim
  | failure("illegal character")
  )

Here is a simple setup trying to use StdLexical:

import scala.util.parsing.combinator.lexical.StdLexical
import scala.util.parsing.input.Reader

object Lexer extends App {
  def lex(input: String) = {
    val lexer = new StdLexical
    var scanner: Reader[lexer.Token] = new lexer.Scanner(input)
    while (!scanner.atEnd) {
      println(scanner.first)
      scanner = scanner.rest
    }
  }
}

Now, calling that with legal input works, here recognising an identifier and a string literal:

> lex(""" hello "world" """)
identifier hello
"world"

Passing an illegal character also works as expected:

> lex(""" hello € "world" """)
identifier hello
ErrorToken(illegal character)
"world"

However, the rule for an unterminated double-quoted (and single-quoted) string does not seem to work; the lexer produces an ErrorToken(end of input) instead of the expected ErrorToken(unclosed string literal):

> lex(""" hello € "unterminated """)
identifier hello
ErrorToken(illegal character)
ErrorToken(end of input)

I guessed the problem was that the rules for unterminated strings use the failure parser, which allows backtracking, but that should only have sent us to the last failure("illegal character"). Moreover, neither inserting a cut (~!) nor using err instead of failure fixes the issue.
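For context, the difference between failure and err in the combinator library is that failure produces a backtrackable Failure result (the | combinator tries the next alternative), whereas err produces a fatal Error result that | passes through unchanged. Here is a toy model of that distinction (our own simplification, not the library's implementation):

```scala
// Toy model of Failure (backtrackable) vs Error (fatal) in parser combinators.
object FailureVsError {
  sealed trait Result
  case class Success(rest: String) extends Result
  case class Fail(msg: String) extends Result // backtrackable, like failure(...)
  case class Err(msg: String) extends Result  // fatal, like err(...)

  case class Parser(run: String => Result) {
    // `|` retries the alternative only on a backtrackable failure.
    def |(other: Parser): Parser = Parser { in =>
      run(in) match {
        case _: Fail => other.run(in)
        case r       => r
      }
    }
  }

  val failP    = Parser(_ => Fail("unclosed string literal"))
  val errP     = Parser(_ => Err("unclosed string literal"))
  val fallback = Parser(in => Success(in))

  def main(args: Array[String]): Unit = {
    println((failP | fallback).run("x")) // Success(x): `|` recovered
    println((errP | fallback).run("x"))  // Err(unclosed string literal): `|` did not recover
  }
}
```

This is why we expected the failure("unclosed string literal") alternatives to at least fall through to failure("illegal character"); yet the lexer reports end of input instead.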

EDIT

Parsing input that ends in a lone quote character returns the expected token:

> lex(""" hello € """")
identifier hello
ErrorToken(illegal character)
ErrorToken(unclosed string literal)

But adding any character to the unclosed string literal causes ErrorToken(end of input) to be emitted.

Have been debugging this with @JakobLyngPetersen and seem to have found the cause. Will make a pull request.

@SethTisue do you use milestones for the project and if yes, which should be assigned to this issue (something like 2.0.1)?

We're not using milestones currently, we're just merging pull requests as they come in and doing releases when someone asks.

@SethTisue we now have a fix for unterminated string literals.

Note about EofCh

First of all, it seems scala.util.parsing.combinator.lexical.Scanners.Scanner will never read past the end of its in: Reader[Char] because it stops at in.atEnd, so the token and whitespace parsers in scala.util.parsing.combinator.lexical.StdLexical will never encounter the EofCh character. However, we have left the mentions of EofCh in the token and whitespace parsers to avoid breaking something we might not fully understand.

Fix for Unclosed Strings

To fix the unclosed strings, instead of having these token rules…

class StdLexical extends Lexical with StdTokens {
  // see `token` in `Scanners`
  def token: Parser[Token] =
    ...
    | '\'' ~ rep( chrExcept('\'', '\n', EofCh) ) ~ '\'' ^^ { case '\'' ~ chars ~ '\'' => StringLit(chars mkString "") }
    | '\"' ~ rep( chrExcept('\"', '\n', EofCh) ) ~ '\"' ^^ { case '\"' ~ chars ~ '\"' => StringLit(chars mkString "") }
    | EofCh                                             ^^^ EOF
    | '\'' ~> failure("unclosed string literal")
    | '\"' ~> failure("unclosed string literal")
    ...
    )
   ...
}

…we now have:

class StdLexical extends Lexical with StdTokens {
  // see `token` in `Scanners`
  def token: Parser[Token] =
    ...
    | '\'' ~> rep( chrExcept('\'', '\n') ) >> { chars => stringEnd('\'', chars) }
    | '\"' ~> rep( chrExcept('\"', '\n') ) >> { chars => stringEnd('\"', chars) }
    | EofCh                                ^^^ EOF
    ...
    )
  ...
  /** Parses the final quote of a string literal or fails if it is unterminated. */
  protected def stringEnd(quoteChar: Char, chars: List[Char]): Parser[Token] = {
    { elem(quoteChar) ^^^ StringLit(chars mkString "") } | err("unclosed string literal")
  }
  ...
}
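The key ingredient is the >> combinator (flatMap): the characters matched so far are passed to stringEnd, so by the time stringEnd runs, the opening quote and the string body have already been consumed and there is no alternative left to backtrack into. A minimal standalone sketch of how >> sequences parsers (a toy model with our own Parser type, not the library code):

```scala
// Toy model of `>>` (flatMap): the second parser is chosen from the first one's result.
object FlatMapDemo {
  case class Parser[A](run: String => Option[(A, String)]) {
    def >>[B](f: A => Parser[B]): Parser[B] = Parser { in =>
      // Run the first parser, then continue with the derived parser on the remainder.
      run(in).flatMap { case (a, rest) => f(a).run(rest) }
    }
  }

  // Consume a run of non-quote characters (like rep(chrExcept('\"', ...))).
  val body: Parser[String] = Parser { in =>
    val (chars, rest) = in.span(_ != '"')
    Some((chars, rest))
  }

  // Analogue of stringEnd: succeed on a closing quote, otherwise report nothing.
  def stringEnd(chars: String): Parser[String] = Parser { in =>
    if (in.startsWith("\"")) Some((chars, in.tail)) else None
  }

  def main(args: Array[String]): Unit = {
    println((body >> stringEnd).run("abc\" tail")) // closed string: Some((abc, tail))
    println((body >> stringEnd).run("abc"))        // unterminated: None
  }
}
```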

This works, breaks no existing tests, makes our new tests happy, and StdLexical now returns ErrorToken(unclosed string literal) for each unterminated string instead of returning multiple ErrorToken(illegal character) tokens.
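With the fix applied, the earlier failing example should now lex as follows (expected output based on the behaviour described above, not a verbatim session):

```
> lex(""" hello € "unterminated """)
identifier hello
ErrorToken(illegal character)
ErrorToken(unclosed string literal)
```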

Fix for Unclosed Multi-Line Comments

We have tried to implement a similar fix for unclosed multi-line comments by changing this

  def whitespace: Parser[Any] = rep[Any](
      whitespaceChar
    | '/' ~ '*' ~ comment
    | '/' ~ '/' ~ rep( chrExcept(EofCh, '\n') )
    | '/' ~ '*' ~ failure("unclosed comment")
    )

to this

  def whitespace: Parser[Any] = rep[Any](
      whitespaceChar
    | '/' ~ '*' ~ comment
    | '/' ~ '/' ~ rep( chrExcept(EofCh, '\n') )
    | '/' ~ '*' ~ rep( elem("", _ => true) ) ~> err("unclosed comment")
    )

This works, makes our new tests happy, and StdLexical now returns one ErrorToken(unclosed comment) when an unclosed comment is encountered (consuming the remaining input).

However, this breaks test scala.util.parsing.combinator.gh56.test1 that for the input:

/* an unclosed comment
of multiple lines
just to check longString/lineContents

expects:

[1.1] failure: identifier expected

/* an unclosed comment
^

but now gets:

[4.7] failure: identifier expected

      
^

Basically the position of the error is reported at the end of the comment instead of at the beginning – probably because the whitespace rule previously did not return an ErrorToken for the unclosed comment, but instead multiple ErrorTokens flagging the opening / and * characters as illegal.

Test scala.util.parsing.combinator.gh56.test2 fails in a similar fashion.

The question is whether the behaviour with the changed comment handling in def whitespace is actually correct. We believe it is, and think the test assertion should be corrected.

What to do?

We could leave the comment fix out, move it to a separate issue, and for now only make a PR for the string issue. What do you think?

@SethTisue I have created a separate issue for unclosed comment handling in StdLexical and also crafted a solution for that. The fixes for #397 (this issue), #399 (unclosed comments handling), and #400 (changes to .gitignore) are all ready, but I have not yet signed the Scala CLA so therefore I have not yet issued pull requests.

fixed by #402