lark-parser/Lark.js

Incorrect column info for unexpected token exception


Grammar file: https://github.com/opencybersecurityalliance/kestrel-lang/blob/release/src/kestrel/syntax/kestrel.lark
Generated parser:
kestrelParser.js.zip

When parsing the statement var=get, the parser throws an unexpected token exception with

e.line=1
e.column=5

However, the column should be 7.

The column info is also incorrect for the following test strings:
var=get file, e.column is 7, but should be 12.
var=get file from, e.column is 14, but should be 17.
var=get file from abc, e.column is 19, but should be 21.

@erezsh would you please take a look at this issue? Thanks a lot!

I tried the first example you gave, and I got

  token: Token {
    type: '$END',
    start_pos: 8,
    value: '',
    line: 1,
    column: 9,
    end_line: 1,
    end_column: 13,
    end_pos: 12
  },

This is the same answer you get from the Python version.

The reported position is not the end of the file (you can find that easily on your own), but the last valid position the parser was able to reach.

We can argue whether that's the right thing to return or not, but it seems like everything is working in order.

(I don't know why you got 7. Make sure you're using the latest commit)

The version I use is 0.1.3. This is what I got for statement var=get.
(screenshot)

I also tried reinstalling lark-js from the repo with pip3 install -e git+https://github.com/lark-parser/Lark.js.git#egg=lark-js, and the result is the same.

Can you post a reproducing script? (a js file that, when run, reproduces the error. Plus the grammar file ofc)

Sure. The grammar file and the generated parser JS file are attached in the description of this issue.

My parsing code looks like this:

const kestrel_parser = require('./parser/kestrelParser');
const {get_parser, UnexpectedCharacters, UnexpectedToken} = kestrel_parser;
const parser = get_parser({keep_all_tokens: true});
function App() {
  let treeData = null;
  let errorMsg = '';
  function handle_errors(e) {
    console.debug(e.line, e.column)
    if (e instanceof UnexpectedCharacters) {
      if (errorMsg.length === 0) errorMsg = `Unexpected characters "${e.char}" at position ${e.column}`;
    } else if (e instanceof UnexpectedToken) {
      // record only the first error encountered
      if (errorMsg.length === 0) errorMsg = `Unexpected token "${e.token.value}" at ${e.token.type} position ${e.column}, expected ${[...e.expected].join(',')}`;
    } else if (e instanceof SyntaxError) {
      console.debug(e)
    } else {
      console.debug("unknown error:", e.constructor.name)
    }
    // return true to keep parsing
    return true;
  }

  try {
    treeData = parser.parse("var=get", null, handle_errors).children[0];
  } catch (e) {
    console.debug("uncaught error:", e)
  }
}

I don't see the problem?

For var=get file it's 9

For var=get it's 5

Everything seems in order

Okay, so column is the token's "start" position? Hm, then what I need is end_pos. Thanks.

Yes, it's the start of the last valid position, which in this case is the start of the token that caused the error.

(to the best of my memory)
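
For anyone who wants to highlight the whole offending token rather than just its start, here is a minimal sketch. It assumes the token carries the column, end_column and value fields shown in the dump above. tokenColumnSpan is a hypothetical helper, not part of Lark.js, and the sample is a plain object copied from that dump rather than a real Token instance.

```javascript
// Hypothetical helper, not part of Lark.js: derive the column span of a token
// from the fields shown in the token dump earlier in this thread.
function tokenColumnSpan(token) {
  const start = token.column;
  // Prefer end_column when it is present; otherwise fall back to
  // start + value length (the fallback assumes a single-line token).
  const end = Number.isInteger(token.end_column)
    ? token.end_column
    : start + String(token.value || '').length;
  return { start, end };
}

// Plain object with the values from the dump above (assumption: a real
// Token exposes the same field names).
const sample = {
  type: '$END',
  value: '',
  line: 1,
  column: 9,
  end_line: 1,
  end_column: 13,
};

const span = tokenColumnSpan(sample);
console.log(`columns ${span.start}-${span.end}`); // prints "columns 9-13"
```

Inside an error handler like the one posted earlier, this could presumably be called as tokenColumnSpan(e.token) in the UnexpectedToken branch, since that branch already reads e.token.value and e.token.type.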