jqlang/jq

Lexer bug: `e100` mistaken for numeric literal when it should be an identifier

BH1SCW opened this issue · 13 comments

I found a bug.
version: jq-1.5-1-a5b5cbe
os: Ubuntu 16.04 LTS
testcase1:

{
  "test": {
    "e100": {
      "car9": {
        "enabled": 1
      }
    }
  }
}
command: cat test.json| jq '.test.e100.car9'

jq: error: Invalid numeric literal at EOF at line 1, column 5 (while parsing '.e100') at <top-level>, line 1:
.test.e100.car9
jq: error: syntax error, unexpected LITERAL, expecting $end (Unix shell quoting issues?) at <top-level>, line 1:
.test.e100.car9
jq: 2 compile errors

testcase2:
{
  "test": {
    "e100a": {
      "car9": {
        "enabled": 1
      }
    }
  }
}
command: cat test.json| jq '.test.e100a.car9'
This one works.
Please help fix this. Thanks!

I would agree that this is a bug, but you can easily work around it by writing .["e100a"].

The problem arises because jq sees .e100 as part of a numeric literal, and gets confused. That's at least what the error message indicates:

Invalid numeric literal ... (while parsing '.e100') 

The point is that .e100 looks like a floating-point literal.

But I really can't change e100 to e100a; it's a car type. Is it possible to fix this?

Sorry, I meant .["e100"]

That does not work with this command:

cat test.json| jq '.test.["e100"].car9'

jq: error: syntax error, unexpected '[', expecting FORMAT or QQSTRING_START (Unix shell quoting issues?) at <top-level>, line 1:
.test.["e100"].car9
jq: 1 compile error

You really should read the documentation. In the meantime:

$ jq -c '.test | .["e100"] | .car9' test.json
{"enabled":1}

Shame on me ~
Anyway many thanks to you, Boss~

Also good:

$ jq '.test["e100"].car9' test.json

and

$ jq '.test."e100".car9' test.json

This is the bad regexp in src/lexer.l that is causing this:

[0-9.]+([eE][+-]?[0-9]+)? {
   yylval->literal = jv_parse_sized(yytext, yyleng); return LITERAL;
}

It should be:

[0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)? {
   yylval->literal = jv_parse_sized(yytext, yyleng); return LITERAL;
}
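
To see the difference between the two patterns, here is a minimal sketch that translates them into Python's `re` syntax and checks which prefix each one matches. This only approximates flex (Python is not a longest-match engine in general, though it gives the same answer for these simple patterns), and the helper name `matched_prefix` is made up for illustration.

```python
import re

# Python translations of the two flex patterns quoted above.
# Assumption: for these simple patterns, Python's greedy matching and
# flex's longest-match rule pick the same prefix.
OLD_NUMBER = re.compile(r"[0-9.]+([eE][+-]?[0-9]+)?")
NEW_NUMBER = re.compile(r"[0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)?")


def matched_prefix(pattern, text):
    """Return the prefix of text that the pattern matches, or None."""
    m = pattern.match(text)
    return m.group(0) if m else None


if __name__ == "__main__":
    for text in [".e100", "1.5e3", "12.e5"]:
        print(text,
              "old:", matched_prefix(OLD_NUMBER, text),
              "new:", matched_prefix(NEW_NUMBER, text))
```

The old pattern accepts `.e100` as a number because `[0-9.]+` is satisfied by a lone `.`; the new pattern requires a leading digit, so it does not match `.e100` at all. Both still accept `1.5e3`.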

We probably also need a rule that matches invalid numbers and produces an error. E.g., `12.e5` is not a valid number, but with only the above change jq parses that as an attempt to index the number `12` with the key `"e5"`, which is clearly not right:

$ ./jq -n '12.e5'
jq: error (at <unknown>): Cannot index number with string "e5"
$ 

Can you try this patch:

diff --git a/src/lexer.l b/src/lexer.l
index 6b9164b..999ce37 100644
--- a/src/lexer.l
+++ b/src/lexer.l
@@ -86,10 +86,12 @@ struct lexer_param;
   yylval->literal = jv_string_sized(yytext + 1, yyleng - 1); return FORMAT;
 }

-[0-9.]+([eE][+-]?[0-9]+)? {
+[0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)? {
    yylval->literal = jv_parse_sized(yytext, yyleng); return LITERAL;
 }

+[0-9]+\.([eE][+-]?[0-9]+)? { return BADNUM; }
+
 "\"" {
   yy_push_state(IN_QQSTRING, yyscanner);
   return QQSTRING_START;
diff --git a/src/parser.y b/src/parser.y
index 78782dd..f235a7e 100644
--- a/src/parser.y
+++ b/src/parser.y
@@ -47,6 +47,7 @@ struct lexer_param;


 %token INVALID_CHARACTER
+%token BADNUM
 %token <literal> IDENT
 %token <literal> FIELD
 %token <literal> LITERAL
@@ -709,6 +710,10 @@ Term '[' ':' Exp ']' %prec NONOPT {
 LITERAL {
   $$ = gen_const($1);
 } |
+BADNUM {
+  FAIL(@$, "Invalid numeric literal");
+  $$ = gen_noop();
+} |
 String {
   $$ = $1;
 } |

?
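
For anyone wondering why the extra BADNUM rule catches `12.e5` even though LITERAL also matches its first two characters: flex picks the rule with the longest match and breaks ties in favor of the rule listed first. A rough Python sketch of that selection follows. LITERAL and BADNUM are the patterns from the patch; the FIELD pattern is an assumption about how jq lexes `.name` and is not quoted from this thread, and `first_token` is a hypothetical helper.

```python
import re

# Rules in priority order. LITERAL and BADNUM come from the patch above.
# FIELD is an assumed stand-in for jq's ".name" rule, for illustration only.
RULES = [
    ("LITERAL", re.compile(r"[0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)?")),
    ("BADNUM", re.compile(r"[0-9]+\.([eE][+-]?[0-9]+)?")),
    ("FIELD", re.compile(r"\.[a-zA-Z_][a-zA-Z0-9_]*")),
]


def first_token(text):
    """Pick the rule with the longest match at the start of text.

    Earlier rules win ties, mimicking flex's rule-order tie-break.
    Returns (rule_name, matched_text), or None if no rule matches.
    """
    best = None
    for name, pattern in RULES:
        m = pattern.match(text)
        # Strictly longer matches replace the current best, so ties keep
        # the earlier rule.
        if m and (best is None or len(m.group(0)) > len(best[1])):
            best = (name, m.group(0))
    return best


if __name__ == "__main__":
    for text in ["12.e5", "12.5e5", ".e100"]:
        print(text, "->", first_token(text))
```

Under these assumptions `12.e5` is claimed whole by BADNUM (5 characters against LITERAL's 2), `12.5e5` stays a LITERAL, and `.e100` falls through to FIELD because neither number rule can start at a `.`.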

I also hit "parse error: Invalid numeric literal at line 1295060, column 909". This seems like the same error; if not, I will open a new issue.

The input was a file I built with a .py script that combines some logs. That script first failed with "UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 2338: character maps to <undefined>"; adding encoding="utf8" to my open() call allowed it to process the file.
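
A small sketch of that `open()` change, in case it helps someone else. The file contents and the `read_log` helper are made up for illustration, and `cp1252` is only an assumed example of a legacy "charmap" default; the actual default depends on the platform.

```python
import os
import tempfile

# Write a tiny UTF-8 file whose bytes include 0x90
# (Cyrillic "А", U+0410, encodes as the two bytes D0 90).
with tempfile.NamedTemporaryFile("wb", suffix=".log", delete=False) as tmp:
    tmp.write('{"name": "\u0410"}\n'.encode("utf-8"))
    path = tmp.name


def read_log(path, encoding):
    """Read the whole file as text using an explicit encoding."""
    with open(path, encoding=encoding) as f:
        return f.read()


try:
    read_log(path, "cp1252")  # cp1252 has no mapping for byte 0x90
    legacy_failed = False
except UnicodeDecodeError:
    legacy_failed = True

# Passing the encoding explicitly decodes the same bytes without error.
text = read_log(path, "utf-8")
os.remove(path)
```

Stating the encoding explicitly avoids depending on the platform default, which is what the original script was implicitly doing.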

jq is insanely annoying with number interpretation. I don't use jq often, but whenever I do, this trips me up.

Thanks all, this is awesome after long time.