`JsonLocation` consistently off by one character for many invalid JSON parsing cases
hal7df opened this issue · 9 comments
Issue
The JsonLocation
attached to a JsonProcessingException
thrown when parsing an invalid JSON string is consistently one character to the right of the invalid character, except in cases where the error is due to an unexpected EOF. This affects both JsonLocation.getCharOffset()
and JsonLocation.getColumnNr()
(presumably JsonLocation.getByteOffset()
as well, although I haven't tested that explicitly); because this information is used to construct the exception message, that is affected as well.
I first noticed this on Jackson 2.10.0, but the issue persists on Jackson 2.16.0.
Minimally reproducible example
package example;
import com.fasterxml.jackson.core.JsonLocation;
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
public class JsonLocationTest {
public static void main(String[] args) {
for (String input : INVALID_JSON) {
try {
MAPPER.readTree(input);
} catch (JsonProcessingException jpe) {
JsonLocation location = jpe.getLocation();
System.out.printf(
"%s (at line %d, column %d):%n %s%n ",
jpe.getOriginalMessage(),
location.getLineNr(),
location.getColumnNr(),
input
);
// annotate the location of the error on the previous line
for (int i = 0; i < location.getCharOffset(); ++i) {
System.out.print(' ');
}
System.out.println('^');
}
}
}
private static final String[] INVALID_JSON = new String[] {
"{\"invalid\" \"json\"}",
"{\"also\", \"invalid\"}",
"{\"still\":[\"invalid\"[]]}",
"[\"json\":\"doesn't\",\"work\":\"like\",\"this\"]",
"{\"really\": \"funky💃\" \"unicode\"}",
"{",
"}"
};
private static final ObjectMapper MAPPER = new ObjectMapper();
}
Expected output
Unexpected character ('"' (code 34)): was expecting a colon to separate field name and value (at line 1, column 12):
{"invalid" "json"}
^
Unexpected character (',' (code 44)): was expecting a colon to separate field name and value (at line 1, column 8):
{"also", "invalid"}
^
Unexpected character ('[' (code 91)): was expecting comma to separate Array entries (at line 1, column 20):
{"still":["invalid"[]]}
^
Unexpected character (':' (code 58)): was expecting comma to separate Array entries (at line 1, column 8):
["json":"doesn't","work":"like","this"]
^
Unexpected character ('"' (code 34)): was expecting comma to separate Object entries (at line 1, column 22):
{"really": "funky💃" "unicode"}
^
Unexpected end-of-input: expected close marker for Object (start marker at [Source: REDACTED (`StreamReadFeature.INCLUDE_SOURCE_IN_LOCATION` disabled); line: 1, column: 1]) (at line 1, column 2):
{
^
Unexpected close marker '}': expected ']' (for root starting at [Source: REDACTED (`StreamReadFeature.INCLUDE_SOURCE_IN_LOCATION` disabled); line: 1]) (at line 1, column 1):
}
^
Actual output
(The unexpected end-of-input case here is correct IMO, just including it to be thorough)
Unexpected character ('"' (code 34)): was expecting a colon to separate field name and value (at line 1, column 13):
{"invalid" "json"}
^
Unexpected character (',' (code 44)): was expecting a colon to separate field name and value (at line 1, column 9):
{"also", "invalid"}
^
Unexpected character ('[' (code 91)): was expecting comma to separate Array entries (at line 1, column 21):
{"still":["invalid"[]]}
^
Unexpected character (':' (code 58)): was expecting comma to separate Array entries (at line 1, column 9):
["json":"doesn't","work":"like","this"]
^
Unexpected character ('"' (code 34)): was expecting comma to separate Object entries (at line 1, column 23):
{"really": "funky💃" "unicode"}
^
Unexpected end-of-input: expected close marker for Object (start marker at [Source: REDACTED (`StreamReadFeature.INCLUDE_SOURCE_IN_LOCATION` disabled); line: 1, column: 1]) (at line 1, column 2):
{
^
Unexpected close marker '}': expected ']' (for root starting at [Source: REDACTED (`StreamReadFeature.INCLUDE_SOURCE_IN_LOCATION` disabled); line: 1]) (at line 1, column 2):
}
^
Sounds like a flaw indeed. This is likely due to code not compensating for already read (invalid) character.
Unfortunately this may require fixing in quite a few places, both wrt:
- Cases of unexpected/invalid character (dozens of places per parser)
- 4 different parsing backends (... for JSON); byte-based, char-based, DataInput-backed, async
One thing that would be useful as the first step would be addition of (failing) test cases.
There are a few location-checking tests under src/test/java/com/fasterxml/jackson/core/read/
; tests with names like LocationXXX.java
@hal7df Thank you for the tests!
One thing I forgot to ask earlier: if you haven't yet done so (I don't think you have but just in case if you did just LMK), we'd need a CLA:
https://github.com/FasterXML/jackson/blob/master/contributor-agreement.pdf
which is needed for the first contribution (but is good for all future contributions, so one time chore).
It's usually easiest to do by printing, filling & signing, scanning/photo, email document to cla
at fasterxml dot com.
If you could send it whenever you have a change that'd be great. Apologies for forgetting to mention this earlier; looking forwards to all the contributions!
And thank you once again for the unit tests.
@cowtowncoder I did notice the mention of the CLA in the contributing doc in the main Jackson repo, but it appeared to make an exception for test code changes. If you'd like one nonetheless I can probably get one to you after sometime the holidays (my employer is involved, so I'd have to run it up the chain when everyone gets back).
@hal7df Yeah you are right that CLA is really needed for code that we ship, and test code isn't covered.
So no hurry -- we just need it for non-test code contributions. We can until there's a PR for such.
In the meantime I've been able to fix many cases, as well as realized that there's one trickier case (that of unrecognized tokens).
Fixed most cases; 3 fail in a way that is bit trickier to fix and may need refactoring of decoding -- those I'll tackle for 2.17.
And one case wrt Unicode characters that affects byte-backed and async will also need bit more thought.
But the bulk have been resolved for 2.16 branch, to be included in eventual 2.16.2.
@hal7df Ok. So, turns out that about 2/3 are simple cases where last character just needs to be "unread" (offset adjusted) and things will work.
3 cases are bit more involved since they read invalid token; and while in theory location could be constructed before the first character is read, that'd be wasteful for the common case of not needing location. So will preferably try to adjust location after the fact. This, however, may be challenging for case of buffer boundaries etc.
But one failing case cannot be solved per se but needs to be document (if not already done): one showing case of non-ASCII Unicode character, Column number for byte-based input.
For byte-based input, column offset is (and has to remain) byte-based offset and not character based.
Ugggh. Fix here breaks something else -- ability to continue parsing in some cases with error recovery (single-character syntax problems). Only caught by jackson-jr
test, will need to add a test or two in jackson-core
itself and figure out if there is a way to resolve this problem.
Had to revert initial fix; this will also mean eventual fix will only go in 2.17, not earlier.
Added new test against regression wrt error recovery.
Hoping to fix in 2.17 very soon now.