XMLStreamReader's getCharacterOffset() with byte stream returns character offset instead of byte offset

Question

midrare opened this issue 3 years ago · 3 comments

According to Oracle docs for Location.getCharacterOffset():

int getCharacterOffset()
Return the byte or character offset into the input source this location is pointing to. If the input source is a file or a byte stream then this is the byte offset into that stream, but if the input source is a character media then the offset is the character offset. Returns -1 if there is no offset available.

With Woodstox 6.2.7, if you construct an XMLStreamReader from a ByteArrayInputStream, then reader.getLocation().getCharacterOffset() returns the character offset instead of the byte offset that the documentation calls for.
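A minimal sketch of how the discrepancy could be observed, assuming a UTF-8 document that contains two-byte characters so that byte and character offsets diverge. It uses only the standard javax.xml.stream API; the class and method names (OffsetRepro, reportedOffsetAtChild, and so on) are hypothetical, and which parser actually answers depends on what is on the classpath (Woodstox needs to be present for it to be picked up).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class OffsetRepro {
    // Three 'é' characters: 1 char each, but 2 bytes each in UTF-8.
    static final String XML = "<root>\u00e9\u00e9\u00e9<child/></root>";

    /** Character (UTF-16 code unit) index at which "<child" starts. */
    static int charOffsetOfChild() {
        return XML.indexOf("<child");
    }

    /** Byte index at which "<child" starts in the UTF-8 encoding. */
    static int byteOffsetOfChild() {
        return XML.substring(0, charOffsetOfChild())
                  .getBytes(StandardCharsets.UTF_8).length;
    }

    /**
     * Offset the parser reports while positioned on the child element.
     * The exact position an implementation reports for an event may vary;
     * the point is only to compare it against the two values above.
     */
    static int reportedOffsetAtChild() throws XMLStreamException {
        byte[] bytes = XML.getBytes(StandardCharsets.UTF_8);
        XMLStreamReader reader = XMLInputFactory.newFactory()
                .createXMLStreamReader(new ByteArrayInputStream(bytes));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "child".equals(reader.getLocalName())) {
                    return reader.getLocation().getCharacterOffset();
                }
            }
            return -1; // element not found
        } finally {
            reader.close();
        }
    }
}
```

Printing all three values side by side should show whether the reported offset tracks the character count or the byte count.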

Answer 1 · 2022-01-18T20:22:26.000Z

I think this is, unfortunately, just the way things work if and when the InputStream is accessed under the hood by constructing an InputStreamReader (or similar), in which case the parser itself uses a character-based input source. If so, it is impossible to reliably get access to byte-based offsets.

Woodstox does not decode directly from InputStream or other byte-sources so this is a fundamental limitation that probably cannot be resolved.
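To illustrate the limitation (this is a sketch, not Woodstox's actual code): once bytes pass through an InputStreamReader, a consumer only ever sees decoded chars, so the only thing it can count is characters. If the caller still holds the original text and knows the encoding, a byte offset can be recomputed from the character offset. The class and method names below are hypothetical, and the conversion assumes UTF-8 and an offset that does not split a surrogate pair.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class DecodedOffsetDemo {
    /**
     * Counts the chars a Reader-based consumer sees before the first
     * occurrence of target. The consumer has no view of the raw bytes.
     */
    static int charsReadBefore(byte[] input, char target) {
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(input), StandardCharsets.UTF_8)) {
            int count = 0;
            int c;
            while ((c = r.read()) != -1) {
                if (c == target) {
                    return count;
                }
                count++;
            }
            return -1; // target not present
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Re-encodes the prefix to recover the byte offset for a char offset. */
    static int toByteOffset(String text, int charOffset) {
        return text.substring(0, charOffset)
                   .getBytes(StandardCharsets.UTF_8).length;
    }
}
```

Re-encoding the prefix costs time proportional to the offset, so this is more of a debugging aid than something to call per event.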

Answer 2 · 2022-01-22T21:17:00.000Z

@cowtowncoder In that case, maybe it would be better to return -1, as the API specifies when no offset is available? That would be the correct behavior (though admittedly, if you were only using it for debugging purposes, it wouldn't matter so much).

Answer 3 · 2022-01-27T17:09:11.000Z

In theory this would be possible (but would require changes to track the "true" original source).
But I am not sure I see the value in removing this piece of information for the sole purpose of specification conformance; it is possible some users are using this feature, even if it is non-compliant.

So I probably will not proceed with this change.