FasterXML/woodstox

XMLStreamReader's getCharacterOffset() with byte stream returns character offset instead of byte offset

midrare opened this issue · 3 comments

According to Oracle docs for Location.getCharacterOffset():

int getCharacterOffset()
Return the byte or character offset into the input source this location is pointing to. If the input source is a file or a byte stream then this is the byte offset into that stream, but if the input source is a character media then the offset is the character offset. Returns -1 if there is no offset available.

With woodstox 6.2.7, if you have an XMLStreamReader constructed with a ByteArrayInputStream, then reader.getLocation().getCharacterOffset() returns the character offset instead of the byte offset that the documentation above calls for.
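
A minimal sketch of how this could be reproduced, not taken from the original report. The class name `OffsetRepro` and the sample document are made up for illustration, and `XMLInputFactory.newInstance()` only picks up Woodstox if it is on the classpath (otherwise the JDK's bundled StAX implementation is used, which may report something different). The document contains non-ASCII characters so that the byte count and the character count differ; no particular reported value is assumed here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class OffsetRepro {
    // U+00E9 takes 2 bytes in UTF-8, so byte length and char length differ.
    static final String XML = "<root>\u00e9\u00e9\u00e9</root>";

    /** Size of the document as the byte stream sees it. */
    static int byteLength() {
        return XML.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Size of the document in Java chars (UTF-16 code units). */
    static int charLength() {
        return XML.length();
    }

    /** Parses the whole document from a byte stream and returns the last reported offset. */
    static int reportedEndOffset() throws XMLStreamException {
        byte[] bytes = XML.getBytes(StandardCharsets.UTF_8);
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new ByteArrayInputStream(bytes));
        try {
            while (reader.hasNext()) {
                reader.next();
            }
            return reader.getLocation().getCharacterOffset();
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println("bytes=" + byteLength()
                + " chars=" + charLength()
                + " reported=" + reportedEndOffset());
    }
}
```

If the reported value tracks `chars` rather than `bytes`, the reader is counting decoded characters even though the input was a byte stream.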

I think this is unfortunately the way things work if and when the InputStream is accessed under the hood by constructing an InputStreamReader (or similar), in which case the parser itself uses a character-based input source. If so, it is impossible to reliably get access to byte-based offsets.

Woodstox does not decode directly from InputStream or other byte-sources so this is a fundamental limitation that probably cannot be resolved.
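
To illustrate the limitation in isolation, here is a small sketch that uses only plain JDK I/O and is not Woodstox code (the class and method names are hypothetical). Once bytes pass through an InputStreamReader, the consumer only sees decoded chars and can count those, but it has no direct view of how many bytes each char occupied in the original stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class DecodeCount {
    /** Counts the chars a Reader-based consumer would see for the given UTF-8 bytes. */
    static int countDecodedChars(byte[] utf8) throws IOException {
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(utf8), StandardCharsets.UTF_8)) {
            int count = 0;
            // Each read() yields one decoded char; byte widths are no longer visible here.
            while (reader.read() != -1) {
                count++;
            }
            return count;
        }
    }

    public static void main(String[] args) throws IOException {
        // 'a' is 1 byte, U+00E9 is 2 bytes, U+20AC is 3 bytes in UTF-8.
        byte[] bytes = "a\u00e9\u20ac".getBytes(StandardCharsets.UTF_8);
        System.out.println("bytes=" + bytes.length
                + " decodedChars=" + countDecodedChars(bytes));
    }
}
```

A parser sitting on top of such a Reader can only advance a character counter, which matches the behavior described in this issue.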

@cowtowncoder In that case, maybe it would be better to return -1 as specified by the API? That would be the correct behavior (though admittedly, if you were using it only for debugging purposes, it wouldn't matter so much).

In theory this would be possible (but would require changes to track the "true" original source).
But I am not sure I see the value in removing this piece of information for the sole purpose of specification conformance; it is possible some users are relying on this feature, even if it is non-compliant.

So I probably will not proceed with this change.