UTF-8 Encoded files cause fromXML(stream) parser to fail.
Closed this issue · 5 comments
GoogleCodeExporter commented
What steps will reproduce the problem?
1. Build/Use a project that uses the XMappr.fromXML function
2. Identify the XML file in use.
3. Open the file in Notepad++ and change the encoding to UTF-8.
4. Do a file diff to see the special characters introduced by that encoding.
5. Run the test project again and get the error: org.xmappr.XmapprException:
Error reading XML stream: ParseError at [row,col]:[1,1]
6. Change the encoding back to ANSI and repeat the test; it now passes.
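As a quicker alternative to the file diff in step 4, the first three bytes of the file can be checked directly. The following is only a sketch: the class and method names (`BomCheck`, `hasUtf8Bom`, `fileHasUtf8Bom`) are made up for illustration and are not part of Xmappr, and `readNBytes(int)` assumes Java 11 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Xmappr.
final class BomCheck {
    /** Returns true if the data starts with the UTF-8 BOM bytes EF BB BF. */
    static boolean hasUtf8Bom(byte[] head) {
        return head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
    }

    /** Reads at most the first three bytes of the file and checks them. */
    static boolean fileHasUtf8Bom(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return hasUtf8Bom(in.readNBytes(3));
        }
    }
}
```

If `fileHasUtf8Bom` returns true for the XML file in use, the parse error at [1,1] is most likely caused by those three leading bytes rather than by the XML content itself.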
Original issue reported on code.google.com by d...@morris2morris.com
on 19 Nov 2010 at 7:17
GoogleCodeExporter commented
Your editor produces a file with a UTF-8 BOM:
http://www.w3.org/International/questions/qa-utf8-bom.en.php
Try deleting the first three characters.
Original comment by pe...@knego.net
on 19 Nov 2010 at 7:56
GoogleCodeExporter commented
This error is produced by the UTF-8 BOM at the beginning of the file:
http://www.w3.org/International/questions/qa-utf8-bom.en.php
The BOM is used to define byte order on different transport streams.
Xmappr is an XML parsing library and is not concerned with transport protocol
issues.
Even the JVM itself is not concerned with the BOM:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058
Original comment by peter.kn...@gmail.com
on 19 Nov 2010 at 8:09
- Changed state: WontFix
GoogleCodeExporter commented
That's understandable, I guess. And yes, deleting the first three characters
might work, but the easier fix is just to change the encoding of the file to
ANSI. However, that takes away from the ease of use the examples suggest:
people at least need to know that if their file uses the UTF-8 encoding, the
provided example will fail. It would just be nice to see that 'gotcha' listed
somewhere, if not as a bug, then at least as a note that makes users aware
of it.
Original comment by d...@morris2morris.com
on 19 Nov 2010 at 8:54
GoogleCodeExporter commented
This is really a speciality of Microsoft software. Other systems don't do this.
And you are right - we can detect those three bytes and just discard them.
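For illustration, detecting and discarding the three bytes could look roughly like the sketch below, applied by the caller before the stream reaches the parser. This is not Xmappr code: `BomSkipper` and `skipUtf8Bom` are hypothetical names, and only the UTF-8 BOM (EF BB BF) is handled.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper, not part of Xmappr.
final class BomSkipper {
    /**
     * Wraps the stream so that a leading UTF-8 BOM (EF BB BF), if present,
     * is consumed. Any other leading bytes are pushed back untouched.
     */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until 3 or EOF.
        while (n < 3) {
            int r = pushback.read(head, n, 3 - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean bom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!bom && n > 0) {
            // Not a BOM: give the bytes back so the parser sees them.
            pushback.unread(head, 0, n);
        }
        return pushback;
    }
}
```

The returned stream would then be handed to whatever call currently receives the raw file stream, such as the `fromXML` call from the report. Files without a BOM pass through unchanged.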
Original comment by pe...@knego.net
on 19 Nov 2010 at 9:44
GoogleCodeExporter commented
Also, the encoding in XML is defined by this header, not by the BOM:
<?xml version="1.0" encoding="utf-8"?>
Original comment by pe...@knego.net
on 19 Nov 2010 at 9:49