GB18030 false positive with WINDOWS-1252 data set
GoogleCodeExporter commented
What steps will reproduce the problem?
1. Pass UniversalDetector a WINDOWS-1252 byte buffer containing a series of
degree symbols interleaved with letters and digits,
e.g. {91, -80, 52, -80, 48, -80, 84, -80, 67, -80, 67, -80, 48, -80, 67, -80, 84}
2. Call UniversalDetector#getDetectedCharset(). It should return WINDOWS-1252,
but instead returns GB18030.
See attached unit test for minimal reproduction test case.
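For readers without the attachment, here is a minimal sketch of what the sample bytes represent when read as WINDOWS-1252. It uses only the JDK, not the detector; the class and method names (DegreeSample, decodeAsWindows1252) are made up for illustration and are not part of the attached test.

```java
import java.nio.charset.Charset;

public class DegreeSample {
    // Byte values copied from step 1; -80 is the signed Java form of 0xB0,
    // which is the degree sign in WINDOWS-1252.
    static final byte[] SAMPLE = {
        91, -80, 52, -80, 48, -80, 84, -80, 67, -80, 67, -80, 48, -80, 67, -80, 84
    };

    // Decode the buffer with the charset the reporter expects the detector to name.
    static String decodeAsWindows1252(byte[] data) {
        return new String(data, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // Prints the decoded text; how the degree sign renders depends on the console encoding.
        System.out.println(decodeAsWindows1252(SAMPLE));
    }
}
```

Decoded this way, the buffer is an opening bracket followed by eight degree-sign and character pairs, so every second byte is 0xB0.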
What is the expected output? What do you see instead?
Expected output from UniversalDetector#getDetectedCharset() is "WINDOWS-1252",
but instead it is "GB18030".
What version of the product are you using? On what operating system?
I'm using version 1.0.3 on 64-bit Ubuntu 11.04 (Natty) with the default kernel, 2.6.38-10-generic. The JDK I'm currently running is 1.6.0_23-x64.
Original issue reported on code.google.com by icw...@gmail.com
on 13 Jul 2011 at 4:34
GoogleCodeExporter commented
Unit test attached
Original comment by icw...@gmail.com
on 13 Jul 2011 at 4:41
GoogleCodeExporter commented
I experienced the same issue. Changing the buffer size used for reading the
input stream from 4096 to 128 solved the problem. The error occurred with
buffer sizes of 253 and above.
Original comment by eman0...@gmail.com
on 28 Feb 2012 at 9:39
GoogleCodeExporter commented
[deleted comment]
GoogleCodeExporter commented
Changing the buffer size did not solve the issue on real files.
The workaround I am currently using is to check whether one or more degree
characters (°) are present in the byte stream (buf[i] == (byte) 0xB0). If so,
and if the detector returns "GB18030", I use "WINDOWS-1252" instead.
This gives good results (as long as you do not have to detect GB18030-encoded
files).
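A rough sketch of that workaround, assuming the detector's result is already available as a String. The class and method names (DegreeWorkaround, containsDegreeByte, correctCharset) are hypothetical and not part of the library.

```java
public class DegreeWorkaround {
    // 0xB0 is the degree sign in WINDOWS-1252.
    static final byte DEGREE = (byte) 0xB0;

    // Returns true if any of the first `length` bytes equals 0xB0.
    static boolean containsDegreeByte(byte[] buf, int length) {
        for (int i = 0; i < length; i++) {
            if (buf[i] == DEGREE) {
                return true;
            }
        }
        return false;
    }

    // Overrides a "GB18030" result with "WINDOWS-1252" when a degree byte is present.
    // Any other result (including null) is passed through unchanged.
    static String correctCharset(String detected, byte[] buf, int length) {
        if ("GB18030".equals(detected) && containsDegreeByte(buf, length)) {
            return "WINDOWS-1252";
        }
        return detected;
    }
}
```

The value passed as `detected` would be whatever UniversalDetector#getDetectedCharset() returned for the same buffer. Since 0xB0 can also appear in real GB18030 text, this override will misclassify such files, which is the limitation noted above.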
Original comment by juliende...@gmail.com
on 29 Jan 2015 at 10:20