XML parse error
Originally reported on Google Code with ID 6
=> What steps will reproduce the problem?
1. Run the parser with the following code:
public static void main(String[] args) throws Exception {
    File dumpFile = new File(PATH_TO_DUMP_FILE);
    File outputDirectory = new File(TARGET_DIRECTORY);
    boolean overwriteExisting = OVERWRITE_EXISTING_FILES;
    JWKTL.parseWiktionaryDump(dumpFile, outputDirectory, overwriteExisting);
}
2. Use the two latest dump files from Wiktionary:
enwiktionary-20140504-pages-articles.xml
enwiktionary-latest-pages-articles.xml
=> What is the expected output? What do you see instead?
INFO: Parsed 775000 pages
Exception in thread "main" de.tudarmstadt.ukp.jwktl.api.WiktionaryException: XML parse error
at de.tudarmstadt.ukp.jwktl.parser.XMLDumpParser.parse(XMLDumpParser.java:140)
at de.tudarmstadt.ukp.jwktl.parser.WiktionaryDumpParser.parse(WiktionaryDumpParser.java:74)
at de.tudarmstadt.ukp.jwktl.JWKTL.parseWiktionaryDump(JWKTL.java:143)
at de.tudarmstadt.ukp.jwktl.JWKTL.parseWiktionaryDump(JWKTL.java:117)
at ParsingData.main(ParsingData.java:12)
Caused by: org.xml.sax.SAXParseException; lineNumber: 34869191; columnNumber: 5; Invalid byte 2 of 4-byte UTF-8 sequence.
at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
at de.tudarmstadt.ukp.jwktl.parser.XMLDumpParser.parse(XMLDumpParser.java:131)
... 4 more
Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
... 14 more
=> What version of the product are you using? On what operating system?
<dependency>
<groupId>de.tudarmstadt.ukp.jwktl</groupId>
<artifactId>jwktl</artifactId>
<version>1.0.0</version>
</dependency>
=> Please provide any additional information below.
However, the parsing went through successfully with the dump file "enwiktionary-20140415-pages-articles.xml"
Reported by ngoc@fbk.eu
on 2014-05-21 01:34:15
It seems that someone has typed a non-UTF-8 character into a Wiktionary article, and the
database dump application hasn't cleaned it up. That is, latest.xml contains a 4-byte character
sequence that does not follow the expected UTF-8 bit pattern
(11110xxx 10xxxxxx 10xxxxxx 10xxxxxx). The best solution is to remove these invalid characters.
A quick-and-dirty hack might be http://stackoverflow.com/questions/12999651/how-to-remove-non-utf-8-characters-from-text-file
A more elaborate idea is http://www.mkyong.com/java/sax-error-malformedbytesequenceexception-invalid-byte-1-of-1-byte-utf-8-sequence/
I don't know which of these helps, or whether there is an easy way of stripping non-UTF-8
characters from the input file. It would be nice if you could report back and potentially
submit a patch, since I suspect that other users will run into the same issue.
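For anyone who wants to try the "strip the invalid bytes" route in Java, here is a minimal sketch. It assumes the dump really does contain malformed UTF-8; later comments in this thread point to a parser bug instead, so this may not make the error go away. Utf8Cleaner is a hypothetical helper, not part of JWKTL. It relies only on the standard CharsetDecoder with CodingErrorAction.IGNORE, which silently drops whatever the decoder rejects.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper (not part of JWKTL): rewrites a stream as clean UTF-8.
final class Utf8Cleaner {

    /** Copies in to out as UTF-8, dropping any bytes the decoder rejects as malformed. */
    static void clean(InputStream in, OutputStream out) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.IGNORE)
                .onUnmappableCharacter(CodingErrorAction.IGNORE);
        // try-with-resources closes both streams when the copy is done.
        try (Reader reader = new BufferedReader(new InputStreamReader(in, decoder));
             Writer writer = new BufferedWriter(
                     new OutputStreamWriter(out, StandardCharsets.UTF_8))) {
            char[] buffer = new char[8192];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, read);
            }
        }
    }

    // Usage: java Utf8Cleaner <input dump> <cleaned output>
    public static void main(String[] args) throws IOException {
        clean(Files.newInputStream(Paths.get(args[0])),
              Files.newOutputStream(Paths.get(args[1])));
    }
}
```

The cleaned copy can then be passed to JWKTL.parseWiktionaryDump in place of the original dump. Dropping bytes silently loses data, so comparing the file sizes before and after gives a rough idea of how much was removed.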
Reported by chmeyer.de
on 2014-05-21 10:03:53
- Status changed:
Accepted
- Labels added: Component-Parser
Probably fixed with jwktl-1.0.1. Please try again using the new version. Note that AFAIK
the UTF-8 bug still exists, but it is either ignored by the current xerces version
or the current XML dump has been fixed w.r.t. this issue.
Reported by chmeyer.de
on 2014-09-30 12:27:24
- Status changed:
Fixed
Hi, I tried version 1.0.1, but it fails. I tried removing the non-UTF-8 characters, but it fails again.
I then built version 1.0.2-SNAPSHOT from source, and it fails with a java.lang.ArrayIndexOutOfBoundsException: 2048.
Am I doing something wrong?
@cescobaz you'll need a more recent version of Java in which the bug is fixed, and you should make sure that xerces is not somehow still on the classpath
- java 1.8.0_72
- jwktl 1.0.1 from maven repository
- file to parse: enwiktionary-20160111-pages-articles.xml
running the code:
public static void main(String[] args) throws Exception {
    File dumpFile = new File(PATH_TO_DUMP_FILE);
    File outputDirectory = new File(TARGET_DIRECTORY);
    boolean overwriteExisting = OVERWRITE_EXISTING_FILES;
    JWKTL.parseWiktionaryDump(dumpFile, outputDirectory, overwriteExisting);
}
this is the output:
Jan 31, 2016 2:03:14 PM de.tudarmstadt.ukp.jwktl.parser.WiktionaryDumpParser onPageEnd
INFO: Parsed 2525000 pages
[WARNING]
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:297)
at java.lang.Thread.run(Thread.java:745)
Caused by: de.tudarmstadt.ukp.jwktl.api.WiktionaryException: XML parse error
at de.tudarmstadt.ukp.jwktl.parser.XMLDumpParser.parse(XMLDumpParser.java:140)
at de.tudarmstadt.ukp.jwktl.parser.WiktionaryDumpParser.parse(WiktionaryDumpParser.java:74)
at de.tudarmstadt.ukp.jwktl.JWKTL.parseWiktionaryDump(JWKTL.java:143)
at de.tudarmstadt.ukp.jwktl.JWKTL.parseWiktionaryDump(JWKTL.java:117)
at it.unibo.mps.JWKTLConfigurator.main(JWKTLConfigurator.java:17)
... 6 more
Caused by: org.xml.sax.SAXParseException; lineNumber: 92224978; columnNumber: 3; Invalid byte 2 of 4-byte UTF-8 sequence.
at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
at de.tudarmstadt.ukp.jwktl.parser.XMLDumpParser.parse(XMLDumpParser.java:131)
... 10 more
Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
... 20 more
Compiling JWKTL from source and running the same code on the same file, the output is:
Jan 31, 2016 1:34:54 PM de.tudarmstadt.ukp.jwktl.parser.WiktionaryDumpParser onPageEnd
INFO: Parsed 2525000 pages
[WARNING]
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:297)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 2048
at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
at de.tudarmstadt.ukp.jwktl.parser.XMLDumpParser.parseStream(XMLDumpParser.java:125)
at de.tudarmstadt.ukp.jwktl.parser.XMLDumpParser.parse(XMLDumpParser.java:116)
at de.tudarmstadt.ukp.jwktl.parser.WiktionaryDumpParser.parse(WiktionaryDumpParser.java:78)
at de.tudarmstadt.ukp.jwktl.JWKTL.parseWiktionaryDump(JWKTL.java:143)
at de.tudarmstadt.ukp.jwktl.JWKTL.parseWiktionaryDump(JWKTL.java:117)
at it.unibo.mps.JWKTLConfigurator.main(JWKTLConfigurator.java:17)
... 6 more
You still have a buggy version of xerces on the classpath, probably through a 3rd party dependency (shown by the org.apache.xerces. package prefix in the stacktrace).
Run mvn dependency:tree to show a list of your dependencies and to see where xerces gets pulled in from.
In the meantime I'll check if I can change jwktl to prefer the bundled version instead.
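Besides the dependency tree, it can help to ask the JVM directly which SAX implementation JAXP will hand out at runtime. The following is a small sketch using only standard JDK calls; SaxParserProbe is a hypothetical class name. If the printed class name starts with org.apache.xerces, the standalone Xerces jar is being picked up; a name starting with com.sun.org.apache.xerces.internal indicates the parser bundled with the JDK.

```java
import java.security.CodeSource;
import javax.xml.parsers.SAXParserFactory;

// Hypothetical diagnostic helper (not part of JWKTL).
final class SaxParserProbe {

    /** Returns the JAXP factory class in use and, where known, the jar it was loaded from. */
    static String describe() {
        Class<?> impl = SAXParserFactory.newInstance().getClass();
        CodeSource source = impl.getProtectionDomain().getCodeSource();
        // Classes shipped with the JDK itself may not report a code source.
        String location = (source == null || source.getLocation() == null)
                ? "the JDK runtime"
                : source.getLocation().toString();
        return impl.getName() + " loaded from " + location;
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Run it with the same classpath as the failing program (for example through the same Maven exec setup), otherwise the answer may differ.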
can you try the latest master?
Ok, mvn dependency:tree shows the following (I don't know how to remove xerces):
[INFO] \- de.tudarmstadt.ukp.jwktl:jwktl:jar:1.0.1:compile
[INFO] +- com.sleepycat:je:jar:5.0.73:compile
[INFO] +- org.apache.ant:ant:jar:1.7.1:compile
[INFO] | \- org.apache.ant:ant-launcher:jar:1.7.1:compile
[INFO] \- xerces:xercesImpl:jar:2.11.0:compile
However I tried the latest master and it works!
Thanks!
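For readers stuck on a release that still pulls in xercesImpl, one possible workaround is to tell JAXP to use the JDK's own parser through the standard javax.xml.parsers.SAXParserFactory system property. This is only a sketch under assumptions: the factory class name below is the internal implementation shipped with Oracle/OpenJDK builds, it helps only if JWKTL obtains its parser through SAXParserFactory.newInstance(), and, as noted elsewhere in this thread, the JDK itself must be recent enough to contain the fix. BundledParserSelector is a hypothetical class name.

```java
import javax.xml.parsers.SAXParserFactory;

// Hypothetical helper (not part of JWKTL).
final class BundledParserSelector {

    // Assumption: JAXP implementation class bundled with Oracle/OpenJDK builds.
    static final String JDK_FACTORY =
            "com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl";

    /** Points JAXP at the JDK's built-in factory and returns the class actually chosen. */
    static String useBundledParser() {
        // The system property takes precedence over implementations found on the classpath.
        System.setProperty("javax.xml.parsers.SAXParserFactory", JDK_FACTORY);
        return SAXParserFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println("JAXP now uses: " + useBundledParser());
    }
}
```

Calling useBundledParser() before JWKTL.parseWiktionaryDump, or passing the same property with -D on the command line, should have the same effect.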
I got the same error. What do you mean by latest master?
@hitzhoudi you'll need a recent snapshot version from
It works. Thanks!
@hitzhoudi good!
@chmeyer any plans to cut a new release? the changelog is getting quite big, with lots of valuable fixes like this one.
totally agree; will trigger the release soon. No show-stoppers at the moment, I guess?
not that I know of; the 1.1.0 milestone contains 2 issues which don't seem to be critical (switching package names - what should it be changed to?)
"Now or never..." - release is complete, so hopefully this issue can be entirely closed now.
Just a quick note, the 20160601 dump failed with this error on JDK8u45 – but upgrading to u92 fixed it.