dkpro/dkpro-jwktl

XML parse error


Originally reported on Google Code with ID 6

=> What steps will reproduce the problem?
1. Run the parser with the following code:
public static void main(String[] args) throws Exception {
    File dumpFile = new File(PATH_TO_DUMP_FILE);
    File outputDirectory = new File(TARGET_DIRECTORY);
    boolean overwriteExisting = OVERWRITE_EXISTING_FILES;

    JWKTL.parseWiktionaryDump(dumpFile, outputDirectory, overwriteExisting);
}

2. Using the 2 latest dump datafiles from Wiktionary
     enwiktionary-20140504-pages-articles.xml
     enwiktionary-latest-pages-articles.xml


=> What is the expected output? What do you see instead?
INFO: Parsed 775000 pages
Exception in thread "main" de.tudarmstadt.ukp.jwktl.api.WiktionaryException: XML parse error
    at de.tudarmstadt.ukp.jwktl.parser.XMLDumpParser.parse(XMLDumpParser.java:140)
    at de.tudarmstadt.ukp.jwktl.parser.WiktionaryDumpParser.parse(WiktionaryDumpParser.java:74)
    at de.tudarmstadt.ukp.jwktl.JWKTL.parseWiktionaryDump(JWKTL.java:143)
    at de.tudarmstadt.ukp.jwktl.JWKTL.parseWiktionaryDump(JWKTL.java:117)
    at ParsingData.main(ParsingData.java:12)
Caused by: org.xml.sax.SAXParseException; lineNumber: 34869191; columnNumber: 5; Invalid byte 2 of 4-byte UTF-8 sequence.
    at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
    at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
    at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
    at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
    at de.tudarmstadt.ukp.jwktl.parser.XMLDumpParser.parse(XMLDumpParser.java:131)
    ... 4 more
Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
    at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
    at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
    ... 14 more

=> What version of the product are you using? On what operating system?
   <dependency>
     <groupId>de.tudarmstadt.ukp.jwktl</groupId>
     <artifactId>jwktl</artifactId>
     <version>1.0.0</version>
   </dependency>

=> Please provide any additional information below.
However, the parsing went through successfully with the dump file "enwiktionary-20140415-pages-articles.xml"

Reported by ngoc@fbk.eu on 2014-05-21 01:34:15

Seems like someone has typed a non-UTF-8 character into a Wiktionary article, which
hasn't been cleaned by the database dump application. That is, there is a 4-byte character
sequence in the latest.xml that does not follow the expected UTF-8 sequence pattern
(11110xxx 10xxxxxx 10xxxxxx 10xxxxxx). The best solution is to remove these invalid characters.
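
For illustration, the bit pattern mentioned above can be checked roughly like this. This is only a sketch: the class and method names are made up for this example, and it tests the bit pattern alone, not the additional range restrictions that strict UTF-8 places on the first two bytes.

```java
public class Utf8FourByteCheck {

    /**
     * Returns true if the four bytes starting at offset match the pattern
     * 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
     */
    public static boolean matchesFourBytePattern(byte[] data, int offset) {
        if (offset < 0 || offset + 4 > data.length) {
            return false; // not enough bytes left for a 4-byte sequence
        }
        if ((data[offset] & 0xF8) != 0xF0) {
            return false; // lead byte must be 11110xxx
        }
        for (int i = 1; i < 4; i++) {
            if ((data[offset + i] & 0xC0) != 0x80) {
                return false; // continuation bytes must be 10xxxxxx
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Well-formed 4-byte sequence (encodes U+1F600).
        byte[] valid = {(byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x80};
        // Byte 2 is a plain ASCII byte, so it is not a continuation byte.
        byte[] broken = {(byte) 0xF0, (byte) 0x41, (byte) 0x98, (byte) 0x80};
        System.out.println(matchesFourBytePattern(valid, 0));
        System.out.println(matchesFourBytePattern(broken, 0));
    }
}
```

A second byte that fails the continuation check is the kind of input that produces "Invalid byte 2 of 4-byte UTF-8 sequence".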

A quick&dirty hack might be http://stackoverflow.com/questions/12999651/how-to-remove-non-utf-8-characters-from-text-file

A more elaborate idea is http://www.mkyong.com/java/sax-error-malformedbytesequenceexception-invalid-byte-1-of-1-byte-utf-8-sequence/

I don't know which of these helps, or whether there is an easy way of stripping non-UTF-8
characters from the input file. It would be nice if you could report back and potentially
submit a patch, since I suspect that other users will run into the same issue.
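
One possible way to strip malformed bytes in plain Java is to stream the dump through a CharsetDecoder configured to ignore malformed input. The following is a minimal sketch, not part of JWKTL; the class name Utf8Cleaner and the two command-line arguments (input path, output path) are assumptions for this example. It only helps if the dump really contains malformed bytes, and it silently drops them.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Utf8Cleaner {

    /** Copies in to out as UTF-8, dropping byte sequences that are not well-formed UTF-8. */
    public static void clean(InputStream in, OutputStream out) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.IGNORE)
                .onUnmappableCharacter(CodingErrorAction.IGNORE);
        // Stream in chunks so that multi-gigabyte dumps do not need to fit in memory.
        try (Reader reader = new BufferedReader(new InputStreamReader(in, decoder));
             Writer writer = new BufferedWriter(
                     new OutputStreamWriter(out, StandardCharsets.UTF_8))) {
            char[] buffer = new char[8192];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0]: original dump file, args[1]: cleaned copy (both hypothetical paths)
        try (InputStream in = Files.newInputStream(Paths.get(args[0]));
             OutputStream out = Files.newOutputStream(Paths.get(args[1]))) {
            clean(in, out);
        }
    }
}
```

The cleaned copy could then be passed to JWKTL.parseWiktionaryDump in place of the original file. Using CodingErrorAction.REPLACE instead of IGNORE would keep a visible U+FFFD marker where bytes were dropped.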

Reported by chmeyer.de on 2014-05-21 10:03:53

  • Status changed: Accepted
  • Labels added: Component-Parser
Probably fixed with jwktl-1.0.1. Please try again using the new version. Note that AFAIK
the UTF-8 bug still exists, but it is either ignored by the current xerces version
or the current XML dump has been fixed w.r.t. this issue.

Reported by chmeyer.de on 2014-09-30 12:27:24

  • Status changed: Fixed

Hi, I tried version 1.0.1, but it fails. I tried removing the non-UTF-8 characters, but it fails again.
I also tried building version 1.0.2-SNAPSHOT from source, and it fails with a java.lang.ArrayIndexOutOfBoundsException: 2048.

Am I doing something wrong?

@cescobaz you'll need a more recent version of Java in which the bug is fixed, and you need to make sure that xerces is not somehow still on the classpath.

  • java 1.8.0_72
  • jwktl 1.0.1 from maven repository
  • file to parse: enwiktionary-20160111-pages-articles.xml

running the code:

public static void main(String[] args) throws Exception {
    File dumpFile = new File(PATH_TO_DUMP_FILE);
    File outputDirectory = new File(TARGET_DIRECTORY);
    boolean overwriteExisting = OVERWRITE_EXISTING_FILES;

    JWKTL.parseWiktionaryDump(dumpFile, outputDirectory, overwriteExisting);
}

this is the output:

Jan 31, 2016 2:03:14 PM de.tudarmstadt.ukp.jwktl.parser.WiktionaryDumpParser onPageEnd
INFO: Parsed 2525000 pages
[WARNING] 
java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:297)
    at java.lang.Thread.run(Thread.java:745)
Caused by: de.tudarmstadt.ukp.jwktl.api.WiktionaryException: XML parse error
    at de.tudarmstadt.ukp.jwktl.parser.XMLDumpParser.parse(XMLDumpParser.java:140)
    at de.tudarmstadt.ukp.jwktl.parser.WiktionaryDumpParser.parse(WiktionaryDumpParser.java:74)
    at de.tudarmstadt.ukp.jwktl.JWKTL.parseWiktionaryDump(JWKTL.java:143)
    at de.tudarmstadt.ukp.jwktl.JWKTL.parseWiktionaryDump(JWKTL.java:117)
    at it.unibo.mps.JWKTLConfigurator.main(JWKTLConfigurator.java:17)
    ... 6 more
Caused by: org.xml.sax.SAXParseException; lineNumber: 92224978; columnNumber: 3; Invalid byte 2 of 4-byte UTF-8 sequence.
    at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
    at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
    at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
    at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
    at de.tudarmstadt.ukp.jwktl.parser.XMLDumpParser.parse(XMLDumpParser.java:131)
    ... 10 more
Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
    at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
    at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
    ... 20 more

After compiling JWKTL from source and running the same code with the same file as above, the output is:

Jan 31, 2016 1:34:54 PM de.tudarmstadt.ukp.jwktl.parser.WiktionaryDumpParser onPageEnd
INFO: Parsed 2525000 pages
[WARNING] 
java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:297)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 2048
    at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
    at de.tudarmstadt.ukp.jwktl.parser.XMLDumpParser.parseStream(XMLDumpParser.java:125)
    at de.tudarmstadt.ukp.jwktl.parser.XMLDumpParser.parse(XMLDumpParser.java:116)
    at de.tudarmstadt.ukp.jwktl.parser.WiktionaryDumpParser.parse(WiktionaryDumpParser.java:78)
    at de.tudarmstadt.ukp.jwktl.JWKTL.parseWiktionaryDump(JWKTL.java:143)
    at de.tudarmstadt.ukp.jwktl.JWKTL.parseWiktionaryDump(JWKTL.java:117)
    at it.unibo.mps.JWKTLConfigurator.main(JWKTLConfigurator.java:17)
    ... 6 more

You still have a buggy version of xerces on the classpath, probably through a 3rd party dependency (shown by the org.apache.xerces. package prefix in the stacktrace).

Run mvn dependency:tree to list your dependencies and see where xerces gets pulled in from.

In the meantime I'll check if I can change jwktl to prefer the bundled version instead.
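
Besides the dependency tree, a quick runtime check can show which SAX implementation JAXP actually selects. This is a rough sketch with a made-up class name (WhichSaxParser); it has to run with the same classpath as the failing program. A class name starting with org.apache.xerces points to an external xerces jar, while com.sun.org.apache.xerces.internal is the implementation bundled with the JDK.

```java
import java.security.CodeSource;
import javax.xml.parsers.SAXParserFactory;

public class WhichSaxParser {

    /** Describes the SAXParserFactory implementation in use and where it was loaded from. */
    public static String describe() {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        Class<?> impl = factory.getClass();
        CodeSource source = impl.getProtectionDomain().getCodeSource();
        // A null code source usually means the class came from the JDK itself
        // rather than from a jar on the classpath.
        String location = (source == null || source.getLocation() == null)
                ? "unknown location (probably the JDK)"
                : source.getLocation().toString();
        return impl.getName() + " from " + location;
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```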

can you try the latest master?

Ok, mvn dependency:tree shows the following (I don't know how to remove xerces):

[INFO] \- de.tudarmstadt.ukp.jwktl:jwktl:jar:1.0.1:compile
[INFO]    +- com.sleepycat:je:jar:5.0.73:compile
[INFO]    +- org.apache.ant:ant:jar:1.7.1:compile
[INFO]    |  \- org.apache.ant:ant-launcher:jar:1.7.1:compile
[INFO]    \- xerces:xercesImpl:jar:2.11.0:compile

However I tried the latest master and it works!
Thanks!

I got the same error. What do you mean by latest master?

It works. Thanks!

@hitzhoudi good!
@chmeyer any plans to cut a new release? the changelog is getting quite big, with lots of valuable fixes like this one.

totally agree; will trigger the release soon. No show-stoppers at the moment, I guess?

not that I know of; the 1.1.0 milestone contains 2 issues which don't seem to be critical (switching package names - what should it be changed to?)

"Now or never..." - release is complete, so hopefully this issue can be entirely closed now.

Just a quick note, the 20160601 dump failed with this error on JDK8u45 – but upgrading to u92 fixed it.