optimaize/language-detector

could not initialize class com.optimaize.langdetect.profiles.BuiltInLanguages when running on spark

ekapratama93 opened this issue · 4 comments

I get the error java.lang.NoClassDefFoundError: Could not initialize class com.optimaize.langdetect.profiles.BuiltInLanguages when using language-detector in Spark. I'm using the suggested method to load the profiles:
List<LanguageProfile> languageProfiles = new LanguageProfileReader().readAllBuiltIn();
LanguageDetector languageDetector = LanguageDetectorBuilder.create(NgramExtractors.standard())
        .withProfiles(languageProfiles)
        .build();

Here is the relevant part of the stack trace:
16/11/30 17:36:05 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerTaskEnd(0,0,ResultTask,ExceptionFailure(java.lang.NoClassDefFoundError,Could not initialize class com.optimaize.langdetect.profiles.BuiltInLanguages,[Ljava.lang.StackTraceElement;@4a5ae036,java.lang.NoClassDefFoundError: Could not initialize class com.optimaize.langdetect.profiles.BuiltInLanguages
    at com.optimaize.langdetect.profiles.LanguageProfileReader.readAllBuiltIn(LanguageProfileReader.java:118)
    at com.ebdesk.ph.nlp_sentence.TwitterWord2Vec$1.call(TwitterWord2Vec.java:86)
    at com.ebdesk.ph.nlp_sentence.TwitterWord2Vec$1.call(TwitterWord2Vec.java:1)
    at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1028)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:214)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:919)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:910)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:910)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:668)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    at org.apache.spark.scheduler.Task.run(Task.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

A bit of late help for anybody who runs into the same problem and finds this issue (as I did). It's based on everything I've been able to find, and I hope it's complete enough to be useful, since I've come across quite a few people with this problem but not many complete solutions. As it was my first time with Spark and I was also struggling with non-serializable tasks, trying to instantiate things in the workers themselves, and a lot more, it wasn't easy to pinpoint that this was the real problem, since it could have been a lot of things.

Be aware that I'm by no means a Spark expert, though.

  • The NoClassDefFoundError appears when a class that was present at compile time is not available (or cannot be initialized) at runtime. In this case, that class is com.optimaize.langdetect.profiles.BuiltInLanguages.

  • LanguageProfileReader.readAllBuiltIn uses BuiltInLanguages, which is initialized with a static block like this:

static {
    List<LdLocale> names = new ArrayList<>();
    names.add(LdLocale.fromString("af"));
    names.add(LdLocale.fromString("an"));
    [...]
}

and LdLocale.fromString(String) does something like this:

[...]
List<String> strings = Splitter.on('-').splitToList(string);
[...]
  • com.google.common.base.Splitter is part of the Guava library, and its splitToList(CharSequence) method was added in Guava 15.0, if I'm not mistaken. spark-core_2.10 has a dependency on Guava 11.0.2 (coming from hadoop-client:2.2.0 -> hadoop-common:2.2.0), which doesn't have the splitToList method, so if that old version of Guava ends up on the classpath, fromString fails, the BuiltInLanguages initialization fails, and so on (a quick way to check which Guava the executors actually load is sketched right below).
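A hypothetical diagnostic you could run inside a Spark task to confirm which Guava the executors actually loaded; Splitter and splitToList are real Guava APIs, but the helper class itself is just my sketch, not part of the library:

import com.google.common.base.Splitter;

// Reports where Guava was loaded from and whether splitToList exists there.
// On Guava 11.0.2 the reflection lookup below throws NoSuchMethodException.
public class GuavaProbe {
    public static String probe() {
        String source = String.valueOf(
                Splitter.class.getProtectionDomain().getCodeSource());
        boolean hasSplitToList;
        try {
            Splitter.class.getMethod("splitToList", CharSequence.class);
            hasSplitToList = true;
        } catch (NoSuchMethodException e) {
            hasSplitToList = false;
        }
        return "Guava from " + source + ", splitToList present: " + hasSplitToList;
    }
}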

  • Of course, this can be solved easily by just telling Maven (or whatever build tool you're using) to use version 18.0 (or a later one) of Guava:

<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>19.0</version>
        </dependency>
    </dependencies>
</dependencyManagement>
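(For what it's worth, the reason dependencyManagement helps here is that it overrides the Guava version for the whole dependency tree, including the transitive 11.0.2 pulled in through hadoop-common, so at build time only the newer Guava is resolved.)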
  • This works fine if you run the Spark driver program locally (new SparkConf().setMaster("local[*]") and such), but, for some reason, if you run it on a real Spark cluster (new SparkConf().setMaster("spark://master_name:7077")) it seems to ignore the newer version and use an old one which doesn't have the splitToList method, and therefore it crashes (at least in standalone mode; I haven't tried Mesos or YARN).

  • Because of this, the static block that BuiltInLanguages uses for initialization fails, but the original error about the missing splitToList method never reaches the driver (this is the part where I'm not exactly sure of what happens and why), since it happens in the workers (although it should be visible in the worker's stderr log, if I'm not mistaken).

  • So, in the end, the initialization of BuiltInLanguages fails, but the driver program is unaware of that. Then, when you try to do something which uses BuiltInLanguages... surprise, the class cannot be used. But it was fine at compile time, hence the NoClassDefFoundError. A minimal illustration of this failure mode is sketched right below.
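A toy sketch of that mechanism, unrelated to the library's actual code: a class whose static initializer throws fails with ExceptionInInitializerError on first use, and every later use fails with the misleading "Could not initialize class" NoClassDefFoundError seen in the stack trace above.

class Broken {
    static {
        // Stand-in for BuiltInLanguages' static block failing
        // (e.g. because splitToList does not exist in the old Guava).
        if (true) throw new RuntimeException("static initialization failed");
    }
    static void use() { }
}

public class NoClassDefDemo {
    public static void main(String[] args) {
        try {
            Broken.use();
        } catch (Throwable t) {
            // First access: java.lang.ExceptionInInitializerError
            System.out.println(t);
        }
        try {
            Broken.use();
        } catch (Throwable t) {
            // Later accesses: java.lang.NoClassDefFoundError: Could not initialize class Broken
            System.out.println(t);
        }
    }
}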

  • I tried to force Spark to use my Guava version by including it in my megajar, adding it to spark-submit with "--jars", with additional classpath entries, telling Spark to give preference to my classpath, etc., and nothing worked. In the end, I solved the problem by removing the dependency: I downloaded the whole language detector from GitHub, replaced the splitToList(String) call with something that didn't use Guava, and compiled it as a new dependency for my project.

  • Around the time I decided to modify the detector, I found a fork of the project made by people with a similar problem.

The code they used:

//List<String> strings = Splitter.on('-').splitToList(string);
List<String> strings = new ArrayList<String>();
String[] stringParts = string.split("-");
for (String stringpart: stringParts){
    strings.add(stringpart);
}
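Just as a side note (my own simplification, not the fork's code): that loop is equivalent to wrapping the split result directly, e.g.

import java.util.Arrays;
import java.util.List;

// Split a locale tag on '-' without Guava; Arrays.asList returns a
// fixed-size list, which is enough for the read-only use in fromString.
class LocaleSplit {
    static List<String> parts(String string) {
        return Arrays.asList(string.split("-"));
    }
}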

The repository with the code:
netarchivesuite@57ba6ed

The Jira issue where I found the repository:
https://sbforge.org/jira/browse/WEBDAN-86?focusedCommentId=31306&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-31306

Adding
--conf "spark.executor.userClassPathFirst=true"
to spark-submit makes Spark load user jars first, so you can bundle a newer Guava into your Spark job's jar.
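A rough sketch of what that invocation could look like (master URL, main class, and jar name are placeholders; spark.driver.userClassPathFirst is the driver-side counterpart, in case the driver also needs the newer Guava):

spark-submit \
  --master spark://master_name:7077 \
  --conf spark.executor.userClassPathFirst=true \
  --conf spark.driver.userClassPathFirst=true \
  --class com.example.MyJob \
  my-job-with-guava.jar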

I tried the "userClassPathFirst" strategy (Spark 2.3.1), but unfortunately adding that config seemed to bork something else unrelated in Spark. Possibly Spark depends on the behavior of the older version of Guava and going up to Guava 19 makes it blow up? Hard to tell.

The GitHub fork in the comment by @DanielGSM has a jar file that can be dropped into projects to solve this. Probably the easiest solution.