databricks/spark-corenlp

protobuf dependency conflict between corenlp and spark

cfregly opened this issue · 1 comment

@mengxr it looks like spark is stuck on protobuf-java version 2.5.0 (https://github.com/apache/spark/blob/0a38637d05d2338503ecceacfb911a6da6d49538/pom.xml#L130) while corenlp has charged ahead with v 2.6.1.

how did you overcome this conflict?

here's the stack trace:

java.lang.NoSuchMethodError: com.google.protobuf.LazyStringList.getUnmodifiableView()Lcom/google/protobuf/LazyStringList;
    at edu.stanford.nlp.pipeline.CoreNLPProtos$Token$Builder.buildPartial(CoreNLPProtos.java:12243)
    at edu.stanford.nlp.pipeline.CoreNLPProtos$Token$Builder.build(CoreNLPProtos.java:12145)
    at edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer.toProto(ProtobufAnnotationSerializer.java:238)
    at edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer.toProtoBuilder(ProtobufAnnotationSerializer.java:384)
    at edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer.toProto(ProtobufAnnotationSerializer.java:345)
    at edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer.toProtoBuilder(ProtobufAnnotationSerializer.java:494)
    at edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer.toProto(ProtobufAnnotationSerializer.java:456)
    at com.databricks.spark.corenlp.CoreNLP$$anonfun$1.apply(CoreNLP.scala:77)
    at com.databricks.spark.corenlp.CoreNLP$$anonfun$1.apply(CoreNLP.scala:73)  
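
One workaround I've been considering, in case it's the route you took: shade protobuf inside the corenlp fat jar so the 2.6.1 classes can't collide with the 2.5.0 that Spark ships. A rough build.sbt sketch below — untested, assumes sbt-assembly 0.14+ is on the plugin classpath, and the shaded package name is just a placeholder:

    // build.sbt (sketch) -- relocate protobuf packages inside the assembly jar
    // so corenlp resolves its own 2.6.1 copy instead of Spark's 2.5.0
    assemblyShadeRules in assembly := Seq(
      ShadeRule.rename("com.google.protobuf.**" -> "shaded.com.google.protobuf.@1").inAll
    )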

btw, it looks like corenlp 3.6.0 is available, but won't be released to maven central until sometime in january.

I waited for Spark 1.6.0 to give this another try.

I noticed this comment in the pom.xml at the root of the spark project:

    <!-- In theory we need not directly depend on protobuf since Spark does not directly
         use it. However, when building with Hadoop/YARN 2.2 Maven doesn't correctly bump
         the protobuf version up from the one Mesos gives. For now we include this variable
         to explicitly bump the version when building with YARN. It would be nice to figure
         out why Maven can't resolve this correctly (like SBT does). -->

    <dependency>
      <groupId>com.google.protobuf</groupId>
      <artifactId>protobuf-java</artifactId>
      <version>${protobuf.version}</version>
      <scope>${hadoop.deps.scope}</scope>
    </dependency>

So I just changed the <protobuf.version> property in the root pom.xml to 2.6.1, built from source, and rolled the dice.
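
For reference, the change amounts to this one property (the pom line linked above declares 2.5.0; I set it to 2.6.1):

    <!-- root pom.xml: was 2.5.0, bumped to match the protobuf that corenlp is compiled against -->
    <protobuf.version>2.6.1</protobuf.version>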

Oh, and one more thing that is likely related: I removed -Pkinesis-asl from my build command, since that profile seemed to depend on protobuf as well.
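
If anyone wants to see exactly which modules drag protobuf in before touching the pom, the standard Maven dependency:tree filter should show it (just a sanity check, not required for the build):

    # show every occurrence of protobuf-java in the dependency graph,
    # once without and once with the kinesis profile
    mvn dependency:tree -Dincludes=com.google.protobuf:protobuf-java
    mvn -Pkinesis-asl dependency:tree -Dincludes=com.google.protobuf:protobuf-java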

Here's the final build command:

export MAVEN_OPTS="-Xmx8g -XX:ReservedCodeCacheSize=512m" && \
./make-distribution.sh --name fluxcapacitor --tgz --with-tachyon --skip-java-test \
  -Phadoop-2.6 -Dhadoop.version=2.6.0 \
  -Psparkr -Phive -Phive-thriftserver \
  -Pspark-ganglia-lgpl -Pnetlib-lgpl \
  -DskipTests

Seems to be working for now. Fingers crossed.
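
For what it's worth, a quick way to confirm which protobuf jar actually ended up on the driver classpath is plain JDK reflection from spark-shell (nothing corenlp-specific here):

    // print the jar that protobuf's LazyStringList class was loaded from;
    // after the rebuild this should point at protobuf-java-2.6.1.jar
    println(classOf[com.google.protobuf.LazyStringList]
      .getProtectionDomain.getCodeSource.getLocation)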