linkedin/isolation-forest

The library gives an error when writing a model using Spark 2.4

bhushanbalki opened this issue · 6 comments

First of all, thanks for making the Isolation Forest library open source. We would like to use it with Spark 2.4.0, but when our Spark 2.4 job writes the model to HDFS it fails with a json4s-related error:

"Spark with json4s, parse function raise java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.parse$default$3()Z
Caused by: java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.parse(Lorg/json4s/JsonInput;Z)Lorg/json4s/JsonAST$JValue;"

We understand the breaking change comes from Spark 2.4.0, which upgraded to json4s 3.5.3, while your library builds against Spark 2.3, which uses json4s 3.2.11.

We tried building the Isolation Forest library against Spark 2.4, but the build fails. Can you help us make the library compatible with Spark 2.4.0? We understand the Scala code needs updating; could you help us with that?

Thanks for your interest in the library!

I believe the 2.4 builds are failing because, as of Spark 2.4.0, Databricks donated their spark-avro library to Apache Spark:

https://github.com/databricks/spark-avro

That Avro support is now built into Spark itself:

https://spark.apache.org/docs/2.4.0/sql-data-sources-avro.html
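With the built-in data source, the format name changes as well: code or tests that reference the old Databricks source name will fail to resolve it. A minimal sketch of the difference, assuming an existing SparkSession `spark` and DataFrame `df` (the path is a placeholder, not from the library):

```scala
// Spark 2.3 with the external Databricks package:
// df.write.format("com.databricks.spark.avro").save("/tmp/forest.avro")

// Spark 2.4+ with the built-in Avro source:
df.write.format("avro").save("/tmp/forest.avro")
val restored = spark.read.format("avro").load("/tmp/forest.avro")
```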

I was able to get the isolation-forest library to build successfully by changing the dependencies in the module-level build.gradle as follows:

dependencies {
    compile("com.chuusai:shapeless_2.11:2.3.2")
//    compile("com.databricks:spark-avro_2.11:4.0.0")
    compile("org.apache.spark:spark-avro_2.11:2.4.0")
    compile("org.apache.spark:spark-core_2.11:2.4.0")
    compile("org.apache.spark:spark-mllib_2.11:2.4.0")
    compile("org.apache.spark:spark-sql_2.11:2.4.0")
    compile("org.scalatest:scalatest_2.11:2.2.6")
    compile("org.testng:testng:6.8.8")
}
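To confirm the swap took effect, you can inspect Gradle's resolved dependency report and check which json4s and spark-avro versions actually end up on the classpath (the configuration name may differ depending on your Gradle version):

```shell
# Print the resolved dependency tree and look for json4s / spark-avro entries
./gradlew dependencies --configuration compile | grep -E 'json4s|spark-avro'
```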

Please let me know if this works for you.

@bhushanbalki : Did this work for you?

Yes, it works. Thanks for your input!

Hi! Thanks for this thread. I'm facing the same issue.
Just to be sure:
When I comment out compile("com.databricks:spark-avro_2.11:4.0.0"), 5 unit tests fail, all of them with the error "org.apache.spark.sql.AnalysisException: Failed to find data source: com.databricks.spark.avro. Please find an Avro package at http://spark.apache.org/third-party-projects.html;"

I guess this is expected, since Spark 2.4 now ships a built-in version of the com.databricks.spark.avro data source, is that correct?

@fabiofabris: You need to not only comment out the compile("com.databricks:spark-avro_2.11:4.0.0") dependency, but also add the compile("org.apache.spark:spark-avro_2.11:2.4.0") dependency. Please make sure your dependencies are as shown here: #1 (comment)

I followed the instructions mentioned in #1, but the error persists.