spark-root/laurelin

Support for Kryo serialization

ghislainfourny opened this issue · 10 comments

It seems that Laurelin does not support Kryo serialization. The following error is thrown:

com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException

It would be nice to add Kryo support, as this makes Spark querying considerably faster.

(Kryo is activated by passing --conf spark.serializer=org.apache.spark.serializer.KryoSerializer to spark-submit.)
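
For reference, the same setting can be made programmatically before the session starts. A minimal sketch in Java, assuming Laurelin's "root" data source format and the test file used later in this thread (the app name and master are placeholders):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class KryoEnabledRead {
    public static void main(String[] args) {
        // Equivalent to passing --conf spark.serializer=... to spark-submit.
        SparkSession spark = SparkSession.builder()
                .appName("kryo-enabled-read")   // placeholder app name
                .master("local[*]")
                .config("spark.serializer",
                        "org.apache.spark.serializer.KryoSerializer")
                .getOrCreate();

        // Read a TTree named "tree" through the Laurelin data source.
        Dataset<Row> df = spark.read()
                .format("root")
                .option("tree", "tree")
                .load("testdata/uproot-small-flat-tree.root");
        df.show();

        spark.stop();
    }
}
```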

Thanks a lot, Andrew!

So, this actually is more complicated than I had hoped. The "typical" way to register classes with Kryo (from the user side) is to add the necessary classes to the SparkConf. But the data source isn't initialized until after the SparkContext is created, and once the SparkContext is initialized, changes to the config no longer take effect. I've asked the Spark mailing list for clarification, and I'll look into it some more.
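
To illustrate the ordering problem: registration has to happen on the SparkConf before the SparkContext exists, which is exactly what a data source cannot do on the user's behalf. A sketch, where MyRecord is a hypothetical application class standing in for whatever needs registering:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

public class KryoRegistrationSketch {
    // Hypothetical class; in the Laurelin case the classes that need
    // registering are internal to the data source, not user code.
    public static class MyRecord implements java.io.Serializable {
        public int value;
    }

    public static void main(String[] args) {
        // registerKryoClasses must be called before the context is created;
        // once a SparkContext exists, further changes here have no effect.
        SparkConf conf = new SparkConf()
                .setAppName("kryo-registration")  // placeholder app name
                .setMaster("local[*]")
                .set("spark.serializer",
                     "org.apache.spark.serializer.KryoSerializer")
                .registerKryoClasses(new Class<?>[] { MyRecord.class });

        SparkSession spark = SparkSession.builder().config(conf).getOrCreate();
        spark.stop();
    }
}
```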

Thanks a lot, Andrew! For now we are deactivating Kryo and using the Java serializer instead whenever we read ROOT files, which works fine (just slower), so this does not block us.
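
Since the Java serializer is Spark's default, the workaround amounts to not setting spark.serializer at all; spelled out explicitly, it would look like this sketch:

```java
import org.apache.spark.sql.SparkSession;

public class JavaSerializerFallback {
    public static void main(String[] args) {
        // Omitting spark.serializer entirely has the same effect, since
        // JavaSerializer is the default; it is shown here for clarity.
        SparkSession spark = SparkSession.builder()
                .appName("java-serializer-fallback")  // placeholder app name
                .master("local[*]")
                .config("spark.serializer",
                        "org.apache.spark.serializer.JavaSerializer")
                .getOrCreate();
        spark.stop();
    }
}
```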

@ghislainfourny Hi! I think the above PR (#85) fixes the Kryo serialization problem. Or at least, the test I wrote fails without the changes and passes with them. Would you be willing to test it and let me know if it works with your use case? I assume you don't set spark.kryo.registrationRequired to true, right?
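
For anyone reproducing the test: with spark.kryo.registrationRequired set to true, Kryo throws on any class that was not explicitly registered instead of falling back to writing fully qualified class names, so it is the stricter setting to check a fix against. A sketch:

```java
import org.apache.spark.sql.SparkSession;

public class RegistrationRequiredCheck {
    public static void main(String[] args) {
        // With registrationRequired=true, Kryo fails fast on unregistered
        // classes rather than serializing them with fully qualified names.
        SparkSession spark = SparkSession.builder()
                .appName("registration-required")  // placeholder app name
                .master("local[*]")
                .config("spark.serializer",
                        "org.apache.spark.serializer.KryoSerializer")
                .config("spark.kryo.registrationRequired", "true")
                .getOrCreate();
        spark.stop();
    }
}
```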

Hi Andrew, thanks a lot! I'll give it a try -- although we are using Laurelin via a Maven dependency, so I am not sure how easy it will be to point our project at a Laurelin checkout instead. If I am blocked, I will let you know.

Okay -- I'd prefer not to cut a release with a half-fix, but I honestly don't know the best way to publish an "intermediate artifact" in Maven. If you don't have a good idea either, I'll make a release from this branch.

Dear Andrew,

It is my pleasure to confirm that it works -- with Kryo enabled (see below). Thanks a lot!

Kind regards,
Ghislain

$ spark-submit target/spark-rumble-1.5-jar-with-dependencies.jar --shell yes --show-error-info yes
20/04/03 16:58:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    ____                  __    __   
   / __ \__  ______ ___  / /_  / /__ 
  / /_/ / / / / __ `__ \/ __ \/ / _ \  The distributed JSONiq engine
 / _, _/ /_/ / / / / / / /_/ / /  __/  1.5 "Southern Live Oak" beta
/_/ |_|\__,_/_/ /_/ /_/_.___/_/\___/


Master: local[*]
Item Display Limit: -
Output Path: -
Log Path: -
Query Path : -

rumble$ root-file("../laurelin/./testdata/uproot-small-flat-tree.root", "tree")
>>> 
>>> 
{ "Int32" : 0, "Int64" : 0, "UInt32" : 0, "UInt64" : 0, "Float32" : 0, "Float64" : 0, "ArrayInt32" : [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], "ArrayInt64" : [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], "ArrayUInt32" : [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], "ArrayUInt64" : [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], "ArrayFloat32" : [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], "ArrayFloat64" : [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], "N" : 0, "SliceInt32" : [ ], "SliceInt64" : [ ], "SliceUInt32" : [ ], "SliceUInt64" : [ ], "SliceFloat32" : [ ], "SliceFloat64" : [ ] }
{ "Int32" : 1, "Int64" : 1, "UInt32" : 1, "UInt64" : 1, "Float32" : 1, "Float64" : 1, "ArrayInt32" : [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ], "ArrayInt64" : [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ], "ArrayUInt32" : [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ], "ArrayUInt64" : [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ], "ArrayFloat32" : [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ], "ArrayFloat64" : [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ], "N" : 1, "SliceInt32" : [ 1 ], "SliceInt64" : [ 1 ], "SliceUInt32" : [ 1 ], "SliceUInt64" : [ 1 ], "SliceFloat32" : [ 1 ], "SliceFloat64" : [ 1 ] }
...