spark-root/laurelin

Support for Kryo serialization

ghislainfourny opened this issue · 10 comments

It seems that Laurelin does not support Kryo serialization. The following error is thrown:

com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException

It would be nice to add Kryo support, as this makes Spark querying considerably faster.

(Kryo is activated by passing --conf spark.serializer=org.apache.spark.serializer.KryoSerializer to spark-submit.)
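
For reference, the same setting can be made programmatically before the session starts. A minimal sketch in Java, assuming Laurelin's "root" data source format and the test file used later in this thread (the app name and master are placeholders):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class KryoEnabledRead {
    public static void main(String[] args) {
        // Equivalent to passing --conf spark.serializer=... to spark-submit.
        SparkSession spark = SparkSession.builder()
                .appName("kryo-enabled-read")   // placeholder app name
                .master("local[*]")
                .config("spark.serializer",
                        "org.apache.spark.serializer.KryoSerializer")
                .getOrCreate();

        // Read a TTree named "tree" through the Laurelin data source.
        Dataset<Row> df = spark.read()
                .format("root")
                .option("tree", "tree")
                .load("testdata/uproot-small-flat-tree.root");
        df.show();

        spark.stop();
    }
}
```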

Thanks a lot, Andrew!

So, this actually is more complicated than I had hoped. The "typical" way to register classes with Kryo (from the user side) is to add the necessary classes to the SparkConf. But the data source isn't initialized until after the SparkContext is created, and once the SparkContext is initialized, changes to the config no longer take effect. I've asked the Spark mailing list for clarification, and I'll look into it some more.
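
To illustrate the ordering problem: registration has to happen on the SparkConf before the SparkContext exists, which is exactly what a data source cannot do on the user's behalf. A sketch, where MyRecord is a hypothetical application class standing in for whatever needs registering:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

public class KryoRegistrationSketch {
    // Hypothetical class; in the Laurelin case the classes that need
    // registering are internal to the data source, not user code.
    public static class MyRecord implements java.io.Serializable {
        public int value;
    }

    public static void main(String[] args) {
        // registerKryoClasses must be called before the context is created;
        // once a SparkContext exists, further changes here have no effect.
        SparkConf conf = new SparkConf()
                .setAppName("kryo-registration")  // placeholder app name
                .setMaster("local[*]")
                .set("spark.serializer",
                     "org.apache.spark.serializer.KryoSerializer")
                .registerKryoClasses(new Class<?>[] { MyRecord.class });

        SparkSession spark = SparkSession.builder().config(conf).getOrCreate();
        spark.stop();
    }
}
```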

Thanks a lot, Andrew! For now we are deactivating Kryo and using the Java serializer instead whenever we read ROOT files, which works fine (just slower), so this does not block us.
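
Since the Java serializer is Spark's default, the workaround amounts to not setting spark.serializer at all; spelled out explicitly, it would look like this sketch:

```java
import org.apache.spark.sql.SparkSession;

public class JavaSerializerFallback {
    public static void main(String[] args) {
        // Omitting spark.serializer entirely has the same effect, since
        // JavaSerializer is the default; it is shown here for clarity.
        SparkSession spark = SparkSession.builder()
                .appName("java-serializer-fallback")  // placeholder app name
                .master("local[*]")
                .config("spark.serializer",
                        "org.apache.spark.serializer.JavaSerializer")
                .getOrCreate();
        spark.stop();
    }
}
```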

@ghislainfourny Hi! I think the above PR (#85) fixes the Kryo serialization problem. Or at least, the test I wrote fails without the changes and passes with them. Would you be willing to test it and let me know if it works with your use case? I assume you don't set spark.kryo.registrationRequired to true, right?
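
For anyone reproducing the test: with spark.kryo.registrationRequired set to true, Kryo throws on any class that was not explicitly registered instead of falling back to writing fully qualified class names, so it is the stricter setting to check a fix against. A sketch:

```java
import org.apache.spark.sql.SparkSession;

public class RegistrationRequiredCheck {
    public static void main(String[] args) {
        // With registrationRequired=true, Kryo fails fast on unregistered
        // classes rather than serializing them with fully qualified names.
        SparkSession spark = SparkSession.builder()
                .appName("registration-required")  // placeholder app name
                .master("local[*]")
                .config("spark.serializer",
                        "org.apache.spark.serializer.KryoSerializer")
                .config("spark.kryo.registrationRequired", "true")
                .getOrCreate();
        spark.stop();
    }
}
```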

Hi Andrew, thanks a lot! I'll give it a try -- although we are using Laurelin via a Maven dependency, so I am not sure how easy it will be to point our project at a Laurelin checkout instead. If I am blocked, I will let you know.

Okay -- I'd prefer not to cut a release with a half-fix, but I honestly don't know the best way to publish an "intermediate artifact" in Maven. If you don't have a good idea either, I'll make a release from this branch.

Dear Andrew,

It is my pleasure to confirm that it works -- with Kryo enabled (see below). Thanks a lot!

Kind regards,
Ghislain

$ spark-submit target/spark-rumble-1.5-jar-with-dependencies.jar --shell yes --show-error-info yes
20/04/03 16:58:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    ____                  __    __   
   / __ \__  ______ ___  / /_  / /__ 
  / /_/ / / / / __ `__ \/ __ \/ / _ \  The distributed JSONiq engine
 / _, _/ /_/ / / / / / / /_/ / /  __/  1.5 "Southern Live Oak" beta
/_/ |_|\__,_/_/ /_/ /_/_.___/_/\___/


Master: local[*]
Item Display Limit: -
Output Path: -
Log Path: -
Query Path : -

rumble$ root-file("../laurelin/./testdata/uproot-small-flat-tree.root", "tree")
>>> 
>>> 
{ "Int32" : 0, "Int64" : 0, "UInt32" : 0, "UInt64" : 0, "Float32" : 0, "Float64" : 0, "ArrayInt32" : [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], "ArrayInt64" : [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], "ArrayUInt32" : [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], "ArrayUInt64" : [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], "ArrayFloat32" : [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], "ArrayFloat64" : [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], "N" : 0, "SliceInt32" : [ ], "SliceInt64" : [ ], "SliceUInt32" : [ ], "SliceUInt64" : [ ], "SliceFloat32" : [ ], "SliceFloat64" : [ ] }
{ "Int32" : 1, "Int64" : 1, "UInt32" : 1, "UInt64" : 1, "Float32" : 1, "Float64" : 1, "ArrayInt32" : [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ], "ArrayInt64" : [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ], "ArrayUInt32" : [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ], "ArrayUInt64" : [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ], "ArrayFloat32" : [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ], "ArrayFloat64" : [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ], "N" : 1, "SliceInt32" : [ 1 ], "SliceInt64" : [ 1 ], "SliceUInt32" : [ 1 ], "SliceUInt64" : [ 1 ], "SliceFloat32" : [ 1 ], "SliceFloat64" : [ 1 ] }
...