altoo-ag/akka-kryo-serialization

Adding a SubclassResolver?

Closed this issue · 6 comments

I'm using this library for Akka Persistence, and so far it's mostly going pretty well. But I've been hitting Issue #25 in spades.

The problem is, I want to be hard-assed about making sure all classes get properly registered -- I've built a lot of infrastructure into my system to assign Kryo IDs to the persistent classes, and I've turned on idstrategy="explicit" to check this.

That turns out to be really problematic with the standard Scala and Akka types, because the type you think you're working with is frequently not the actual class getting serialized. Whether you're talking about Map (as in Issue #25), or ActorRef, or whatever, the public-facing type is frequently a trait or abstract class, masking a whole hive of little special-case subclasses (often deep trees of them, most of them private) to deal with various particular situations or optimizations. Even something as apparently straightforward as ActorRef turns out to have several variants -- indeed, I haven't even managed to track down the common across-the-network version.

So if you're requiring explicit registration, the standard classes wind up being a disaster -- you go to serialize something straightforward like Map, get an error telling you that it doesn't know how to serialize Map$Map1, add a registration for that, and you're fine until the next little case comes around. (It turns out that Set has four tiny mini-classes, to optimize the special cases of sets of 1-4 items.)

To deal with this, I'm trying something fairly radical: adding a new SubclassResolver, a subclass of DefaultClassResolver. In cases of a registration miss, this searches up the Class hierarchy, to see if an ancestor Type is registered. If so, it adds a pointer from this Type to that Registration and continues.

In theory, this should allow a lot of your default Serializers, for things like Map and ActorRef, to work properly in the registration-required world. If, say, the serializer for Map is registered, and we try to serializer Map$Map1, it will automatically find the registration for Map, add a pointer to that, and move on. Since the serializer for Map is written generically, it shouldn't care which subclass was originally fed into it.

(Obviously, this needs to be used with care, and only applies in cases where the registered superclass knows enough to serialize and deserialize all of the subclasses. But this looks pretty common.)

Which leads to a different problem: the ClassResolver must be set when the Kryo instance is created, and that's happening inside the private KryoSerializer.getKryo() method. I have no way to override this in my project, so I'm left with no choice but to fork this library.

So: I think I have to implement this new SubclassResolver, and I have to do it in a fork of the library. Would you like me to do so in a way that it can get merged into the main line? I'd rather keep working with the official library rather than my own private fork, and I don't think this problem is confined to me. If this sounds like something you'd be willing to release officially, I'll do this right, and put it behind a config switch, so that others can opt into it.

I'd appreciate a thumbs up or down relatively soon here, before I get too deeply into it, since it's likely to affect exactly how I code it. Thanks...

Actually, what the heck -- it's not a lot of work to do this properly. I might as well follow your style rather than cheating; it's not much more work.

I've just checked in a first draft of the concept, including a couple of quick unit tests, which show off what I'm talking about. Notice how I'm now only registering collection.immutable.Map, and it works as expected, despite the fact that the types actually being serialized are actually subclasses of that. The SubclassResolver is automatically detecting that a supertype is registered, and adding pseudo-registrations for the subclasses as needed.

I've put in a few tweaks to minimize the performance impact of the changes; I think they should work decently well. The one complication is that this tool needs to be explicitly turned on (in KryoSerializer) before it begins to operate -- Kryo does a lot of registration lookups during register, and this isn't appropriate there. (It does mean that there's some danger is someone does additional registrations after the Kryo gets built -- I don't know if anybody ever does that.)

Note that I had to add a slight variant of the Map serializer to make this work -- since the registered type is abstract, the system can't instantiate it. Instead, we're using the default implementation of Map for all of these abstractly-registered subclasses. This is a cheat, and in some cases folks will want to do something different, but I suspect that this is going to produce the desired outcome for most use cases. And nothing about this approach prevents you from registering a more-precise implementation as well, if you care.

I think this fixes #25, and probably suffices for my needs. (We'll see -- that's the next step.) Please give it a look and think about whether we should try to incorporate this into the mainline, or whether I should publish my own fork for my purposes. Thanks...

romix commented

@jducoeur Thanks for raising this question!

I totally understand the overall direction and motivation. But I need some time to process it and to consider all consequences.

There is one issue with this proposal which concerns me. As far as I understand you want the serializer of the base class to be responsible for serialization of derived classes, if there is not explicit registration found for a derived class. The overall idea is appealing, but what about modularity? I'm afraid that this base class serializer may become very big and will consist of a spagetti code to handle all possible derived classes. What happens if there dozens of derives classes (cases)? What if there is a derived class which is not handled by the base class serializer? I'd like to hear your opinion about this.

I can understand the concern, and I don't think this mechanism should be used uncritically -- it requires a certain amount of care.

But in practice, this isn't true of any of the four cases I have so far -- immutable.Map, immutable.Set, ActorRef and ActorPath, all of which I am now using in live code. In each of these cases, the actual implementing class that gets passed to serialization can be one of a number of different options, often an obscure special case that optimizes in some particular way. And in none of these cases is that something you want to actually use for deserialization -- in all cases, deserialization should be handled by the companion object of the top-level trait, which is responsible for choosing the correct subclass for the situation.

In fact, these deserializers are all conspicuously simple. Indeed, in order to make this work, I had to create a slightly simpler version of your ScalaImmutableMapSerializer, that intentionally knows nothing whatsoever about the specific subclass of Map. I don't believe this is ever appropriate for a case where the serializer needs to know about the individual classes -- the entire point is that, in many cases, the choice of which class actually gets used is purely an implementation detail inside the library, which should never be touched outside of it. That's what got me thinking about this in the first place: trying to register for Kryo was the first time (in four years of full-time, hardcore Scala) I've ever had to even think about a HashTrieMap explicitly. And I didn't even know that Set1, Set2, Set3 and Set4 existed until I started trying to understand the serialization errors I was getting.

Map is a great example of this: there are a bunch of subclasses, but if you created it using simply immutable.Map, then you should never, ever, care about which one you're currently using. Set is an even better example: if you start with an empty immutable.Set, and add elements incrementally, it goes through six different implementation classes, but all of that is invisible to the application code.

(It's worth noting that this problem is far more severe if you're using idiomatic, immutable Scala. I would bet it rarely comes up in mutable code, but immutable data structures tend towards this model of many implementation variants of a single public trait.)

If this new mechanism was going to be the only way to do business, I'd be concerned, but it's strictly an opt-in superset, and I've tried to code it so that there is zero change unless you opt in. As I've coded it, it only gets used if you turn it on in config, and you can always use a traditional more-specific class when appropriate -- for example, I've added a unit test that registers both the abstract immutable.Map serializer for the general case, but also a ListMap serializer to deal with that subclass. That should tame any temptation towards spaghetti code -- in cases where a subclass needs different serialization, it should register a separate serializer.

So basically, the message I would put into the documentation is that this is an extra tool, to use when necessary (mainly if you are using idstrategy="explicit"); it isn't a panacea, but for certain class hierarchies it can make things much easier. Unless something changes, I'm definitely using it myself; I'd prefer to contribute it into the mainline, but that's up to you.

My fork is in pretty good shape at this point -- it's working and unit-tested, and my application is already using all four of the examples mentioned above; it seems to be working just as intended. I'd encourage you to take a look at the changes in the fork. If you're open to accepting it into the mainline, I'll add some documentation to the front page, and submit a proper PR...

romix commented

OK. I see. I've looked at your fork and I think I'd like to accept your changes. They seem to be useful.

Please submit a proper PR. It should include some documentation for the front-page.
Also, please remove the "Modified by" disclaimers from the files. We don't do it. Your name will be seen in the commit messages anyways, so your contribution will not be invisible.

Okay -- I'll try to get that done Monday; thanks.

(And yeah, I wasn't sure how to deal with the copyright statements. I spent a while trying to find any official guidance about what one should do when modifying an Apache 2 file, but couldn't find anything especially clear...)

looks like the pr got merged but the issue left behind...