tdunning/t-digest

Future-proof serialization

tdunning opened this issue · 24 comments

Currently, we depend on Java's serialization which is not a good strategy.

It would be much better to do our own and make all the different kinds of Digest use interchangeable formats.

You can overload Java serialization process to save what you want, for example in my example, methods getBytes and fromBytes already implements in your Digest classes.

In this example, fields of object are ignored and only declared "digestEncodeWithBytes" field is declared :

public abstract class TDigest implements Serializable {

    // For serialisation, overload ObjectInput/OutputStream reflection
    private static final ObjectStreamField[] serialPersistentFields = {
        new ObjectStreamField("digestEncodeWithBytes", byte[].class)
    };

    // Or use getBytes(ByteBuffer) instead
    public abstract byte[] getBytes();

    // Serialization create the object which implements this abstract class, so call a method to init it
    protected abstract void initWithBytes(byte[] b);

    /**
     * Overload {@link ObjectOutputStream#defaultWriteObject()}
     *
     * @param oos object output stream use for write
     * @throws IOException error during write process
     */
    private void writeObject(ObjectOutputStream oos) throws IOException {
        oos.putFields().put("digestEncodeWithBytes", getBytes());
        oos.writeFields();
    }

    /**
     * Overload {@link ObjectInputStream#defaultReadObject()}
     *
     * @param ois object input stream use for read
     * @throws IOException error during read process
     * @throws ClassNotFoundException class use in serialised stream unknown
     */
    private void readObject(ObjectInputStream ois) throws IOException, ClassNotFoundException {
        Object o = ois.readFields().get("digestEncodeWithBytes", null);
        if (o != null) {
            byte[] b = (byte[]) o;
            initWithBytes(b);
        }
    }
}

You can also overwrite deserialisation with : link

class HackedObjectInputStream extends ObjectInputStream {

    public HackedObjectInputStream(InputStream in) throws IOException {
        super(in);
    }

    @Override
    protected ObjectStreamClass readClassDescriptor() throws IOException, ClassNotFoundException {
        ObjectStreamClass resultClassDescriptor = super.readClassDescriptor();

        if (resultClassDescriptor.getName().equals("oldpackage.Clazz"))
            resultClassDescriptor = ObjectStreamClass.lookup(newpackage.Clazz.class);

        return resultClassDescriptor;
    }
}

Isn't it possible to add a Transcoder interface which has byte[] asBytes(SerializationModel) and SerializationModel fromBytes(byte[]) methods? SerializationModel here being a java object that is good enough to represent any digest. TDigest interface can now have a create method that takes SerializationModel as input which all digests implement. Basically taking out byte-level concerns out of the digest impls.

You can provide a default transcoder which does serialization similar to how it's done today (with bugfixes), while keeping the flexibility of accepting PRs from others that implement transcoders for other formats. If this line of thought makes sense to you and more importantly if we can come up with a common model to represent all digests, I will be happy to contribute a PR!

A little background: we have a project that does jvm profiling for clusters and all our components talk Protobuf. We deal with latencies, scheduler delays and other metrics and streaming quantiles make a lot of sense. Hence the interest in adopting tdigest but it does not have protobuf support today nor the design is friendly enough to submit a PR for the same

No. Basically, I wanted to enable bring your own serialization logic with some sane ones provided by the lib. By SerializationModel, I am implying one java class something on the lines of:

class SerializationModel {
    int encodingType; double min; double max; float compression; // ... and other members
}

class MergingDigest {
    // ...
    static MergingDigest fromSerializationModel(SerializationModel);
}

interface Transcoder {
    byte[] asBytes(SerializationModel);
    SerializationModel fromBytes(byte[]);
}

class ProtobufTranscoder implements Transcoder {
    // ...
}

This keeps the byte magic out of digest implementations and they are responsible for generating a uniform on-heap representation (which is SerializationModel, I know this name sucks)
Users can instantiate any out of the box transcoder or write their own. A way to maintain backward compatibility would be for MergingDigest, etc to optionally take a transcoder as constructor param and use that (or default one if not provided) to keep existing asBytes, fromBytes method working.

If you have any other ideas on how tdigest can support multiple serialization formats (and hopefully you consider that as a feature to be added in the first place), would love to hear about them.

Or DigestModel? I see this as a dto representing digests.
Sure, I have some free time next week so expect a PR then.

You mention interchangeability of digest implementations. Presently, MergingDigest and AvlTreeDigest serialize different values so their serialized representation is not equivalent. Will need your inputs on structure of DigestModel and what fields of MergingDigest/AvlTreeDigest map to DigestModel fields.
The straightforward ones are min, max, compression, mean[]. But rest differ across impls.

@tdunning Need your inputs on what DigestModel looks like. As mentioned in previous comment, presently different fields are serialized across digest implementations

Apart from bufferSize, MergingDigest serializes lastUsedCell. For verbose encoding, this is deserialized and set as a property in digest. For small encoding, lastUsedCell additionally represents the values to read and assign in mean/weight array. Rest of the array seems to be zero-filled since n(array size) is part of serde as well.

Also, oddly MergingDigest#asBytes serializes tempMean.length but MergingDigest#fromBytes (verbose) does not expect it to be present in buffer. Is it a known issue or am I missing something there?

@anvinjain Do you have a PR coming?

We need to cut the new major release. If you can't get this in soon, it will have to wait.

This is being pushed beyond the 3.2 release.

@tdunning I have raised a PR around this.

@tdunning Did you get a chance to look at the PR?

@tdunning Ping. Is there a timeline for next release? If you don't see this PR #99 making the cut without major changes, let me know.

@tdunning looks like the PR was never merged. This is of interest to us too, so if there are any review comments we can take a look at addressing them too. Thanks in advance.

Here's my proposal: #162

  • Serializations can be configured via serialization.properties file.
  • The serialization gets instatniated with a java.util.Properties object containing all properties form the serialization.properties file
  • It can load custom implementations via classForName as long as they have been added to the classpath and configured in the serialization.properties file. It's up to the implementation what to do with those properties.

Example for serialization.properties file

tdigest.serializerClass=org.example.CustomSerializer
someProperty=someValue
someOtherProperty=someOtherValue

This is the current implementation of TDigestDefaultSerializer

public class TDigestDefaultSerializer extends AbstractTDigestSerializer<TDigest, byte[]> {
    
    public TDigestDefaultSerializer(Properties properties) {
        super(properties);
    }

    @Override
    public byte[] serialize(TDigest tDigest) throws TDigestSerializerException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream(5120);
        try (ObjectOutputStream out = new ObjectOutputStream(baos)) {
            out.writeObject(tDigest);
            return baos.toByteArray();
        } catch (IOException e) {
            throw new TDigestSerializerException(e);
        } 
    }

    @Override
    public TDigest deserialize(byte[] object) throws TDigestSerializerException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(object))) {
            return (TDigest) in.readObject();
        } catch (ClassCastException | ClassNotFoundException | IOException e) {
            throw new TDigestSerializerException(e);
        }
    }

}

This is the code for AbstractTDigestSerializer

public abstract class AbstractTDigestSerializer<T extends TDigest, O extends Object> implements TDigestSerializer<T, O> {
    private Properties properties;
    
    public AbstractTDigestSerializer(Properties properties) {
        this.properties = properties;
    }    
    
    public Properties getProperties() {
        return properties;
    }

}

Hope this is being seen as future proof.

There's still some work to do:

  • code formatting (I use 🌑Eclipse) and review (especially how I do generics 😅)
  • proper JavaDoc documentation in source code
  • additional tests for custom serialization implementations and property file handling