spotify/scio

Serialize BucketMetadata#HashType as String?

clairemcginty opened this issue · 0 comments

Currently, BucketMetadata is written to JSON including the HashFunction enum value of its HashType, i.e.:

{
  "numBuckets": 2,
  "numShards": 1,
  "keyClass": "java.lang.String",
  "hashType": "MURMUR3_32",
  "keyField":" user_id"
}

However, this requires that all SMB reads have the writer's HashFunction type available on the classpath in order to deserialize metadata.json back into a BucketMetadata. As a result, if we ever evolve the available HashFunctions and an SMB writer wants to use a new one, all of its consumers will have to upgrade to the latest Scio to be able to deserialize its BucketMetadata.

The reader only needs to know about HashFunction in order to assert that it's the same across all partitions/sources -- and we could use a String for that purpose. Should we future-proof the SMB API by serializing HashFunction explicitly as a String?

i.e.

public abstract class BucketMetadata<K1, K2, V> implements Serializable, HasDisplayData {
-   @JsonProperty private final HashType hashType;
+   @JsonProperty private final String hashType;

+ // Used by writer code
  public HashType getHashType() {
-   return hashType;
+   return HashType.valueOf(hashType);
  }

+ // Works mostly as-is
  boolean isCompatibleWith(BucketMetadata other) {
    return other != null
        // version 1 is backwards compatible with version 0
        && (this.version <= 1 && other.version <= 1)
-        && this.hashType == other.hashType
+        && this.hashType.equals(other.hashType)
        // This check should be redundant since power of two is checked in BucketMetadata
        // constructor, but it's cheap to double-check.
        && (Math.max(numBuckets, other.numBuckets) % Math.min(numBuckets, other.numBuckets) == 0);
  }
}

Since HashType is already written using its String representation there should be no cross-version compatibility issues associated with this change -- you'd still be able to read old metadata.json files