Serialize BucketMetadata#HashType as String?
clairemcginty opened this issue · 0 comments
Currently, BucketMetadata is written to JSON including the HashFunction enum value of its HashType, i.e.:
{
  "numBuckets": 2,
  "numShards": 1,
  "keyClass": "java.lang.String",
  "hashType": "MURMUR3_32",
  "keyField": "user_id"
}
However, this requires that every SMB read have the writer's HashFunction type available on the classpath in order to deserialize metadata.json back into a BucketMetadata. As a result, if we ever evolve the available HashFunctions and an SMB writer adopts a new one, all of its consumers will have to upgrade to the latest Scio before they can deserialize its BucketMetadata.
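The failure mode can be sketched with plain enum deserialization: Jackson maps a JSON string onto an enum by constant name, which is effectively Enum.valueOf. A reader whose enum predates a newly added constant throws on sight of it. (HashType constants other than MURMUR3_32 below, e.g. "ICEBERG", are hypothetical and only illustrate an unknown value.)

```java
// Reader-side enum that predates a newly added hash function (hypothetical).
enum HashType { MURMUR3_32, MURMUR3_128 }

class EnumDeserDemo {
    public static void main(String[] args) {
        // Deserializing a constant the reader knows about works fine:
        System.out.println(HashType.valueOf("MURMUR3_32")); // MURMUR3_32

        // But if a writer serialized a newer constant, older readers fail:
        try {
            HashType.valueOf("ICEBERG"); // hypothetical new hash function
        } catch (IllegalArgumentException e) {
            // Jackson surfaces this as a deserialization error on metadata.json
            System.out.println("Unknown hash type: " + e.getMessage());
        }
    }
}
```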
The reader only needs to know about HashFunction in order to assert that it's the same across all partitions/sources -- and we could use a String for that purpose. Should we future-proof the SMB API by serializing HashFunction explicitly as a String?
i.e.
 public abstract class BucketMetadata<K1, K2, V> implements Serializable, HasDisplayData {
-  @JsonProperty private final HashType hashType;
+  @JsonProperty private final String hashType;

+  // Used by writer code
   public HashType getHashType() {
-    return hashType;
+    return HashType.valueOf(hashType);
   }

+  // Works mostly as-is
   boolean isCompatibleWith(BucketMetadata other) {
     return other != null
         // version 1 is backwards compatible with version 0
         && (this.version <= 1 && other.version <= 1)
-        && this.hashType == other.hashType
+        && this.hashType.equals(other.hashType)
         // This check should be redundant since power of two is checked in BucketMetadata
         // constructor, but it's cheap to double-check.
         && (Math.max(numBuckets, other.numBuckets) % Math.min(numBuckets, other.numBuckets) == 0);
   }
 }
Since HashType is already written using its String representation, there should be no cross-version compatibility issues associated with this change -- you'd still be able to read old metadata.json files.
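The compatibility argument rests on the fact that the enum's serialized form is already its name, so a String field reads old files byte-for-byte, and the writer-side getter can still recover the enum. A minimal sketch (enum and variable names are illustrative):

```java
// Minimal stand-in for the real HashType enum.
enum HashType { MURMUR3_32 }

class CompatDemo {
    public static void main(String[] args) {
        // What metadata.json already contains is the enum's name:
        String stored = HashType.MURMUR3_32.name();
        System.out.println(stored); // MURMUR3_32

        // A String-typed field reads that value unchanged, and the
        // writer-side getter can reconstruct the enum on demand:
        HashType roundTripped = HashType.valueOf(stored);
        System.out.println(roundTripped == HashType.MURMUR3_32); // true
    }
}
```

Readers, by contrast, never need to call valueOf at all; they only compare the stored Strings for equality across sources.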