linkedin/goavro

Unions with logicalTypes are JSON-encoded incorrectly

kklipsch opened this issue · 10 comments

The spec states "A logical type is always serialized using its underlying Avro type so that values are encoded in exactly the same way as the equivalent Avro type that does not have a logicalType attribute."

But goavro names union branches as "type.logicalType". The codec will not accept a record whose union branch is named anything else, and it writes that name directly into the JSON output, which is incorrect. Binary encoding appears correct.

import (
	"testing"
	"time"

	"github.com/linkedin/goavro/v2"
	"github.com/stretchr/testify/require"
)

func TestNullableLogicalJSON(t *testing.T) {
	schema := `{
		"type": "record",
		"name": "nester",
		"namespace": "test",
		"fields" : [
			{"name": "a", "type": ["null", {"type": "long", "logicalType":"timestamp-millis"}], "default": null}
		]
	}`

	codec, err := goavro.NewCodec(schema)
	require.NoError(t, err)

	bs, err := codec.TextualFromNative(nil, map[string]interface{}{
		"a": goavro.Union("long.timestamp-millis", time.Date(2006, 1, 2, 15, 4, 5, 565000000, time.UTC)),
	})
	require.NoError(t, err)

	/*
		This is the bug. The asserted value is how the reference implementation encodes
		the timestamp in a nullable union, which matches the spec: "A logical type is
		always serialized using its underlying Avro type so that values are encoded in
		exactly the same way as the equivalent Avro type that does not have a logicalType
		attribute."

		goavro instead encodes this as {"a":{"long.timestamp-millis":1136214245565}}
	*/
	require.Equal(t, `{"a":{"long":1136214245565}}`, string(bs))
}
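For comparison, here is a minimal sketch of the binary path (the test name and assertions are mine, using the same schema and goavro v2 API as above). Per the report, binary encoding is correct; binary union encoding carries only the branch index, never the branch name, so there is no name to get wrong:

func TestNullableLogicalBinary(t *testing.T) {
	schema := `{
		"type": "record",
		"name": "nester",
		"namespace": "test",
		"fields" : [
			{"name": "a", "type": ["null", {"type": "long", "logicalType":"timestamp-millis"}], "default": null}
		]
	}`

	codec, err := goavro.NewCodec(schema)
	require.NoError(t, err)

	bin, err := codec.BinaryFromNative(nil, map[string]interface{}{
		"a": goavro.Union("long.timestamp-millis", time.Date(2006, 1, 2, 15, 4, 5, 565000000, time.UTC)),
	})
	require.NoError(t, err)

	// Round-trip: the branch name exists only in goavro's native form,
	// not in the encoded bytes, so decoding should succeed.
	native, _, err := codec.NativeFromBinary(bin)
	require.NoError(t, err)
	require.NotNil(t, native)
}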

I am also looking at logical types and comparing goavro to the Java implementation, and this code that I managed to put together seems to confirm your findings:

// > export JAVA_HOME=/usr/local/opt/openjdk
// > export PATH="${JAVA_HOME}/bin:$PATH"
// > java --version
// openjdk 18.0.1 2022-04-19
// OpenJDK Runtime Environment Homebrew (build 18.0.1+0)
// OpenJDK 64-Bit Server VM Homebrew (build 18.0.1+0, mixed mode, sharing)
// > java -cp avro_1.11/avro-1.11.0.jar:avro_1.11/jackson-core-2.12.5.jar:avro_1.11/jackson-annotations-2.12.5.jar:avro_1.11/jackson-databind-2.12.5.jar:avro_1.11/slf4j-api-1.7.32.jar Main.java

import java.io.*;
import org.apache.avro.Schema;
import org.apache.avro.io.*;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;

public class Main {
    static byte[] fromJsonToAvro(String json, Schema schema) throws Exception {
        InputStream input = new ByteArrayInputStream(json.getBytes());
        DataInputStream din = new DataInputStream(input);

        Decoder decoder = DecoderFactory.get().jsonDecoder(schema, din);

        DatumReader<Object> reader = new GenericDatumReader<Object>(schema);
        Object datum = reader.read(null, decoder);

        GenericDatumWriter<Object>  w = new GenericDatumWriter<Object>(schema);
        ByteArrayOutputStream outputStream = new ByteArrayOutputStream();

        Encoder e = EncoderFactory.get().binaryEncoder(outputStream, null);

        w.write(datum, e);
        e.flush();

        return outputStream.toByteArray();
    }

    public static String avroToJson(Schema schema, byte[] avroBinary) throws IOException {
        DatumReader<Object> datumReader = new GenericDatumReader<>(schema);
        Decoder decoder = DecoderFactory.get().binaryDecoder(avroBinary, null);
        Object avroDatum = datumReader.read(null, decoder);

        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DatumWriter<Object> writer = new GenericDatumWriter<>(schema);
        JsonEncoder encoder = EncoderFactory.get().jsonEncoder(schema, baos, false);
        writer.write(avroDatum, encoder);
        encoder.flush();
        baos.flush();
        return new String(baos.toByteArray());
    }

    static String schemaJSON = """
{
    "type": "record",
    "name": "nester",
    "namespace": "test",
    "fields" : [
        {"name": "a", "type": ["null", {"type": "long", "logicalType":"timestamp-millis"}], "default": null}
    ]
}
""";

    static String inputJSON = """
{"a":{"long":1136214245565}}
""";

    public static void main(String[] args) {
        try {
            Schema schema = new Schema.Parser().parse(schemaJSON);

            byte[] avroByteArray = fromJsonToAvro(inputJSON, schema);

            String outputJson = avroToJson(schema, avroByteArray);

            System.out.println(outputJson);
        }
        catch (Exception e) {
            System.out.println(e);
        }
    }
}

It outputs {"a":{"long":1136214245565}}. Also, if I change the inputJSON to {"a":{"long.timestamp-millis":1136214245565}}, it fails with org.apache.avro.AvroTypeException: Unknown union branch long.timestamp-millis, which makes sense given the output format...

Disclaimer: I'm not really experienced with Java and this is some copy/paste code from StackOverflow that I modified a bit....
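The mirror image on the Go side can be checked with a minimal sketch (assuming the goavro v2 API and the decoding behavior described in the report; I have not pasted actual output): goavro should reject the spec-compliant branch name "long" while accepting its own "long.timestamp-millis":

package main

import (
	"fmt"

	"github.com/linkedin/goavro/v2"
)

func main() {
	codec, err := goavro.NewCodec(`{
		"type": "record",
		"name": "nester",
		"namespace": "test",
		"fields": [
			{"name": "a", "type": ["null", {"type": "long", "logicalType":"timestamp-millis"}], "default": null}
		]
	}`)
	if err != nil {
		panic(err)
	}

	// Spec-compliant JSON, as produced by the Java reference implementation:
	// per the report, goavro rejects this branch name.
	_, _, err = codec.NativeFromTextual([]byte(`{"a":{"long":1136214245565}}`))
	fmt.Println("spec-compliant form:", err)

	// goavro's own non-standard branch name decodes without error today.
	_, _, err = codec.NativeFromTextual([]byte(`{"a":{"long.timestamp-millis":1136214245565}}`))
	fmt.Println("goavro form:", err)
}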

Thanks for this issue. This will take some thinking over here. I'm not sure if this means the current implementation is simply broken, or if it's just in violation of the spec (which can be regarded as a form of being broken too). I'm also not sure whether fixing this specific problem would cause breakage for the existing user base. It does sound like a risky fix. Please share your thoughts if you know about those topics.

From my point of view this means that JSON encoding is broken: you won't be able to use JSON encoded with goavro with other implementations of Avro.

It will likely break any use case where a current user is a) using JSON encoding with logicalTypes and b) reading and writing exclusively with goavro (because if they had another client in the mix, it would already be broken); see the sketch below.
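Concretely, a sketch of the round-trip such users depend on today (the codec is built from the schema above; the helper name is hypothetical, and the post-fix behavior is my extrapolation):

func demonstrateUpgradeBreakage(codec *goavro.Codec) error {
	// JSON previously persisted by today's goavro:
	stored := []byte(`{"a":{"long.timestamp-millis":1136214245565}}`)

	// Today this succeeds; after a spec-compliant fix the codec would
	// expect {"a":{"long":1136214245565}} and reject the stored bytes,
	// so already-persisted JSON would need a migration step.
	_, _, err := codec.NativeFromTextual(stored)
	return err
}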

Given how long this library has been around, a new major version should be released if this change is made. My understanding is that logical types are not used by many people, so I think most users won't have any issues when upgrading to the new major version, which they'd need to do manually anyway. Adding and maintaining a new API for this seems more painful long term...