awakesecurity/proto3-wire

Inconsistency in protocol buffers specification?

Gabriella439 opened this issue · 4 comments

The protocol buffers encoding specification says:

Normally, an encoded message would never have more than one instance of a non-repeated field. However, parsers are expected to handle the case in which they do. For numeric types and strings, if the same field appears multiple times, the parser accepts the last value it sees. For embedded message fields, the parser merges multiple instances of the same field, as if with the Message::MergeFrom method – that is, all singular scalar fields in the latter instance replace those in the former, singular embedded messages are merged, and repeated fields are concatenated. The effect of these rules is that parsing the concatenation of two encoded messages produces exactly the same result as if you had parsed the two messages separately and merged the resulting objects. That is, this:

MyMessage message;
message.ParseFromString(str1 + str2);

is equivalent to this:

MyMessage message, message2;
message.ParseFromString(str1);
message2.ParseFromString(str2);
message.MergeFrom(message2);

Let's call this a monoid homomorphism law for ParseFromString

However, this seems to conflict with another protocol buffers design decision to omit default fields when serializing message. Specifically, the protocol buffers 3 language guide says:

When a message is parsed, if the encoded message does not contain a particular singular element, the corresponding field in the parsed object is set to the default value for that field.

Defaulting values breaks the above monoid homomorphism law if you consider the following message type:

message Example {
  int32 foo = 1;
}

... and then you serialize the following two messages:

  • message1 : an Example message with the field foo set to 1
  • message2: an Example message with the field foo set to 0

If you deserialize each ByteString independently and combine the resulting Messages you would get the field foo set to 0 (since the latter Message would default to foo = 0 and fields from the latter Message take precedence), but if you combined the ByteStrings before deserializing, you would get the field foo set to 1 (since the 0 field was never explicitly serialized)

Is there something that I'm missing or is the specification inconsistent?

I cannot tell from the documentation, but perhaps MergeFrom would take the 1 over the 0, even if the 0 comes second, simply because 0 is the default, and therefore equivalent in some sense to omission?

This is indeed confusing. I just tried it with the official Python implementation, and both cases result in foo = 1.

I believe the key is interpreting the phrase

if the same field appears multiple times, the parser accepts the last value it sees.

It seems that "appears" means "occurs explicitly in the serialized message". A field set to 0 never "appears" because it's set to a default value, and so it isn't present in the serialized message.

This seems like a bad idea (it's surprising that MergeFrom doesn't merge in defaults), but I guess that's the spec...

Here's some code:

syntax = "proto3";

package test;

message Example {
  int32 foo = 1;
}
import test_pb2

x = test_pb2.Example()
x.foo = 1
xStr = x.SerializeToString()

y = test_pb2.Example()
y.foo = 0
yStr = y.SerializeToString()

fromConcat = test_pb2.Example()
fromConcat.ParseFromString(xStr + yStr)

fromMerge = test_pb2.Example()
fromMerge.ParseFromString(xStr)
yParsed = test_pb2.Example()
yParsed.ParseFromString(yStr)
fromMerge.MergeFrom(yParsed)

print('fromConcat: ' + fromConcat.__str__())
print('fromMerge: ' + fromMerge.__str__())

That seems to be consistent with @j6carey's interpretation, which is that merging two scalar fields returns the last field with a non-default value. In other words:

newtype ProtobufInt32 = ProtobufInt { unProtobufInt :: Int32 }

instance Monoid ProtobufInt32 where
    mempty = ProtobufInt32 0

    mappend x (ProtobufInt32 0) = x
    mappend _  x                = x

... which is actually a law-abiding Monoid, albeit weird. That in turns preserves the monoid morphism law

Agreed. Keep in mind that since the required/optional distinction of proto2 went away, default values are often used semantically to indicate value absence. I think this is why they provide things like Int32Value (among many similar types) for when one wants to e.g. actually use zero as a meaningful value.

So the Monoid is indeed weird but it makes sense along the lines of "merge only the present values together, and when no values are present, use the default value." That's my $0.02, anyway.