gogo/protobuf

Performance issue with "deeply" nested structures

Closed this issue · 8 comments

Problem

We (kubernetes api-machinery folks) stumbled on a performance issue in protobuf serialization. Marshaling time appears to be quadratic in the depth of the object, apparently because "Size()" is called repeatedly on the same subtrees. We were lucky to discover the problem thanks to a new recursive structure: while profiling, the cumulative time aggregated under a single function signature, which quickly raised a red flag. We believe the same performance hit affects serialization of all objects, but it is usually spread across many different types, which hides the problem when profiling.

Reproduce

Here are some steps to reproduce. A very simple proto file:

syntax = "proto2";

message test {
  map<string, test> tests = 1;
}

Generate marshaling code:

protoc --gofast_out=. *.proto

Some benchmark tests:

package test

import (
	"testing"
)

var result []byte

func generateWide(width int) *Test {
	test := Test{
		Tests: map[string]*Test{},
	}
	for i := 0; i < width; i++ {
		test.Tests[string(rune('a'+i))] = nil
	}
	return &test
}

func generateDeep(depth int) *Test {
	test := Test{
		Tests: nil,
	}
	current := &test
	for i := 0; i < depth; i++ {
		current.Tests = map[string]*Test{
			string(rune('a' + i)): &Test{
				Tests: nil,
			},
		}
		current = current.Tests[string(rune('a'+i))]
	}
	return &test
}

func BenchmarkWideProtobuf1(b *testing.B) {
	test := generateWide(1)
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

func BenchmarkWideProtobuf2(b *testing.B) {
	test := generateWide(2)
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

func BenchmarkWideProtobuf5(b *testing.B) {
	test := generateWide(5)
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

func BenchmarkWideProtobuf10(b *testing.B) {
	test := generateWide(10)
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

func BenchmarkWideProtobuf20(b *testing.B) {
	test := generateWide(20)
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

func BenchmarkWideProtobuf50(b *testing.B) {
	test := generateWide(50)
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

func BenchmarkWideProtobuf100(b *testing.B) {
	test := generateWide(100)
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

func BenchmarkDeepProtobuf1(b *testing.B) {
	test := generateDeep(1)
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

func BenchmarkDeepProtobuf2(b *testing.B) {
	test := generateDeep(2)
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

func BenchmarkDeepProtobuf5(b *testing.B) {
	test := generateDeep(5)
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

func BenchmarkDeepProtobuf10(b *testing.B) {
	test := generateDeep(10)
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

func BenchmarkDeepProtobuf20(b *testing.B) {
	test := generateDeep(20)
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

func BenchmarkDeepProtobuf50(b *testing.B) {
	test := generateDeep(50)
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

func BenchmarkDeepProtobuf100(b *testing.B) {
	test := generateDeep(100)
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

Produces the following results:

> go test -bench=.
goos: linux
goarch: amd64
BenchmarkWideProtobuf1-12      	10000000	       144 ns/op
BenchmarkWideProtobuf2-12      	10000000	       207 ns/op
BenchmarkWideProtobuf5-12      	 3000000	       406 ns/op
BenchmarkWideProtobuf10-12     	 2000000	       762 ns/op
BenchmarkWideProtobuf20-12     	 1000000	      1408 ns/op
BenchmarkWideProtobuf50-12     	  500000	      3569 ns/op
BenchmarkWideProtobuf100-12    	  200000	      7289 ns/op
BenchmarkDeepProtobuf1-12      	10000000	       175 ns/op
BenchmarkDeepProtobuf2-12      	 3000000	       446 ns/op
BenchmarkDeepProtobuf5-12      	  500000	      2012 ns/op
BenchmarkDeepProtobuf10-12     	  200000	      6477 ns/op
BenchmarkDeepProtobuf20-12     	   50000	     23164 ns/op
BenchmarkDeepProtobuf50-12     	   10000	    144829 ns/op
BenchmarkDeepProtobuf100-12    	    2000	    592502 ns/op
PASS
ok  	_/usr/local/google/home/apelisse/code/proto-test	22.916s

The quadratic growth with depth is very obvious here.

Solutions

Since sub-message sizes in protobuf are encoded as varints, and since we want to avoid copying objects, memoization is probably the best solution I could think of. I did notice that protoc generates a field named XXX_sizecache that could definitely be used, though ideally we'd want per-serialization memoization rather than per-object. Also, we generate code from Go types rather than from .proto files (AFAICT), which means we can't easily (nor do we want to) embed extra fields in our own types.

Issue is tracked in kubernetes in kubernetes/kubernetes#76219.

Thanks for raising the issue. I will take a look.

We've noticed that marshaling objects backwards actually does fix the problem: we can marshal a sub-message first, see how much space it took, and then prepend that length as a varint. This requires all sub-types to also be "reverse-marshalable".

We're suggesting the creation of a new interface:

type Relahsram interface {
	// LahrsamOt marshals backwards, filling d from the end.
	LahrsamOt(d []byte) (int, error)
}

Then, MarshalTo could be implemented as follows:

func (m *RandomType) MarshalTo(d []byte) (int, error) {
	return m.LahrsamOt(d[:m.Size()])
}

We have tried a sort of toy implementation, based on the reproduction steps above:

func (m *Test) Marshal() (dAtA []byte, err error) {
	size := m.Size()
	dAtA = make([]byte, size)
	n, err := m.LahrsamOt(dAtA[:size])
	if err != nil {
		return nil, err
	}
	return dAtA[:n], nil
}

func (m *Test) MarshalTo(dAtA []byte) (int, error) {
	return m.LahrsamOt(dAtA[:m.Size()])
}

func (m *Test) LahrsamOt(dAtA []byte) (int, error) {
	// i is the write cursor; we fill dAtA from the end towards the front.
	i := len(dAtA)
	_ = i
	var l int
	_ = l
	if len(m.Tests) > 0 {
		for k := range m.Tests {
			v := m.Tests[k]
			origI := i
			if v != nil {
				// Marshal the value first, then prepend its length and tag.
				n1, err1 := v.LahrsamOt(dAtA[:i])
				if err1 != nil {
					return 0, err1
				}
				i -= n1
				i = encodeVarintTest(dAtA, i, uint64(n1))
				i--
				dAtA[i] = 0x12 // map entry value: field 2, wire type 2
			}
			// Prepend the key, its length, and its tag.
			i -= len(k)
			copy(dAtA[i:], k)
			i = encodeVarintTest(dAtA, i, uint64(len(k)))
			i--
			dAtA[i] = 0xa // map entry key: field 1, wire type 2
			// Prepend the whole entry's length and the tests field tag.
			i = encodeVarintTest(dAtA, i, uint64(origI-i))
			i--
			dAtA[i] = 0xa // tests: field 1, wire type 2
		}
	}
	return len(dAtA) - i, nil
}

// encodeVarintTest writes v as a varint ending just before offset and
// returns the new (lower) offset.
func encodeVarintTest(dAtA []byte, offset int, v uint64) int {
	offset -= sovTest(v)
	base := offset
	for v >= 1<<7 {
		dAtA[offset] = uint8(v&0x7f | 0x80)
		v >>= 7
		offset++
	}
	dAtA[offset] = uint8(v)
	return base
}

Obviously we would have to change the generation code in many different ways. @jennybuckley and I have started working on that, but I thought it might be good to keep you updated since it's a pretty invasive change.

We do get encouraging results running the benchmarks again though:

> go test -bench=.
goos: linux
goarch: amd64
BenchmarkWideProtobuf1-12      	 5000000	       255 ns/op
BenchmarkWideProtobuf2-12      	 5000000	       352 ns/op
BenchmarkWideProtobuf5-12      	 2000000	       694 ns/op
BenchmarkWideProtobuf10-12     	 1000000	      1250 ns/op
BenchmarkWideProtobuf20-12     	  500000	      2337 ns/op
BenchmarkWideProtobuf50-12     	  300000	      5840 ns/op
BenchmarkWideProtobuf100-12    	  100000	     11937 ns/op
BenchmarkDeepProtobuf1-12      	 5000000	       245 ns/op
BenchmarkDeepProtobuf2-12      	 3000000	       468 ns/op
BenchmarkDeepProtobuf5-12      	 1000000	      1113 ns/op
BenchmarkDeepProtobuf10-12     	  500000	      2098 ns/op
BenchmarkDeepProtobuf20-12     	  300000	      4156 ns/op
BenchmarkDeepProtobuf50-12     	  200000	     10578 ns/op
BenchmarkDeepProtobuf100-12    	  100000	     22369 ns/op
PASS
ok  	_/usr/local/google/home/apelisse/code/proto-test	23.010s

What do you think?

The reverse marshaling looks cool.
I did try a size-caching approach last night. Not sure if I missed something, but here are the changes:
CachedSize method

I then updated the Size() method to call cachedSize with dirty = true (in order to recalculate the size on remarshal):
dirty call in Size()
and updated the Size() call in MarshalTo to call CachedSize with dirty = false:
MarshalTo1
MarshalTo2

It might not be the best approach if someone uses MarshalTo without calling Size() first, though.
Caching the size could be an explicit gogo option, which would enforce a stricter way of marshaling your message.

I get these results compared to your initial benchmarks, running with: go test -bench=. -benchtime=10s

benchmark                       old ns/op     new ns/op     delta
BenchmarkWideProtobuf1-10       157           163           +3.82%
BenchmarkWideProtobuf2-10       244           257           +5.33%
BenchmarkWideProtobuf5-10       479           501           +4.59%
BenchmarkWideProtobuf10-10      1065          1040          -2.35%
BenchmarkWideProtobuf20-10      2072          1984          -4.25%
BenchmarkWideProtobuf50-10      5211          5072          -2.67%
BenchmarkWideProtobuf100-10     10173         10138         -0.34%
BenchmarkDeepProtobuf1-10       204           224           +9.80%
BenchmarkDeepProtobuf2-10       473           427           -9.73%
BenchmarkDeepProtobuf5-10       1981          993           -49.87%
BenchmarkDeepProtobuf10-10      6074          1882          -69.02%
BenchmarkDeepProtobuf20-10      20944         3261          -84.43%
BenchmarkDeepProtobuf50-10      125053        8410          -93.27%
BenchmarkDeepProtobuf100-10     481926        17017         -96.47%

It seems like it almost matches your reverse marshal benchmark.

I have mostly two concerns with this approach:

  1. Is this going to work well if you're serializing the same object in parallel? "Marshal" is typically considered a read-only operation, so people don't necessarily take locks on objects that are only being "read". Adding this field to the struct makes marshaling a read/write operation.
  2. Because of the way kubernetes goes from Go types -> proto -> generated Go types (and we drop the newly generated Go structs and only keep the methods), wouldn't we have to embed this new field in all of our structs?

On the other hand, this change is much simpler ...

Adding this field in the struct makes it read/write.

This is true. This will probably invalidate my approach even if it is simpler.

it means that we would have to embed this new field in all of our struct?
Yeah, my change is quite invasive; it requires a field in the message struct. I was just playing around.

We can perhaps add the reverse marshaling as a message/file option. We might just have to add functionality to consider any XXX_unrecognized data if it is present.

We might just have to add functionality to consider any XXX_unrecognized data if it is present.

Yeah, we just dropped that because we didn't need it :-)

We have a 1kloc patch almost ready for plugin/marshalto/marshalto.go that does all operations backwards. We're working on passing the tests.

Thanks for the update.