Performance issue with "deeply" nested structures
Closed this issue · 8 comments
Problem
We (the Kubernetes api-machinery folks) stumbled onto a performance issue in protobuf serialization: marshaling time appears to grow quadratically with the depth of the object, apparently because Size() is called repeatedly on the same sub-objects. We were lucky to discover this thanks to a new recursive structure: while profiling, the cumulative time was aggregated under a single function signature, which quickly raised a red flag. We believe the same cost affects serialization of all objects, but it is normally spread across many different types, which hides the problem when profiling.
Reproduce
Here are steps to reproduce. First, a very simple proto file:
syntax = "proto2";

message test {
  map<string, test> tests = 1;
}
Generate marshaling code:
protoc --gofast_out=. *.proto
Some benchmark tests:
package test

import (
	"testing"
)

var result []byte

func generateWide(width int) *Test {
	test := Test{
		Tests: map[string]*Test{},
	}
	for i := 0; i < width; i++ {
		test.Tests[string(rune('a'+i))] = nil
	}
	return &test
}

func generateDeep(depth int) *Test {
	test := Test{}
	current := &test
	for i := 0; i < depth; i++ {
		key := string(rune('a' + i))
		current.Tests = map[string]*Test{
			key: {},
		}
		current = current.Tests[key]
	}
	return &test
}
func benchMarshal(b *testing.B, test *Test) {
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

func BenchmarkWideProtobuf1(b *testing.B)   { benchMarshal(b, generateWide(1)) }
func BenchmarkWideProtobuf2(b *testing.B)   { benchMarshal(b, generateWide(2)) }
func BenchmarkWideProtobuf5(b *testing.B)   { benchMarshal(b, generateWide(5)) }
func BenchmarkWideProtobuf10(b *testing.B)  { benchMarshal(b, generateWide(10)) }
func BenchmarkWideProtobuf20(b *testing.B)  { benchMarshal(b, generateWide(20)) }
func BenchmarkWideProtobuf50(b *testing.B)  { benchMarshal(b, generateWide(50)) }
func BenchmarkWideProtobuf100(b *testing.B) { benchMarshal(b, generateWide(100)) }
func BenchmarkDeepProtobuf1(b *testing.B)   { benchMarshal(b, generateDeep(1)) }
func BenchmarkDeepProtobuf2(b *testing.B)   { benchMarshal(b, generateDeep(2)) }
func BenchmarkDeepProtobuf5(b *testing.B)   { benchMarshal(b, generateDeep(5)) }
func BenchmarkDeepProtobuf10(b *testing.B)  { benchMarshal(b, generateDeep(10)) }
func BenchmarkDeepProtobuf20(b *testing.B)  { benchMarshal(b, generateDeep(20)) }
func BenchmarkDeepProtobuf50(b *testing.B)  { benchMarshal(b, generateDeep(50)) }
func BenchmarkDeepProtobuf100(b *testing.B) { benchMarshal(b, generateDeep(100)) }
Produces the following results:
> go test -bench=.
goos: linux
goarch: amd64
BenchmarkWideProtobuf1-12 10000000 144 ns/op
BenchmarkWideProtobuf2-12 10000000 207 ns/op
BenchmarkWideProtobuf5-12 3000000 406 ns/op
BenchmarkWideProtobuf10-12 2000000 762 ns/op
BenchmarkWideProtobuf20-12 1000000 1408 ns/op
BenchmarkWideProtobuf50-12 500000 3569 ns/op
BenchmarkWideProtobuf100-12 200000 7289 ns/op
BenchmarkDeepProtobuf1-12 10000000 175 ns/op
BenchmarkDeepProtobuf2-12 3000000 446 ns/op
BenchmarkDeepProtobuf5-12 500000 2012 ns/op
BenchmarkDeepProtobuf10-12 200000 6477 ns/op
BenchmarkDeepProtobuf20-12 50000 23164 ns/op
BenchmarkDeepProtobuf50-12 10000 144829 ns/op
BenchmarkDeepProtobuf100-12 2000 592502 ns/op
PASS
ok _/usr/local/google/home/apelisse/code/proto-test 22.916s
The quadratic growth with depth is very obvious here: doubling the depth roughly quadruples the time per op, while the wide benchmarks scale roughly linearly with width.
Solutions
Since sub-field sizes in protobuf are varints, and to avoid copying objects, memoization is probably the best solution I could think of. I did notice that protoc generates a field named XXX_sizecache that could definitely be used for this, though ideally we'd want per-serialization memoization rather than per-object. Also, we generate our marshaling code from Go types rather than from .proto files (AFAICT), which means we can't easily (nor want to) embed extra fields in our own types.
Issue is tracked in kubernetes in kubernetes/kubernetes#76219.
Thanks for raising the issue. I will take a look.
We've noticed that marshaling objects backwards actually fixes the problem: we can marshal the nested data first, see how much space it took, and then write that size as a varint in front of it. It requires all sub-types to also be "reverse-marshalable".
We're suggesting the creation of a new interface:
type Relahsram interface {
	// Marshals backwards.
	LahsramOt(d []byte) (int, error)
}
Then, MarshalTo could be implemented as follows:
func (m *RandomType) MarshalTo(d []byte) (int, error) {
	return m.LahsramOt(d[:m.Size()])
}
We have a sort of toy implementation that we've tried, based on the reproduce steps above:
func (m *Test) Marshal() (dAtA []byte, err error) {
	size := m.Size()
	dAtA = make([]byte, size)
	n, err := m.LahsramOt(dAtA[:size])
	if err != nil {
		return nil, err
	}
	return dAtA[:n], nil
}

func (m *Test) MarshalTo(dAtA []byte) (int, error) {
	return m.LahsramOt(dAtA[:m.Size()])
}

func (m *Test) LahsramOt(dAtA []byte) (int, error) {
	i := len(dAtA)
	for k := range m.Tests {
		v := m.Tests[k]
		origI := i
		if v != nil {
			n1, err1 := v.LahsramOt(dAtA[:i])
			if err1 != nil {
				return 0, err1
			}
			i -= n1
			i = encodeVarintTest(dAtA, i, uint64(n1))
			i--
			dAtA[i] = 0x12
		}
		i -= len(k)
		copy(dAtA[i:], k)
		i = encodeVarintTest(dAtA, i, uint64(len(k)))
		i--
		dAtA[i] = 0xa
		i = encodeVarintTest(dAtA, i, uint64(origI-i))
		i--
		dAtA[i] = 0xa
	}
	return len(dAtA) - i, nil
}

func encodeVarintTest(dAtA []byte, offset int, v uint64) int {
	offset -= sovTest(v)
	base := offset
	for v >= 1<<7 {
		dAtA[offset] = uint8(v&0x7f | 0x80)
		v >>= 7
		offset++
	}
	dAtA[offset] = uint8(v)
	return base
}
Obviously we would have to change the generation code in many different ways. @jennybuckley and I have started working on that, but I thought it might be good to keep you updated since it's a pretty invasive change.
We do get encouraging results running the benchmarks again though:
> go test -bench=.
goos: linux
goarch: amd64
BenchmarkWideProtobuf1-12 5000000 255 ns/op
BenchmarkWideProtobuf2-12 5000000 352 ns/op
BenchmarkWideProtobuf5-12 2000000 694 ns/op
BenchmarkWideProtobuf10-12 1000000 1250 ns/op
BenchmarkWideProtobuf20-12 500000 2337 ns/op
BenchmarkWideProtobuf50-12 300000 5840 ns/op
BenchmarkWideProtobuf100-12 100000 11937 ns/op
BenchmarkDeepProtobuf1-12 5000000 245 ns/op
BenchmarkDeepProtobuf2-12 3000000 468 ns/op
BenchmarkDeepProtobuf5-12 1000000 1113 ns/op
BenchmarkDeepProtobuf10-12 500000 2098 ns/op
BenchmarkDeepProtobuf20-12 300000 4156 ns/op
BenchmarkDeepProtobuf50-12 200000 10578 ns/op
BenchmarkDeepProtobuf100-12 100000 22369 ns/op
PASS
ok _/usr/local/google/home/apelisse/code/proto-test 23.010s
What do you think?
The reverse marshaling looks cool.
I did try a size-caching approach last night. Not sure if I missed something, but here are the changes:
- a new CachedSize method;
- Size() updated to call CachedSize with dirty = true, in order to recalculate the size on re-marshal;
- the Size() call in MarshalTo updated to call CachedSize with dirty = false.
Though it might not be the best approach if someone uses MarshalTo without calling Size() first. Caching the size could be an explicit gogo option, which would enforce a stricter way of marshaling your message.
I get this result compared to your initial benchmark tests, running with: go test -bench=. -benchtime=10s
benchmark old ns/op new ns/op delta
BenchmarkWideProtobuf1-10 157 163 +3.82%
BenchmarkWideProtobuf2-10 244 257 +5.33%
BenchmarkWideProtobuf5-10 479 501 +4.59%
BenchmarkWideProtobuf10-10 1065 1040 -2.35%
BenchmarkWideProtobuf20-10 2072 1984 -4.25%
BenchmarkWideProtobuf50-10 5211 5072 -2.67%
BenchmarkWideProtobuf100-10 10173 10138 -0.34%
BenchmarkDeepProtobuf1-10 204 224 +9.80%
BenchmarkDeepProtobuf2-10 473 427 -9.73%
BenchmarkDeepProtobuf5-10 1981 993 -49.87%
BenchmarkDeepProtobuf10-10 6074 1882 -69.02%
BenchmarkDeepProtobuf20-10 20944 3261 -84.43%
BenchmarkDeepProtobuf50-10 125053 8410 -93.27%
BenchmarkDeepProtobuf100-10 481926 17017 -96.47%
It seems like it almost matches your reverse marshal benchmark.
I have two main concerns with this approach:
- Is this going to work well if you're serializing the same object in parallel? Marshal is typically considered a read-only operation, so people don't necessarily take locks on objects that are only "read". Adding this field to the struct makes marshaling a read/write operation.
- Because of the way Kubernetes goes from Go types -> proto -> generated Go types (and we drop the newly generated Go structs and only keep the methods), it means we would have to embed this new field in all of our structs.
On the other hand, this change is much simpler ...
Adding this field in the struct makes it read/write.
This is true. This will probably invalidate my approach even if it is simpler.
it means that we would have to embed this new field in all of our struct?
Yeah, my change is quite invasive: it implies a new field in the message structure. I was just playing around.
We can perhaps add the reverse marshaling as a message/file option. We might just have to add functionality to consider any XXX_unrecognized data if it is present.
We might just have to add functionality to consider any XXX_unrecognized data if it is present.
Yeah, we just dropped that because we didn't need it :-)
We have a ~1kloc patch almost ready for plugin/marshalto/marshalto.go that does all operations backwards. We're working on passing the tests.
Thanks for the update.