Memory-usage benchmark
Julio-Guerra opened this issue · 15 comments
Hello,
I wrote a benchmark to see how your wonderful package would perform, and it looks like it is too memory intensive: ~400 bytes on average are allocated during the lookup (so added to the size of the tree), and ~200 B on average to insert a value. For a uint8 tree, that sounds like a lot and is unexpected to me. And for a 1,000,000-node tree, it therefore needs ~400 MB for the tree and ~200 MB for the lookups.
I was expecting only a few bytes would be added for new entries sharing part of their prefix with existing ones.
My results:
goarch: amd64
pkg: github.com/sqreen/go-agent/agent/internal/actor
BenchmarkTree/IPv4/Lookup/Random_addresses/1-8 200000 5630 ns/op 408 B/op 7 allocs/op
BenchmarkTree/IPv4/Lookup/Random_addresses/10-8 500000 4715 ns/op 408 B/op 7 allocs/op
BenchmarkTree/IPv4/Lookup/Random_addresses/100-8 500000 5361 ns/op 408 B/op 7 allocs/op
BenchmarkTree/IPv4/Lookup/Random_addresses/1000-8 300000 5631 ns/op 408 B/op 7 allocs/op
BenchmarkTree/IPv4/Lookup/Random_addresses/10000-8 500000 4433 ns/op 408 B/op 7 allocs/op
BenchmarkTree/IPv4/Lookup/Random_addresses/100000-8 300000 4095 ns/op 408 B/op 7 allocs/op
BenchmarkTree/IPv4/Lookup/Random_addresses/1000000-8 300000 4654 ns/op 408 B/op 7 allocs/op
BenchmarkTree/IPv4/Lookup/Realisitic_Random_Networks/1-8 500000 4986 ns/op 408 B/op 7 allocs/op
BenchmarkTree/IPv4/Lookup/Realisitic_Random_Networks/10-8 300000 4978 ns/op 408 B/op 7 allocs/op
BenchmarkTree/IPv4/Lookup/Realisitic_Random_Networks/100-8 200000 5711 ns/op 408 B/op 7 allocs/op
BenchmarkTree/IPv4/Lookup/Realisitic_Random_Networks/1000-8 200000 5399 ns/op 408 B/op 7 allocs/op
BenchmarkTree/IPv4/Lookup/Realisitic_Random_Networks/10000-8 200000 5719 ns/op 412 B/op 7 allocs/op
BenchmarkTree/IPv4/Lookup/Realisitic_Random_Networks/100000-8 300000 4738 ns/op 425 B/op 10 allocs/op
BenchmarkTree/IPv4/Lookup/Realisitic_Random_Networks/1000000-8 100000 13806 ns/op 654 B/op 13 allocs/op
BenchmarkTree/IPv4/Lookup/Random_Networks/1-8 500000 5120 ns/op 408 B/op 7 allocs/op
BenchmarkTree/IPv4/Lookup/Random_Networks/10-8 300000 5549 ns/op 408 B/op 7 allocs/op
BenchmarkTree/IPv4/Lookup/Random_Networks/100-8 300000 5479 ns/op 415 B/op 8 allocs/op
BenchmarkTree/IPv4/Lookup/Random_Networks/1000-8 200000 7534 ns/op 501 B/op 12 allocs/op
BenchmarkTree/IPv4/Lookup/Random_Networks/10000-8 100000 18245 ns/op 1268 B/op 15 allocs/op
BenchmarkTree/IPv4/Lookup/Random_Networks/100000-8 10000 159465 ns/op 11914 B/op 18 allocs/op
BenchmarkTree/IPv4/Lookup/Random_Networks/1000000-8 500 2589345 ns/op 111124 B/op 19 allocs/op
BenchmarkTree/IPv4/Insertion/Consequitive_addresses/1000000-8 3000000 458 ns/op 262 B/op 0 allocs/op
BenchmarkTree/IPv4/Insertion/Random_addresses/1000000-8 1000000 1287 ns/op 222 B/op 0 allocs/op
BenchmarkTree/IPv4/Insertion/Random_networks/1000000-8 2000000 827 ns/op 111 B/op 0 allocs/op
And here is the benchmark:
func BenchmarkTree(b *testing.B) {
b.Run("IPv4", func(b *testing.B) {
b.Run("Lookup", func(b *testing.B) {
b.Run("Random addresses", func(b *testing.B) {
for n := 1; n <= 1000000; n *= 10 {
n := n
tree, cidrs := RandTreeV4(b, n, RandIPv4)
b.Run(fmt.Sprintf("%d", len(cidrs)), func(b *testing.B) {
b.ReportAllocs()
for n := 0; n < b.N; n++ {
// Pick a random CIDR that was inserted
ix := int(rand.Int63n(int64(len(cidrs))))
cidr := cidrs[ix]
tags, err := tree.FindTags(*cidr)
require.NoError(b, err)
require.NotEqual(b, len(tags), 0)
}
})
}
})
b.Run("Realisitic Random Networks", func(b *testing.B) {
for n := 1; n <= 1000000; n *= 10 {
n := n
tree, cidrs := RandTreeV4(b, n, RandRealisticCIDRv4)
b.Run(fmt.Sprintf("%d", len(cidrs)), func(b *testing.B) {
b.ReportAllocs()
for n := 0; n < b.N; n++ {
// Pick a random CIDR that was inserted
ix := int(rand.Int63n(int64(len(cidrs))))
cidr := cidrs[ix]
tags, err := tree.FindTags(*cidr)
require.NoError(b, err)
require.NotEqual(b, len(tags), 0)
}
})
}
})
b.Run("Random Networks", func(b *testing.B) {
for n := 1; n <= 1000000; n *= 10 {
n := n
tree, cidrs := RandTreeV4(b, n, RandCIDRv4)
b.Run(fmt.Sprintf("%d", len(cidrs)), func(b *testing.B) {
b.ReportAllocs()
for n := 0; n < b.N; n++ {
// Pick a random CIDR that was inserted
ix := int(rand.Int63n(int64(len(cidrs))))
cidr := cidrs[ix]
tags, err := tree.FindTags(*cidr)
require.NoError(b, err)
require.NotEqual(b, len(tags), 0)
}
})
}
})
})
b.Run("Insertion", func(b *testing.B) {
cidr, _, err := patricia.ParseIPFromString("1.2.3.4")
require.NoError(b, err)
b.Run("Consequitive addresses/1000000", func(b *testing.B) {
b.ReportAllocs()
tree := uint8_tree.NewTreeV4()
for n := 0; n < b.N; n++ {
cidr.Address += 1
tree.Set(*cidr, 0)
}
})
b.Run("Random addresses/1000000", func(b *testing.B) {
b.ReportAllocs()
tree := uint8_tree.NewTreeV4()
for n := 0; n < b.N; n++ {
cidr.Address = rand.Uint32()
tree.Set(*cidr, 0)
}
})
b.Run("Random networks/1000000", func(b *testing.B) {
b.ReportAllocs()
tree := uint8_tree.NewTreeV4()
for n := 0; n < b.N; n++ {
cidr.Address = rand.Uint32()
cidr.Length = 1 + (uint(rand.Uint32()) % uint(32))
tree.Set(*cidr, 0)
}
})
})
})
}
func RandIPv4() string {
return net.IPv4(byte(rand.Uint32()), byte(rand.Uint32()), byte(rand.Uint32()), byte(rand.Uint32())).String()
}
func RandCIDRv4() string {
return fmt.Sprintf("%s/%d", RandIPv4(), 1+rand.Uint32()%32)
}
func RandRealisticCIDRv4() string {
return fmt.Sprintf("%s/%d", RandIPv4(), 10+(rand.Uint32()%23))
}
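(The RandTreeV4 helper isn't shown above. Purely as a sketch of what such a helper might look like - an assumption based only on how it is called, not the actual harness, and assuming NewTreeV4() returns a *uint8_tree.TreeV4 - it could be something like:)
// Hypothetical sketch of the missing RandTreeV4 helper: build a tree with n
// entries produced by randEntry and return the tree together with the parsed
// CIDRs that were inserted, so the lookup benchmarks can query known entries.
func RandTreeV4(b *testing.B, n int, randEntry func() string) (*uint8_tree.TreeV4, []*patricia.IPv4Address) {
	tree := uint8_tree.NewTreeV4()
	cidrs := make([]*patricia.IPv4Address, 0, n)
	for i := 0; i < n; i++ {
		v4, _, err := patricia.ParseIPFromString(randEntry())
		require.NoError(b, err)
		tree.Set(*v4, uint8(i)) // the tag value is arbitrary here
		cidrs = append(cidrs, v4)
	}
	return tree, cidrs
}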
What do you think?
Thanks @Julio-Guerra, I really appreciate this! I'm looking into it.
- Blake
@Julio-Guerra Can you provide the rest of your test harness? I wrote my own RandTreeV4, but am still only seeing 3 allocations/op, or 2 allocations/op if I remove the require statements. I'd like to run exactly what you're running. Thanks!
@Julio-Guerra - when you say "~400 bytes on average are allocated during the lookup", what do you mean by "so added to the size of the tree"?
I'm not asserting that tree searching couldn't be made more efficient, but any allocations made during searching aren't added to the tree. That's either garbage or the returned slice that eventually falls out of scope and is reclaimed. Am I missing something?
Hello @wblakecaldwell
I must admit I totally forgot about this post here ^^
I found where the lookup memory allocation came from: testify assertions I was doing in the benchmark loops...
I wrote benchmarks for insertion, lookups and overall tree size and the results are totally fine now. Note that for the "Size" results, a benchmark "op" is the creation of a tree with the given size (1 to 1,000,000).
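(For reference, a sketch of how the timed loop can keep testify out of the measurement - benchLookups is a made-up helper name, and this illustrates the kind of change described, not necessarily the exact updated benchmark:)
// Sketch: do the lookup check with plain comparisons inside the timed loop
// and only go through the testing API on failure, so the assertion machinery
// no longer allocates on every iteration.
func benchLookups(b *testing.B, tree *uint8_tree.TreeV4, cidrs []*patricia.IPv4Address) {
	b.ReportAllocs()
	for n := 0; n < b.N; n++ {
		cidr := cidrs[rand.Intn(len(cidrs))]
		tags, err := tree.FindTags(*cidr)
		if err != nil || len(tags) == 0 {
			b.Fatalf("lookup failed for %v: %v", *cidr, err)
		}
	}
}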
Here are the results on my machine:
goos: linux
goarch: amd64
BenchmarkTree/IPv4/Lookup/Random_addresses/1-8 30000000 38.1 ns/op 0 B/op 0 allocs/op
BenchmarkTree/IPv4/Lookup/Random_addresses/10-8 20000000 79.7 ns/op 0 B/op 0 allocs/op
BenchmarkTree/IPv4/Lookup/Random_addresses/100-8 20000000 104 ns/op 0 B/op 0 allocs/op
BenchmarkTree/IPv4/Lookup/Random_addresses/1000-8 10000000 160 ns/op 0 B/op 0 allocs/op
BenchmarkTree/IPv4/Lookup/Random_addresses/10000-8 5000000 240 ns/op 0 B/op 0 allocs/op
BenchmarkTree/IPv4/Lookup/Random_addresses/100000-8 2000000 593 ns/op 0 B/op 0 allocs/op
BenchmarkTree/IPv4/Lookup/Random_addresses/1000000-8 1000000 1348 ns/op 0 B/op 0 allocs/op
BenchmarkTree/IPv4/Lookup/Realisitic_Random_Networks/1-8 50000000 33.3 ns/op 0 B/op 0 allocs/op
BenchmarkTree/IPv4/Lookup/Realisitic_Random_Networks/10-8 20000000 85.6 ns/op 0 B/op 0 allocs/op
BenchmarkTree/IPv4/Lookup/Realisitic_Random_Networks/100-8 20000000 110 ns/op 0 B/op 0 allocs/op
BenchmarkTree/IPv4/Lookup/Realisitic_Random_Networks/1000-8 10000000 159 ns/op 0 B/op 0 allocs/op
BenchmarkTree/IPv4/Lookup/Realisitic_Random_Networks/10000-8 5000000 251 ns/op 0 B/op 0 allocs/op
BenchmarkTree/IPv4/Lookup/Realisitic_Random_Networks/100000-8 3000000 579 ns/op 0 B/op 0 allocs/op
BenchmarkTree/IPv4/Lookup/Realisitic_Random_Networks/1000000-8 1000000 1130 ns/op 0 B/op 0 allocs/op
BenchmarkTree/IPv4/Lookup/Random_Networks/1-8 30000000 33.8 ns/op 0 B/op 0 allocs/op
BenchmarkTree/IPv4/Lookup/Random_Networks/10-8 20000000 76.2 ns/op 0 B/op 0 allocs/op
BenchmarkTree/IPv4/Lookup/Random_Networks/100-8 20000000 109 ns/op 0 B/op 0 allocs/op
BenchmarkTree/IPv4/Lookup/Random_Networks/1000-8 10000000 189 ns/op 0 B/op 0 allocs/op
BenchmarkTree/IPv4/Lookup/Random_Networks/10000-8 5000000 277 ns/op 0 B/op 0 allocs/op
BenchmarkTree/IPv4/Lookup/Random_Networks/100000-8 3000000 413 ns/op 0 B/op 0 allocs/op
BenchmarkTree/IPv4/Lookup/Random_Networks/1000000-8 1000000 1068 ns/op 0 B/op 0 allocs/op
BenchmarkTree/IPv4/Insertion/Consequitive_addresses-8 5000000 411 ns/op 312 B/op 0 allocs/op
BenchmarkTree/IPv4/Insertion/Random_addresses-8 1000000 1031 ns/op 222 B/op 0 allocs/op
BenchmarkTree/IPv4/Insertion/Random_networks-8 3000000 675 ns/op 131 B/op 0 allocs/op
BenchmarkTree/IPv4/Size/Consequitive_addresses/1-8 3000000 409 ns/op 528 B/op 5 allocs/op
BenchmarkTree/IPv4/Size/Consequitive_addresses/10-8 500000 2258 ns/op 2563 B/op 8 allocs/op
BenchmarkTree/IPv4/Size/Consequitive_addresses/100-8 50000 29411 ns/op 24068 B/op 27 allocs/op
BenchmarkTree/IPv4/Size/Consequitive_addresses/1000-8 5000 242016 ns/op 217435 B/op 86 allocs/op
BenchmarkTree/IPv4/Size/Consequitive_addresses/10000-8 500 3247224 ns/op 3049317 B/op 337 allocs/op
BenchmarkTree/IPv4/Size/Consequitive_addresses/100000-8 50 25945610 ns/op 24589887 B/op 4014 allocs/op
BenchmarkTree/IPv4/Size/Consequitive_addresses/1000000-8 3 395557315 ns/op 222662578 B/op 38336 allocs/op
BenchmarkTree/IPv4/Size/Random_addresses/1-8 5000000 366 ns/op 528 B/op 5 allocs/op
BenchmarkTree/IPv4/Size/Random_addresses/10-8 500000 2641 ns/op 2563 B/op 8 allocs/op
BenchmarkTree/IPv4/Size/Random_addresses/100-8 50000 29084 ns/op 24068 B/op 27 allocs/op
BenchmarkTree/IPv4/Size/Random_addresses/1000-8 5000 368916 ns/op 217428 B/op 86 allocs/op
BenchmarkTree/IPv4/Size/Random_addresses/10000-8 300 4266788 ns/op 3049333 B/op 337 allocs/op
BenchmarkTree/IPv4/Size/Random_addresses/100000-8 20 54399927 ns/op 24596554 B/op 4033 allocs/op
BenchmarkTree/IPv4/Size/Random_addresses/1000000-8 1 1044954541 ns/op 222687488 B/op 38597 allocs/op
BenchmarkTree/IPv4/Size/Random_networks/1-8 3000000 434 ns/op 528 B/op 5 allocs/op
BenchmarkTree/IPv4/Size/Random_networks/10-8 500000 3042 ns/op 2563 B/op 8 allocs/op
BenchmarkTree/IPv4/Size/Random_networks/100-8 50000 35549 ns/op 24069 B/op 27 allocs/op
BenchmarkTree/IPv4/Size/Random_networks/1000-8 5000 343440 ns/op 217438 B/op 86 allocs/op
BenchmarkTree/IPv4/Size/Random_networks/10000-8 300 4314723 ns/op 1738457 B/op 335 allocs/op
BenchmarkTree/IPv4/Size/Random_networks/100000-8 20 88186177 ns/op 14109788 B/op 4023 allocs/op
BenchmarkTree/IPv4/Size/Random_networks/1000000-8 1 2524714050 ns/op 222674552 B/op 38461 allocs/op
Note that I only benchmarked the subset of the package I use.
And I still think the tree size could be improved: the 1,000,000 and 100,000 cases give me more than 212 MB and 23 MB respectively. I tried to make sure the random IP creation doesn't allocate memory, so that the results only show what it takes to create the tree.
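(One way to keep the harness itself allocation-free - an assumption about the updated benchmark, not the original code - is to parse a template address once and overwrite its fields in the loop, the same trick the insertion benchmarks above already use. The helper name fillRandomV4 is made up for illustration.)
// Sketch: reuse one parsed IPv4Address and overwrite its fields with random
// values, so generating entries never allocates and the measured cost is
// only the tree itself.
func fillRandomV4(b *testing.B, tree *uint8_tree.TreeV4, treeSize int) {
	cidr, _, err := patricia.ParseIPFromString("1.2.3.4")
	require.NoError(b, err)
	for i := 0; i < treeSize; i++ {
		cidr.Address = rand.Uint32()
		cidr.Length = 1 + uint(rand.Uint32())%32
		tree.Set(*cidr, 0)
	}
}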
Ya, for sure. Everything can always be improved! From your benchmarks, it looks like there's a less-than-linear increase in memory usage. Not bad! :)
How are you coming up with those numbers? Figuring out how much memory is in use can be really tricky.
For example:
BenchmarkTree/IPv4/Size/Random_networks/1000000-8 1 2524714050 ns/op 222674552 B/op 38461 allocs/op
It uses 222,674,552 bytes, so ~212 MB.
I ran it again with the memory profiler and here are some details:
(pprof) top
Showing nodes accounting for 212.03MB, 99.29% of 213.55MB total
Dropped 18 nodes (cum <= 1.07MB)
flat flat% sum% cum cum%
160.88MB 75.33% 75.33% 212.03MB 99.29% github.com/kentik/patricia/uint8_tree.(*TreeV4).add
51.16MB 23.95% 99.29% 51.16MB 23.95% github.com/kentik/patricia/uint8_tree.(*TreeV4).addTag
The line-based output shows it comes from the slice of nodes :-) So 1,000,000 x sizeof(treeNodeV4) basically.
Can you check this test code and the commands you're running into a repo somewhere? I’ll poke at the code and see what I can come up with.
Awesome. I’ll get back to ya. Thanks!
I found some unused stuff lying around from an old implementation that was wasting 9 bytes per IPv6 node:
9d35c39
Are you using pprof memory profiling on your benchmarks? I think that gives good comparative results, but shows you the total memory allocated across all iterations. In your case, this is moot, since the text output of your benchmark shows that in this one test, there was only one iteration.
I just tried it for this code:
func BenchmarkMem(b *testing.B) {
// allocate a new 100-element slice each iteration; each one immediately becomes garbage
mySlice := make([]uint64, 0)
for n := 0; n < b.N; n++ {
mySlice = make([]uint64, 100)
mySlice[0] = 1
}
}
Even if that were live code doing all of that in a loop: yes, it allocated 9.24GB of memory, and yes, in a long-running service your system would show your binary holding that memory (at least briefly). But if there's memory pressure on the system, or the GC is feeling nice, almost all of it will be released back to the system.
I think I can explain what you're seeing.
Since I'm using a slice to hold the nodes, and when needed, I'm increasing its capacity by 2x (just like append does), you could have an extra (40 bytes) x (node_count), where node_count can be up to 2x the number of entries in the tree if it's packed pretty tight (every node has 2 children). This capacity-growing could also explain why you're seeing a less-than-linear increase in memory - if you added another entry, you'd see zero increase in memory, as it would just use the already-allocated additional capacity.
If you look at the allocations for add(...), you'll notice the doubling. We start with 96 bytes, go up to 11.69MB, then a doubling to 23.38MB, then again to 40MB and 80MB. Given that the profiler is showing the total number of allocations, and not the total amount of memory in use, I expect that the nodes slice is only using 80MB (the largest allocation that add(...) is responsible for), not 160MB (the sum of all allocations by add(...)). The additional 80MB is put back in the heap, and will be freed to the system by the runtime if needed.
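(To illustrate the mechanism, here's a simplified sketch. The node layout below is an assumption chosen to come out at 40 bytes on amd64, not the library's actual treeNodeV4, and ensureRoom is a made-up name.)
// Simplified sketch of the grow-by-2x behavior described above.
type fakeNode struct {
	left, right uint   // child indexes into the nodes slice
	prefix      uint32 // address bits for this node
	prefixLen   uint   // number of significant prefix bits
	tagCount    int    // how many tags hang off this node
}
// ensureRoom returns a slice with room for at least one more node, doubling
// the capacity when full. The old backing array becomes garbage: the profiler
// still counts it as allocated bytes even though it is no longer in use.
func ensureRoom(nodes []fakeNode) []fakeNode {
	if len(nodes) < cap(nodes) {
		return nodes
	}
	newCap := 2 * cap(nodes)
	if newCap == 0 {
		newCap = 1
	}
	bigger := make([]fakeNode, len(nodes), newCap)
	copy(bigger, nodes)
	return bigger
}
With 40-byte nodes, growing from 1M to 2M capacity would be exactly the 40MB and 80MB allocations seen in the profile above.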
This is where add(...) is allocating that 160MB:
https://github.com/kentik/patricia/blob/master/uint8_tree/tree_v4.go#L151
This is where addTag(...) is adding to the map:
https://github.com/kentik/patricia/blob/master/uint8_tree/tree_v4.go#L65
Now, as for what addTag(...) is doing... I haven't reviewed the internals of Go's map. Either it's still holding the full 51.16MB or it's not. I need to use a map (for reasons explained in the README), and I'm already using efficient uint64 keys, so this is as good as it's going to get. The map is map[uint64]uint8, which means that if perfectly efficient, adding a tag will add 8 bytes for the key plus 1 byte for the value, for a total of 9 bytes. As far as I can tell from your test code, you're only adding one tag per IP address, so for 1M tags, I expect to see 9MB instead of 51.16MB. If we look at the actual allocations under addTag(...), we notice the same doubling behavior as in add(...). I suspect this is the map adding capacity as entries are added to it, just like when adding capacity to a slice. I think it's safe to say that the map is actively holding onto between 23.38MB and 51.16MB of memory, with 9MB required for the keys and values. So, we're looking at somewhere between 14.38MB and 42.16MB of overhead (my money is on somewhere between 14.38MB and 20MB). There's 1M tags, so that's 14-42 bytes of overhead per tag (and again, my money is on 14-20 bytes per tag). Maps will always have overhead. What we're seeing here seems totally reasonable for such a general storage object.
We have instances of this tree storing tens of millions of IP addresses, and for us, the memory usage is acceptable. It's trading a little bit of memory to prevent the GC from having to peg 1-2 cores while scanning millions of pointers. See the README for details.
So, what can be done to improve this?
First option is a little unreasonable: if you knew how many nodes you'd need, I could add a new tree constructor that takes that number and allocates the nodes array to that capacity. You're not likely to know, though, how many nodes you'll need, since it's entirely dependent on how the tree is packed. If your data set is static, you could build a tree, ask it for the node count, and then rebuild the tree to that size, but that seems like more trouble than it's worth. I suppose I could just ask how many unique IP addresses you'll be adding and multiply that by 1.5, but that still seems hacky.
Second option would certainly use less memory, but adding to the tree would be very expensive. Rather than doubling capacity of the nodes slice when we run out of room, we could just add enough room for one more. I don't see this as reasonable either.
Third option is a middle-ground, and viable for situations where your data set is mostly static, and memory is precious. I could add a Compact() method to the tree that would rebuild its nodes array to the exact size that it requires. You'd see a temporary doubling of memory during this time, but the old slice would be given back to the heap, and eventually to the system, if needed. I could also do this with the map - allocating with enough room for the number of items it currently has. Note that this option would appear worse in your memory profiling, because more memory is allocated, but again, it's not showing in-use memory.
I'm in favor of the 3rd option for specialized cases. I'll try it out in a bit, and see if I can get some accurate performance stats on it.
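(Roughly, the idea might look like this - a sketch against a stand-in tree type, not a committed API; compactableTree and its nodes field are assumptions:)
// Rough sketch of the proposed Compact(): reallocate the node slice with
// len == cap so the slack left over from doubling can be reclaimed. The type
// uses the same illustrative fakeNode as the earlier sketch.
type compactableTree struct {
	nodes []fakeNode
}

func (t *compactableTree) Compact() {
	if len(t.nodes) == cap(t.nodes) {
		return // already tight
	}
	exact := make([]fakeNode, len(t.nodes)) // temporary doubling while both slices exist
	copy(exact, t.nodes)
	t.nodes = exact // the old, oversized backing array is now garbage
}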
TLDR for anyone that didn't read my previous post:
The profiler shows the sum of all allocations, not the total amount of memory in use. I suspect the tree in this example was using half of what's reported (~100MB instead of 212MB). Each node requires 40 bytes, each tag requires 9 (8 for the map key, 1 for the uint8 value). There are necessarily more nodes than IP addresses. Let's assume 1.5M nodes for the 1M IP addresses, so: 1,500,000 x 40 + 1,000,000 x 9 = 69MB + (slice/map capacity overhead)
I'm going to close this for now, as it's my understanding that the tree was using exactly half as much memory as the profiler was reporting, since the profiler reports allocations, not in-use memory. I'll add a Compact() method at some point, which will tighten up the tree once populated.
@Julio-Guerra - Lemme know if you disagree with my assessment, or have any questions or comments about the Compact() plan. Thanks again for the feedback!
@wblakecaldwell Thanks for having a closer look at it.
Yes, I agree my benchmark results give the sum of everything, which is what I consider to be the memory pressure in these situations.
Regarding Compact(), I am not sure I would use it, since it would double the memory pressure and we are already in a worst case here. But it's worth trying, I guess.
Second option would certainly use less memory, but adding to the tree would be very expensive. Rather than doubling capacity of the nodes slice when we run out of room, we could just add enough room for one more. I don't see this as reasonable either.
I'm currently looking at the memory usage and wanted to share a thought. Currently, when add() exceeds cap(t.nodes), it creates a new copy with 2x the capacity.
I've been testing changing that from 2x to 1.1x. There will be more grow events, but in the end I expect less wasted capacity.
In my own work I can often know how many nodes I plan to add to the tree, so I'm also experimenting with updating the New() method to allow passing in the initial size.
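(Sketched with the same stand-in types as above - newTreeWithHint and growGently are made-up names, and the real New() and grow logic live inside the library - the two experiments might look something like:)
// newTreeWithHint pre-sizes the node slice so a workload of known size
// never triggers a grow/copy.
func newTreeWithHint(nodeHint int) *compactableTree {
	return &compactableTree{nodes: make([]fakeNode, 0, nodeHint)}
}

// growGently grows by roughly 1.1x instead of 2x: more grow events and more
// copying, but far less slack capacity left over once insertion stops.
func (t *compactableTree) growGently() {
	newCap := cap(t.nodes) + cap(t.nodes)/10 + 1
	bigger := make([]fakeNode, len(t.nodes), newCap)
	copy(bigger, t.nodes)
	t.nodes = bigger
}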