Memory-usage benchmark
Julio-Guerra opened this issue · 15 comments
Hello,
I wrote a benchmark to see how your wonderful package would perform, and it looks like it is too memory intensive: ~400 bytes on average are allocated during the lookup (so added to the size of the tree), and ~200 B on average to insert a value. For a uint8 tree, that sounds like a lot and is unexpected to me. And for a 1,000,000-node tree, it therefore needs ~400 MB for the tree and ~200 MB for the lookups.
I was expecting only a few bytes would be added for new entries sharing part of their prefix with existing ones.
My results:
goarch: amd64
pkg: github.com/sqreen/go-agent/agent/internal/actor
BenchmarkTree/IPv4/Lookup/Random_addresses/1-8 200000 5630 ns/op 408 B/op 7 allocs/op
BenchmarkTree/IPv4/Lookup/Random_addresses/10-8 500000 4715 ns/op 408 B/op 7 allocs/op
BenchmarkTree/IPv4/Lookup/Random_addresses/100-8 500000 5361 ns/op 408 B/op 7 allocs/op
BenchmarkTree/IPv4/Lookup/Random_addresses/1000-8 300000 5631 ns/op 408 B/op 7 allocs/op
BenchmarkTree/IPv4/Lookup/Random_addresses/10000-8 500000 4433 ns/op 408 B/op 7 allocs/op
BenchmarkTree/IPv4/Lookup/Random_addresses/100000-8 300000 4095 ns/op 408 B/op 7 allocs/op
BenchmarkTree/IPv4/Lookup/Random_addresses/1000000-8 300000 4654 ns/op 408 B/op 7 allocs/op
BenchmarkTree/IPv4/Lookup/Realisitic_Random_Networks/1-8 500000 4986 ns/op 408 B/op 7 allocs/op
BenchmarkTree/IPv4/Lookup/Realisitic_Random_Networks/10-8 300000 4978 ns/op 408 B/op 7 allocs/op
BenchmarkTree/IPv4/Lookup/Realisitic_Random_Networks/100-8 200000 5711 ns/op 408 B/op 7 allocs/op
BenchmarkTree/IPv4/Lookup/Realisitic_Random_Networks/1000-8 200000 5399 ns/op 408 B/op 7 allocs/op
BenchmarkTree/IPv4/Lookup/Realisitic_Random_Networks/10000-8 200000 5719 ns/op 412 B/op 7 allocs/op
BenchmarkTree/IPv4/Lookup/Realisitic_Random_Networks/100000-8 300000 4738 ns/op 425 B/op 10 allocs/op
BenchmarkTree/IPv4/Lookup/Realisitic_Random_Networks/1000000-8 100000 13806 ns/op 654 B/op 13 allocs/op
BenchmarkTree/IPv4/Lookup/Random_Networks/1-8 500000 5120 ns/op 408 B/op 7 allocs/op
BenchmarkTree/IPv4/Lookup/Random_Networks/10-8 300000 5549 ns/op 408 B/op 7 allocs/op
BenchmarkTree/IPv4/Lookup/Random_Networks/100-8 300000 5479 ns/op 415 B/op 8 allocs/op
BenchmarkTree/IPv4/Lookup/Random_Networks/1000-8 200000 7534 ns/op 501 B/op 12 allocs/op
BenchmarkTree/IPv4/Lookup/Random_Networks/10000-8 100000 18245 ns/op 1268 B/op 15 allocs/op
BenchmarkTree/IPv4/Lookup/Random_Networks/100000-8 10000 159465 ns/op 11914 B/op 18 allocs/op
BenchmarkTree/IPv4/Lookup/Random_Networks/1000000-8 500 2589345 ns/op 111124 B/op 19 allocs/op
BenchmarkTree/IPv4/Insertion/Consequitive_addresses/1000000-8 3000000 458 ns/op 262 B/op 0 allocs/op
BenchmarkTree/IPv4/Insertion/Random_addresses/1000000-8 1000000 1287 ns/op 222 B/op 0 allocs/op
BenchmarkTree/IPv4/Insertion/Random_networks/1000000-8 2000000 827 ns/op 111 B/op 0 allocs/op
And here is the benchmark:
func BenchmarkTree(b *testing.B) {
b.Run("IPv4", func(b *testing.B) {
b.Run("Lookup", func(b *testing.B) {
b.Run("Random addresses", func(b *testing.B) {
for n := 1; n <= 1000000; n *= 10 {
n := n
tree, cidrs := RandTreeV4(b, n, RandIPv4)
b.Run(fmt.Sprintf("%d", len(cidrs)), func(b *testing.B) {
b.ReportAllocs()
for n := 0; n < b.N; n++ {
// Pick a random CIDR that was inserted
ix := int(rand.Int63n(int64(len(cidrs))))
cidr := cidrs[ix]
tags, err := tree.FindTags(*cidr)
require.NoError(b, err)
require.NotEqual(b, len(tags), 0)
}
})
}
})
b.Run("Realisitic Random Networks", func(b *testing.B) {
for n := 1; n <= 1000000; n *= 10 {
n := n
tree, cidrs := RandTreeV4(b, n, RandRealisticCIDRv4)
b.Run(fmt.Sprintf("%d", len(cidrs)), func(b *testing.B) {
b.ReportAllocs()
for n := 0; n < b.N; n++ {
// Pick a random CIDR that was inserted
ix := int(rand.Int63n(int64(len(cidrs))))
cidr := cidrs[ix]
tags, err := tree.FindTags(*cidr)
require.NoError(b, err)
require.NotEqual(b, len(tags), 0)
}
})
}
})
b.Run("Random Networks", func(b *testing.B) {
for n := 1; n <= 1000000; n *= 10 {
n := n
tree, cidrs := RandTreeV4(b, n, RandCIDRv4)
b.Run(fmt.Sprintf("%d", len(cidrs)), func(b *testing.B) {
b.ReportAllocs()
for n := 0; n < b.N; n++ {
// Pick a random CIDR that was inserted
ix := int(rand.Int63n(int64(len(cidrs))))
cidr := cidrs[ix]
tags, err := tree.FindTags(*cidr)
require.NoError(b, err)
require.NotEqual(b, len(tags), 0)
}
})
}
})
})
b.Run("Insertion", func(b *testing.B) {
cidr, _, err := patricia.ParseIPFromString("1.2.3.4")
require.NoError(b, err)
b.Run("Consequitive addresses/1000000", func(b *testing.B) {
b.ReportAllocs()
tree := uint8_tree.NewTreeV4()
for n := 0; n < b.N; n++ {
cidr.Address += 1
tree.Set(*cidr, 0)
}
})
b.Run("Random addresses/1000000", func(b *testing.B) {
b.ReportAllocs()
tree := uint8_tree.NewTreeV4()
for n := 0; n < b.N; n++ {
cidr.Address = rand.Uint32()
tree.Set(*cidr, 0)
}
})
b.Run("Random networks/1000000", func(b *testing.B) {
b.ReportAllocs()
tree := uint8_tree.NewTreeV4()
for n := 0; n < b.N; n++ {
cidr.Address = rand.Uint32()
cidr.Length = 1 + (uint(rand.Uint32()) % uint(32))
tree.Set(*cidr, 0)
}
})
})
})
}
func RandIPv4() string {
return net.IPv4(byte(rand.Uint32()), byte(rand.Uint32()), byte(rand.Uint32()), byte(rand.Uint32())).String()
}
func RandCIDRv4() string {
return fmt.Sprintf("%s/%d", RandIPv4(), 1+rand.Uint32()%32)
}
func RandRealisticCIDRv4() string {
return fmt.Sprintf("%s/%d", RandIPv4(), 10+(rand.Uint32()%23))
}
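(The RandTreeV4 helper isn't shown above. Purely as a sketch of what such a helper might look like - an assumption based only on how it is called, not the actual harness, and assuming NewTreeV4() returns a *uint8_tree.TreeV4 - it could be something like:)
// Hypothetical sketch of the missing RandTreeV4 helper: build a tree with n
// entries produced by randEntry and return the tree together with the parsed
// CIDRs that were inserted, so the lookup benchmarks can query known entries.
func RandTreeV4(b *testing.B, n int, randEntry func() string) (*uint8_tree.TreeV4, []*patricia.IPv4Address) {
	tree := uint8_tree.NewTreeV4()
	cidrs := make([]*patricia.IPv4Address, 0, n)
	for i := 0; i < n; i++ {
		v4, _, err := patricia.ParseIPFromString(randEntry())
		require.NoError(b, err)
		tree.Set(*v4, uint8(i)) // the tag value is arbitrary here
		cidrs = append(cidrs, v4)
	}
	return tree, cidrs
}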
What do you think?
Thanks @Julio-Guerra, I really appreciate this! I'm looking into it.
- Blake
@Julio-Guerra Can you provide the rest of your test harness? I wrote my own RandTreeV4, but am still only seeing 3 allocations/op, or 2 allocations/op if I remove the require statements. I'd like to run exactly what you're running. Thanks!
@Julio-Guerra - when you say "~400 bytes on average are allocated during the lookup", what do you mean by "so added to the size of the tree"?
I'm not asserting that tree searching couldn't be made more efficient, but any allocations made during searching aren't added to the tree. That's either garbage or the returned slice that eventually falls out of scope and is reclaimed. Am I missing something?
Hello @wblakecaldwell
I must admit I totally forgot about this post here ^^
I found where the lookup memory allocation came from: testify assertions I was doing in the benchmark loops...
I wrote benchmarks for insertion, lookups and overall tree size and the results are totally fine now. Note that for the "Size" results, a benchmark "op" is the creation of a tree with the given size (1 to 1,000,000).
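(For reference, a sketch of how the timed loop can keep testify out of the measurement - benchLookups is a made-up helper name, and this illustrates the kind of change described, not necessarily the exact updated benchmark:)
// Sketch: do the lookup check with plain comparisons inside the timed loop
// and only go through the testing API on failure, so the assertion machinery
// no longer allocates on every iteration.
func benchLookups(b *testing.B, tree *uint8_tree.TreeV4, cidrs []*patricia.IPv4Address) {
	b.ReportAllocs()
	for n := 0; n < b.N; n++ {
		cidr := cidrs[rand.Intn(len(cidrs))]
		tags, err := tree.FindTags(*cidr)
		if err != nil || len(tags) == 0 {
			b.Fatalf("lookup failed for %v: %v", *cidr, err)
		}
	}
}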
Here are the results on my machine:
goos: linux
goarch: amd64
BenchmarkTree/IPv4/Lookup/Random_addresses/1-8 30000000 38.1 ns/op 0 B/op 0 allocs/op
BenchmarkTree/IPv4/Lookup/Random_addresses/10-8 20000000 79.7 ns/op 0 B/op 0 allocs/op
BenchmarkTree/IPv4/Lookup/Random_addresses/100-8 20000000 104 ns/op 0 B/op 0 allocs/op
BenchmarkTree/IPv4/Lookup/Random_addresses/1000-8 10000000 160 ns/op 0 B/op 0 allocs/op
BenchmarkTree/IPv4/Lookup/Random_addresses/10000-8 5000000 240 ns/op 0 B/op 0 allocs/op
BenchmarkTree/IPv4/Lookup/Random_addresses/100000-8 2000000 593 ns/op 0 B/op 0 allocs/op
BenchmarkTree/IPv4/Lookup/Random_addresses/1000000-8 1000000 1348 ns/op 0 B/op 0 allocs/op
BenchmarkTree/IPv4/Lookup/Realisitic_Random_Networks/1-8 50000000 33.3 ns/op 0 B/op 0 allocs/op
BenchmarkTree/IPv4/Lookup/Realisitic_Random_Networks/10-8 20000000 85.6 ns/op 0 B/op 0 allocs/op
BenchmarkTree/IPv4/Lookup/Realisitic_Random_Networks/100-8 20000000 110 ns/op 0 B/op 0 allocs/op
BenchmarkTree/IPv4/Lookup/Realisitic_Random_Networks/1000-8 10000000 159 ns/op 0 B/op 0 allocs/op
BenchmarkTree/IPv4/Lookup/Realisitic_Random_Networks/10000-8 5000000 251 ns/op 0 B/op 0 allocs/op
BenchmarkTree/IPv4/Lookup/Realisitic_Random_Networks/100000-8 3000000 579 ns/op 0 B/op 0 allocs/op
BenchmarkTree/IPv4/Lookup/Realisitic_Random_Networks/1000000-8 1000000 1130 ns/op 0 B/op 0 allocs/op
BenchmarkTree/IPv4/Lookup/Random_Networks/1-8 30000000 33.8 ns/op 0 B/op 0 allocs/op
BenchmarkTree/IPv4/Lookup/Random_Networks/10-8 20000000 76.2 ns/op 0 B/op 0 allocs/op
BenchmarkTree/IPv4/Lookup/Random_Networks/100-8 20000000 109 ns/op 0 B/op 0 allocs/op
BenchmarkTree/IPv4/Lookup/Random_Networks/1000-8 10000000 189 ns/op 0 B/op 0 allocs/op
BenchmarkTree/IPv4/Lookup/Random_Networks/10000-8 5000000 277 ns/op 0 B/op 0 allocs/op
BenchmarkTree/IPv4/Lookup/Random_Networks/100000-8 3000000 413 ns/op 0 B/op 0 allocs/op
BenchmarkTree/IPv4/Lookup/Random_Networks/1000000-8 1000000 1068 ns/op 0 B/op 0 allocs/op
BenchmarkTree/IPv4/Insertion/Consequitive_addresses-8 5000000 411 ns/op 312 B/op 0 allocs/op
BenchmarkTree/IPv4/Insertion/Random_addresses-8 1000000 1031 ns/op 222 B/op 0 allocs/op
BenchmarkTree/IPv4/Insertion/Random_networks-8 3000000 675 ns/op 131 B/op 0 allocs/op
BenchmarkTree/IPv4/Size/Consequitive_addresses/1-8 3000000 409 ns/op 528 B/op 5 allocs/op
BenchmarkTree/IPv4/Size/Consequitive_addresses/10-8 500000 2258 ns/op 2563 B/op 8 allocs/op
BenchmarkTree/IPv4/Size/Consequitive_addresses/100-8 50000 29411 ns/op 24068 B/op 27 allocs/op
BenchmarkTree/IPv4/Size/Consequitive_addresses/1000-8 5000 242016 ns/op 217435 B/op 86 allocs/op
BenchmarkTree/IPv4/Size/Consequitive_addresses/10000-8 500 3247224 ns/op 3049317 B/op 337 allocs/op
BenchmarkTree/IPv4/Size/Consequitive_addresses/100000-8 50 25945610 ns/op 24589887 B/op 4014 allocs/op
BenchmarkTree/IPv4/Size/Consequitive_addresses/1000000-8 3 395557315 ns/op 222662578 B/op 38336 allocs/op
BenchmarkTree/IPv4/Size/Random_addresses/1-8 5000000 366 ns/op 528 B/op 5 allocs/op
BenchmarkTree/IPv4/Size/Random_addresses/10-8 500000 2641 ns/op 2563 B/op 8 allocs/op
BenchmarkTree/IPv4/Size/Random_addresses/100-8 50000 29084 ns/op 24068 B/op 27 allocs/op
BenchmarkTree/IPv4/Size/Random_addresses/1000-8 5000 368916 ns/op 217428 B/op 86 allocs/op
BenchmarkTree/IPv4/Size/Random_addresses/10000-8 300 4266788 ns/op 3049333 B/op 337 allocs/op
BenchmarkTree/IPv4/Size/Random_addresses/100000-8 20 54399927 ns/op 24596554 B/op 4033 allocs/op
BenchmarkTree/IPv4/Size/Random_addresses/1000000-8 1 1044954541 ns/op 222687488 B/op 38597 allocs/op
BenchmarkTree/IPv4/Size/Random_networks/1-8 3000000 434 ns/op 528 B/op 5 allocs/op
BenchmarkTree/IPv4/Size/Random_networks/10-8 500000 3042 ns/op 2563 B/op 8 allocs/op
BenchmarkTree/IPv4/Size/Random_networks/100-8 50000 35549 ns/op 24069 B/op 27 allocs/op
BenchmarkTree/IPv4/Size/Random_networks/1000-8 5000 343440 ns/op 217438 B/op 86 allocs/op
BenchmarkTree/IPv4/Size/Random_networks/10000-8 300 4314723 ns/op 1738457 B/op 335 allocs/op
BenchmarkTree/IPv4/Size/Random_networks/100000-8 20 88186177 ns/op 14109788 B/op 4023 allocs/op
BenchmarkTree/IPv4/Size/Random_networks/1000000-8 1 2524714050 ns/op 222674552 B/op 38461 allocs/op
Note that I only benchmarked the subset of the package I use.
And I still think the tree size could be improved: the 1,000,000 and 100,000 cases give me more than 212 MB and 23 MB respectively. I tried to make sure the random IP creation doesn't allocate memory, so that the results only show what it takes to create the tree.
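(One way to keep the harness itself allocation-free - an assumption about the updated benchmark, not the original code - is to parse a template address once and overwrite its fields in the loop, the same trick the insertion benchmarks above already use. The helper name fillRandomV4 is made up for illustration.)
// Sketch: reuse one parsed IPv4Address and overwrite its fields with random
// values, so generating entries never allocates and the measured cost is
// only the tree itself.
func fillRandomV4(b *testing.B, tree *uint8_tree.TreeV4, treeSize int) {
	cidr, _, err := patricia.ParseIPFromString("1.2.3.4")
	require.NoError(b, err)
	for i := 0; i < treeSize; i++ {
		cidr.Address = rand.Uint32()
		cidr.Length = 1 + uint(rand.Uint32())%32
		tree.Set(*cidr, 0)
	}
}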
Ya, for sure. Everything can always be improved! From your benchmarks, it looks like there's a less-than-linear increase in memory usage. Not bad! :)
How are you coming up with those numbers? Figuring out how much memory is in use can be really tricky.
For example:
BenchmarkTree/IPv4/Size/Random_networks/1000000-8 1 2524714050 ns/op 222674552 B/op 38461 allocs/op
It uses 222,674,552 bytes, so ~212 MB.
I ran it again with the memory profiler and here are some details:
(pprof) top
Showing nodes accounting for 212.03MB, 99.29% of 213.55MB total
Dropped 18 nodes (cum <= 1.07MB)
flat flat% sum% cum cum%
160.88MB 75.33% 75.33% 212.03MB 99.29% github.com/kentik/patricia/uint8_tree.(*TreeV4).add
51.16MB 23.95% 99.29% 51.16MB 23.95% github.com/kentik/patricia/uint8_tree.(*TreeV4).addTag
The line-based output shows it comes from the slice of nodes :-) So 1,000,000 x sizeof(treeNodeV4) basically.
Can you check this test code and the commands you're running into a repo somewhere? I’ll poke at the code and see what I can come up with.
Awesome. I’ll get back to ya. Thanks!
I found some unused stuff lying around from an old implementation that was wasting 9 bytes per IPv6 node:
9d35c39
Are you using pprof memory profiling on your benchmarks? I think that gives good comparative results, but shows you the total memory allocated across all iterations. In your case, this is moot, since the text output of your benchmark shows that in this one test, there was only one iteration.
I just tried it for this code:
func BenchmarkMem(b *testing.B) {
// allocate a new 100-element slice each iteration; each one immediately becomes garbage
mySlice := make([]uint64, 0)
for n := 0; n < b.N; n++ {
mySlice = make([]uint64, 100)
mySlice[0] = 1
}
}
Even if that were live code doing all of that in a loop: yes, it allocated 9.24GB of memory, and yes, in a long-running service your system would show your binary holding that memory (at least briefly). But if there's memory pressure on the system, or the GC is feeling nice, almost all of it will be released back to the system.
I think I can explain what you're seeing.
Since I'm using a slice to hold the nodes, and when needed, I'm increasing its capacity by 2x (just like append does), you could have an extra (40 bytes) x (node_count), where node_count can be up to 2x the number of entries in the tree if it's packed pretty tight (every node has 2 children). This capacity-growing could also explain why you're seeing a less-than-linear increase in memory - if you added another entry, you'd see zero increase in memory, as it would just use the already-allocated additional capacity.
If you look at the allocations for add(...), you'll notice the doubling. We start with 96 bytes, go up to 11.69MB, then a doubling to 23.38MB, then again to 40MB and 80MB. Given that the profiler is showing the total number of allocations, and not the total amount of memory in use, I expect that the nodes slice is only using 80MB (the largest allocation that add(...) is responsible for), not 160MB (the sum of all allocations by add(...)). The additional 80MB is put back in the heap, and will be freed to the system by the runtime if needed.
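(To illustrate the mechanism, here's a simplified sketch. The node layout below is an assumption chosen to come out at 40 bytes on amd64, not the library's actual treeNodeV4, and ensureRoom is a made-up name.)
// Simplified sketch of the grow-by-2x behavior described above.
type fakeNode struct {
	left, right uint   // child indexes into the nodes slice
	prefix      uint32 // address bits for this node
	prefixLen   uint   // number of significant prefix bits
	tagCount    int    // how many tags hang off this node
}
// ensureRoom returns a slice with room for at least one more node, doubling
// the capacity when full. The old backing array becomes garbage: the profiler
// still counts it as allocated bytes even though it is no longer in use.
func ensureRoom(nodes []fakeNode) []fakeNode {
	if len(nodes) < cap(nodes) {
		return nodes
	}
	newCap := 2 * cap(nodes)
	if newCap == 0 {
		newCap = 1
	}
	bigger := make([]fakeNode, len(nodes), newCap)
	copy(bigger, nodes)
	return bigger
}
With 40-byte nodes, growing from 1M to 2M capacity would be exactly the 40MB and 80MB allocations seen in the profile above.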
This is where add(...) is allocating that 160MB:
https://github.com/kentik/patricia/blob/master/uint8_tree/tree_v4.go#L151
This is where addTag(...) is adding to the map:
https://github.com/kentik/patricia/blob/master/uint8_tree/tree_v4.go#L65
Now, as for what addTag(...) is doing... I haven't reviewed the internals of Go's map. Either it's still holding the full 51.16MB or it's not. I need to use a map (for reasons explained in the README), and I'm already using efficient uint64 keys, so this is as good as it's going to get. The map is map[uint64]uint8, which means that if perfectly efficient, adding a tag will add 8 bytes for the key plus 1 byte for the value, for a total of 9 bytes. As far as I can tell from your test code, you're only adding one tag per IP address, so for 1M tags, I expect to see 9MB instead of 51.16MB. If we look at the actual allocations under addTag(...), we notice the same doubling behavior as in add(...). I suspect this is the map adding capacity as entries are added to it, just like when adding capacity to a slice. I think it's safe to say that the map is actively holding onto between 23.38MB and 51.16MB of memory, with 9MB required for the keys and values. So, we're looking at somewhere between 14.38MB and 42.16MB of overhead (my money is on somewhere between 14.38MB and 20MB). There's 1M tags, so that's 14-42 bytes of overhead per tag (and again, my money is on 14-20 bytes per tag). Maps will always have overhead. What we're seeing here seems totally reasonable for such a general storage object.
We have instances of this tree storing tens of millions of IP addresses, and for us, the memory usage is acceptable. It's trading a little bit of memory to prevent the GC from having to peg 1-2 cores while scanning millions of pointers. See the README for details.
So, what can be done to improve this?
First option is a little unreasonable: if you knew how many nodes you'd need, I could add a new tree constructor that takes that number and allocates the nodes array to that capacity. You're not likely to know, though, how many nodes you'll need, since it's entirely dependent on how the tree is packed. If your data set is static, you could build a tree, ask it for the node count, and then rebuild the tree to that size, but that seems like more trouble than it's worth. I suppose I could just ask how many unique IP addresses you'll be adding and multiply that by 1.5, but that still seems hacky.
Second option would certainly use less memory, but adding to the tree would be very expensive. Rather than doubling capacity of the nodes slice when we run out of room, we could just add enough room for one more. I don't see this as reasonable either.
Third option is a middle-ground, and viable for situations where your data set is mostly static, and memory is precious. I could add a Compact() method to the tree that would rebuild its nodes array to the exact size that it requires. You'd see a temporary doubling of memory during this time, but the old slice would be given back to the heap, and eventually to the system, if needed. I could also do this with the map - allocating with enough room for the number of items it currently has. Note that this option would appear worse in your memory profiling, because more memory is allocated, but again, it's not showing in-use memory.
I'm in favor of the 3rd option for specialized cases. I'll try it out in a bit, and see if I can get some accurate performance stats on it.
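(Roughly, the idea might look like this - a sketch against a stand-in tree type, not a committed API; compactableTree and its nodes field are assumptions:)
// Rough sketch of the proposed Compact(): reallocate the node slice with
// len == cap so the slack left over from doubling can be reclaimed. The type
// uses the same illustrative fakeNode as the earlier sketch.
type compactableTree struct {
	nodes []fakeNode
}

func (t *compactableTree) Compact() {
	if len(t.nodes) == cap(t.nodes) {
		return // already tight
	}
	exact := make([]fakeNode, len(t.nodes)) // temporary doubling while both slices exist
	copy(exact, t.nodes)
	t.nodes = exact // the old, oversized backing array is now garbage
}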
TLDR for anyone that didn't read my previous post:
The profiler shows the sum of all allocations, not the total amount of memory in use. I suspect the tree in this example was using half of what's reported (~100MB instead of 212MB). Each node requires 40 bytes, each tag requires 9 (8 for the map key, 1 for the uint8 value). There are necessarily more nodes than IP addresses. Let's assume 1.5M nodes for the 1M IP addresses, so: 1,500,000 x 40 + 1,000,000 x 9 = 69MB + (slice/map capacity overhead)
I'm going to close this for now, as it's my understanding that the tree was using exactly half as much memory as the profiler was reporting, since the profiler reports allocations, not in-use memory. I'll add a Compact() method at some point, which will tighten up the tree once populated.
@Julio-Guerra - Lemme know if you disagree with my assessment, or have any questions or comments about the Compact() plan. Thanks again for the feedback!
@wblakecaldwell Thanks for having a closer look at it.
Yes, I agree my benchmark results give the sum of everything, which is what I consider to be the memory pressure in these situations.
Regarding Compact(), I am not sure I would use it, since it would double the memory pressure and we are already in a worst case here. But it's worth trying, I guess.
Second option would certainly use less memory, but adding to the tree would be very expensive. Rather than doubling capacity of the nodes slice when we run out of room, we could just add enough room for one more. I don't see this as reasonable either.
I'm currently looking at the memory usage and wanted to share a thought. Currently, when add() exceeds cap(t.nodes), it creates a new copy with 2x the capacity.
I've been testing changing that from 2x to 1.1x. There will be more grow events, but in the end I expect less wasted capacity.
In my own work I can often know how many nodes I plan to add to the tree, so I'm also experimenting with updating the New() method to allow passing in the initial size.
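(Sketched with the same stand-in types as above - newTreeWithHint and growGently are made-up names, and the real New() and grow logic live inside the library - the two experiments might look something like:)
// newTreeWithHint pre-sizes the node slice so a workload of known size
// never triggers a grow/copy.
func newTreeWithHint(nodeHint int) *compactableTree {
	return &compactableTree{nodes: make([]fakeNode, 0, nodeHint)}
}

// growGently grows by roughly 1.1x instead of 2x: more grow events and more
// copying, but far less slack capacity left over once insertion stops.
func (t *compactableTree) growGently() {
	newCap := cap(t.nodes) + cap(t.nodes)/10 + 1
	bigger := make([]fakeNode, len(t.nodes), newCap)
	copy(bigger, t.nodes)
	t.nodes = bigger
}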