lmdb: Reduce number of heap allocations of MDB_val
lmb opened this issue · 21 comments
I'm currently looking at allocation counts in the codebase I am working on, and I noticed that lmdb-go allocates short lived MDB_val objects, e.g. on every call to Txn.Get.
Would it be possible / safe to cache one or two MDB_val in Txn, and re-use them in Txn.Get and friends? I think there might be an issue with RawRead Txns, but I'm not sure I understand the code base sufficiently.
If this sounds like it could work I'd be happy to contribute a PR. It is not a straight-up performance win, but it should reduce pressure on the GC when issuing lots and lots of queries.
Hrm. Storing pointers to two MDB_vals in the Txn object sounds like it might speed things up a little. I don't think RawRead should really be an issue, though to be prudent it might be best to zero out the MDB_val after each operation in which it is used (e.g. *txn.key = C.MDB_val{}).
Please go ahead and attempt this change. I am going to merge in #57 when I fix the travis-ci job. But otherwise I would like to see what happens when this change is made.
Worth noting, though, that at some point in the future I believe this optimization will become unnecessary. I think the pointer-passing restrictions introduced in Go 1.6 should allow pointer arguments to be treated as non-escaping. But I'm not sure when/if that is really going to happen.
An interesting side effect of the proposed change is that data races from concurrent Txn use (e.g. on a Readonly Txn) would be caught by the Go race detector. In the current implementation the only data races occur in C space, without write barriers, and are invisible to the race detector.
@lmb are you working on putting together a pull request for this? Have you been having any difficulty that I could help with?
My curiosity got the best of me and I quickly threw together something that works as a proof of concept. Indeed it looks like it is possible to benefit from the suggested change, especially during transactions with lots of reads (e.g. a range scan). Turn on RawRead and you are flying, my friend (the current benchmark shows a 20% reduction in scan time).
After this is settled I definitely think it will be worth revisiting the performance numbers using integer flags (#11).
edit: It's worth noting that write performance isn't really affected because MDB_val structs for writing are stored on the C stack.
Oh, of course. How rude of me. Weekends and holidays especially can fail to register with me. Forgive me, and don't worry about it. I'm glad you enjoyed yourself 😄
I pushed a commit to the branch bmatsuo/reuse-mdb-val. It is the sum of about 30 minutes of work, so it is rough.
All transactions allocate two MDB_vals when they are created. I think there is at least one important thing that could be improved. Specifically, update transactions that will only issue Put commands do not need to allocate these MDB_vals (though PutReserve is special and requires one val), and the benchmarks in which this occurs are slightly slower as I have written things.
It's also worth considering that transactions opened to Get exactly one key will not get any faster from this change (there is no chance to reuse the allocated MDB_vals). And if you attempt some kind of lazy allocation strategy to accommodate write-only transactions (as mentioned in the previous paragraph), then you will make these single-Get transactions slower.
So there is a balancing act that must be done here. I think you really just need to think it through a bit and potentially go through several permutations of ideas in the code. Let me know if you have any thoughts or questions.
Cheers.
No worries, I'll take a look. Which invocation do you use for benchmarks?
Re the single Get use case: in my opinion this will always be the worst case, since you are incurring the allocation overhead for a txn every single time (this includes acquiring a semaphore and writing to the lockfile). While it's important to not make this dog slow, anyone trying to seriously use lmdb from Go has to batch operations anyway. Thoughts?
When I run the benchmarks I use go test -run=None -bench=. and in this case adding the -benchmem flag may be good too. If I happen to be feeling paranoid about runtime variance I will use -benchtime=2s (the default is 1s).
You are mostly correct re: a single Get, though I think it is probably quite common in practice. The overhead of any single-op transaction may outweigh the allocation overhead.
This has got me thinking though, one thing I made no attempt to handle was subtransactions. It should actually be completely safe for subtransactions to use their parent's MDB_val objects (parents must never be used while they have an active child). Subtransactions provide a batching mechanism for updates that is potentially quite useful for some applications. And it would be nice for them to benefit from reduced allocations.
Oh. And if you don't know about the benchcmp tool then you should be using that to compare benchmark results for different code versions. Just figured you should know if you didn't :)
I implemented your change, which gets rid of the allocation cost for subtransactions. Regarding slowing down Get, this is what I get when benchmarking the same code twice on my OS X box:
lorenz@antares ~/go/src/github.com/bmatsuo/lmdb-go (reuse-mdb-val)*
$ benchcmp new.txt new2.txt
benchmark old ns/op new ns/op delta
BenchmarkEnv_ReaderList-4 59673 58537 -1.90%
BenchmarkTxn_Put-4 2073 2010 -3.04%
BenchmarkTxn_PutReserve-4 2805 1847 -34.15%
BenchmarkTxn_PutReserve_writemap-4 2101 1525 -27.42%
BenchmarkTxn_Put_writemap-4 2855 1632 -42.84%
BenchmarkTxn_Get_ro-4 1423 1446 +1.62%
BenchmarkTxn_Get_raw_ro-4 1075 855 -20.47%
BenchmarkScan_ro-4 7488830 5804052 -22.50%
BenchmarkScan_raw_ro-4 1710047 1299006 -24.04%
BenchmarkCursor-4 1118 829 -25.85%
BenchmarkCursor_Renew-4 198 196 -1.01%
BenchmarkTxn_Sub_commit-4 293740 149557 -49.09%
BenchmarkTxn_Sub_abort-4 281365 145835 -48.17%
BenchmarkTxn_abort-4 19417 16261 -16.25%
BenchmarkTxn_commit-4 17269 16284 -5.70%
BenchmarkTxn_ro-4 122001 124584 +2.12%
BenchmarkTxn_unmanaged_abort-4 16617 16634 +0.10%
BenchmarkTxn_unmanaged_commit-4 16623 16808 +1.11%
BenchmarkTxn_unmanaged_ro-4 113743 120440 +5.89%
BenchmarkTxn_renew-4 620 575 -7.26%
BenchmarkTxn_Put_append-4 789 808 +2.41%
BenchmarkTxn_Put_append_noflag-4 1013 1021 +0.79%
benchmark old allocs new allocs delta
BenchmarkEnv_ReaderList-4 113 113 +0.00%
BenchmarkTxn_Put-4 4 4 +0.00%
BenchmarkTxn_PutReserve-4 2 2 +0.00%
BenchmarkTxn_PutReserve_writemap-4 2 2 +0.00%
BenchmarkTxn_Put_writemap-4 4 4 +0.00%
BenchmarkTxn_Get_ro-4 3 3 +0.00%
BenchmarkTxn_Get_raw_ro-4 3 2 -33.33%
BenchmarkScan_ro-4 8152 8127 -0.31%
BenchmarkScan_raw_ro-4 5 5 +0.00%
BenchmarkCursor-4 3 3 +0.00%
BenchmarkCursor_Renew-4 0 0 +0.00%
BenchmarkTxn_Sub_commit-4 3 3 +0.00%
BenchmarkTxn_Sub_abort-4 3 3 +0.00%
BenchmarkTxn_abort-4 5 5 +0.00%
BenchmarkTxn_commit-4 5 5 +0.00%
BenchmarkTxn_ro-4 5 5 +0.00%
BenchmarkTxn_unmanaged_abort-4 5 5 +0.00%
BenchmarkTxn_unmanaged_commit-4 5 5 +0.00%
BenchmarkTxn_unmanaged_ro-4 5 5 +0.00%
BenchmarkTxn_renew-4 0 0 +0.00%
BenchmarkTxn_Put_append-4 5 5 +0.00%
BenchmarkTxn_Put_append_noflag-4 5 5 +0.00%
benchmark old bytes new bytes delta
BenchmarkEnv_ReaderList-4 7408 7408 +0.00%
BenchmarkTxn_Put-4 96 96 +0.00%
BenchmarkTxn_PutReserve-4 48 48 +0.00%
BenchmarkTxn_PutReserve_writemap-4 48 48 +0.00%
BenchmarkTxn_Put_writemap-4 96 96 +0.00%
BenchmarkTxn_Get_ro-4 1149 1158 +0.78%
BenchmarkTxn_Get_raw_ro-4 66 65 -1.52%
BenchmarkScan_ro-4 9106023 9202313 +1.06%
BenchmarkScan_raw_ro-4 72 72 +0.00%
BenchmarkCursor-4 33 33 +0.00%
BenchmarkCursor_Renew-4 0 0 +0.00%
BenchmarkTxn_Sub_commit-4 65 65 +0.00%
BenchmarkTxn_Sub_abort-4 65 65 +0.00%
BenchmarkTxn_abort-4 97 97 +0.00%
BenchmarkTxn_commit-4 97 97 +0.00%
BenchmarkTxn_ro-4 97 97 +0.00%
BenchmarkTxn_unmanaged_abort-4 97 97 +0.00%
BenchmarkTxn_unmanaged_commit-4 97 97 +0.00%
BenchmarkTxn_unmanaged_ro-4 97 97 +0.00%
BenchmarkTxn_renew-4 0 0 +0.00%
BenchmarkTxn_Put_append-4 104 104 +0.00%
BenchmarkTxn_Put_append_noflag-4 104 104 +0.00%
Seems like the timing info is very unreliable.
I also looked at reducing the number of allocations done in a Put, but no such luck. Do you know why there are 4 allocations happening? I would expect maybe 2 allocations, for the two unsafe.Pointer conversions.
I believe that the unexpected allocations you see are happening in the BenchmarkTxn_Put function itself (in bench_test.go). It is unfortunate that it throws the numbers off.
It's always good to verify tests still pass 😄 But indeed the timing is pretty erratic. And this doesn't seem too different from what I would expect. As long as the relevant benchmarks go down, that is good. But you typically do see weird spikes on unrelated bits. Like this:
BenchmarkCursor-4 1118 829 -25.85%
No idea why that should be ~300ns faster now.
But you also see a similar reduction in scan time. That is good.
As a general word of caution though, it can be good to take a look at the benchmark to know what is really being timed. Some things may not work as you expect them to (also evidenced by your question about Put allocations).
In the case of BenchmarkScan_raw_ro, transaction initialization and cleanup are not being counted. It is worth knowing.
Also out of curiosity, what os/arch are you testing on? And are you in a VM?
OS X 10.11.4, x86_64, with the kernel pprof fix applied. How do you want to proceed with this? I can create a PR, but frankly you've done the work. I just added 4 lines for the subtxn case.
I did look at the Put benchmark:
(pprof) list lmdb.BenchmarkTxn_Put.func1
Total: 506370
ROUTINE ======================== github.com/bmatsuo/lmdb-go/lmdb.BenchmarkTxn_Put.func1 in /Users/lorenz/go/src/github.com/bmatsuo/lmdb-go/lmdb/bench_test.go
0 440425 (flat, cum) 86.98% of Total
. . 77: err = env.Update(func(txn *Txn) (err error) {
. . 78: b.ResetTimer()
. . 79: defer b.StopTimer()
. . 80: for i := 0; i < b.N; i++ {
. . 81: k := ps[rand.Intn(len(ps)/2)*2]
. 21 82: v := makeBenchDBVal(&rc)
. 440404 83: err := txn.Put(dbi, k, v, 0)
. . 84: if err != nil {
. . 85: return err
. . 86: }
. . 87: }
. . 88: return nil
If I'm interpreting b.ResetTimer() correctly, only the allocations on lines 82 & 83 should be taken into account for the benchmem output. This means that something like 99% of the 4 allocations per Put are really allocated in Put?
I think you can go ahead and open a PR with subtxn stuff. The commit I made is tagged with name/email. I don't think there's an attribution issue or anything.
I'm confused about that pprof output. Is it a memprofile? I actually have never been able to make effective use of a memprofile output for some reason.
Yeah, that's a memprofile output, with -alloc_objects passed to go tool pprof. It shows the number of allocations done per line.
Oh, interesting. Maybe that flag is what has been keeping -memprofile from being useful to me... Thanks for the info.
I think I understand what you meant in your comment now. But I don't think the interpretation makes sense given the data. If it is the correct interpretation then I suspect there is a bug in profiling related to cgo. Let's look at the following two lines.
. 21 82: v := makeBenchDBVal(&rc)
. 440404 83: err := txn.Put(dbi, k, v, 0)
So, makeBenchDBVal allocated 21 objects (not 21 bytes, 21 slices). And txn.Put allocated 20,000X more objects. But, because one call always follows the other that means that txn.Put allocated 20000 objects per iteration on average. But as we've seen the numbers say that the benchmark allocated 4 objects per iteration. So my conclusion has to be that somewhere an assumption of how the tools work is wrong or there is a bug in one of the tools.
The number for txn.Put isn't the sum of all calls to Txn.Put, is it? It is the sum of calls for which lmdb.BenchmarkTxn_Put.func1 was directly below it in the stack (potentially even more granular than that). Right? If I'm wrong about that then I'm not sure what to take away from the snippet you have provided.
Oh. I think pprof may be counting allocations using malloc in C as well. If you look at the numbers using -alloc_space you see the size allocated in txn.Put is ~98MB. This may be approximately equal to the total amount of data in the database at the end of the benchmark. (edit: that is probably a bogus statement --- but I think the C malloc thing may be true)
The confusing bit is that the output from go test only includes allocations between ResetTimer and StopTimer, but the memprofile includes all allocations.
I removed the makeBenchDBVal call from the benchmark loop, and go test still reports 4 allocations per op. Looking at the memory profile shows that all allocations happen at txn.Put. The numbers for txn.Put (in BenchmarkTxn_Put.func1) do not match up with Txn.Put since we populate the DB prior to running the benchmark.
Maybe this should be a separate issue?
lorenz@antares-3 ~/go/src/github.com/bmatsuo/lmdb-go (reuse-mdb-val)*
$ go test -run XXX -bench 'Put$' -benchmem -memprofile mem.profile -memprofilerate 1 ./lmdb
PASS
BenchmarkTxn_Put-4 200000 12771 ns/op 96 B/op 4 allocs/op
--- BENCH: BenchmarkTxn_Put-4
bench_test.go:446: initializing random source data
ok github.com/bmatsuo/lmdb-go/lmdb 3.695s
lorenz@antares-3 ~/go/src/github.com/bmatsuo/lmdb-go (reuse-mdb-val)*
$ go tool pprof -alloc_objects lmdb.test mem.profile
Entering interactive mode (type "help" for commands)
(pprof) list Txn_Put\.func1$
Total: 906361
ROUTINE ======================== github.com/bmatsuo/lmdb-go/lmdb.BenchmarkTxn_Put.func1 in /Users/lorenz/go/src/github.com/bmatsuo/lmdb-go/lmdb/bench_test.go
4 840408 (flat, cum) 92.72% of Total
. . 73: b.Errorf("populate db: %v", err)
. . 74: return
. . 75: }
. . 76:
. . 77: err = env.Update(func(txn *Txn) (err error) {
4 4 78: v := make([]byte, 100)
. . 79:
. . 80: b.ResetTimer()
. . 81: defer b.StopTimer()
. . 82: for i := 0; i < b.N; i++ {
. . 83: k := ps[rand.Intn(len(ps)/2)*2]
. . 84: // v := makeBenchDBVal(&rc)
. 840404 85: err := txn.Put(dbi, k, v, 0)
. . 86: if err != nil {
. . 87: return err
. . 88: }
. . 89: }
. . 90: return nil
(pprof) list \(\*Txn\)\.Put$
Total: 906361
ROUTINE ======================== github.com/bmatsuo/lmdb-go/lmdb.(*Txn).Put in /Users/lorenz/go/src/github.com/bmatsuo/lmdb-go/lmdb/txn.go
905940 905940 (flat, cum) 100% of Total
. . 314: vn := len(val)
. . 315: if vn == 0 {
. . 316: val = []byte{0}
. . 317: }
. . 318:
905940 905940 319: ret := C.lmdbgo_mdb_put2(
. . 320: txn._txn, C.MDB_dbi(dbi),
. . 321: unsafe.Pointer(&key[0]), C.size_t(kn),
. . 322: unsafe.Pointer(&val[0]), C.size_t(vn),
. . 323: C.uint(flags),
. . 324: )
Umm. Wait... So you made a change which should have reduced the number of allocs/op as reported by the benchmark output, and the number of allocs/op did not change? That sounds like a bug in the testing package, then. No?
This should probably be in a separate issue regardless.