Performance regression on Windows
After updating my dependencies, I saw a performance regression from 0.5 s to 1.8 s. I tracked it down to repeated calls to allocation_granularity() every time consume() is called on my buf_redux-created ring buffer (buf_redux updated its slice_deque dependency from 0.1 to 0.2). allocation_granularity() calls GetSystemInfo() on Windows, which is very slow (1.3 seconds of my 1.8-second run time).
Can you tell whether all of the calls to allocation_granularity() are necessary, or whether buf_redux is misusing the API? The stack trace looks like this:
```
5: slice_deque::mirrored::winapi::allocation_granularity
   at C:\Users\icooke\tmp\slice_deque\src\mirrored\winapi.rs:49
6: slice_deque::mirrored::buffer::Buffer<u8>::size_in_bytes<u8>
   at C:\Users\icooke\tmp\slice_deque\src\mirrored\buffer.rs:112
7: slice_deque::SliceDeque<u8>::move_head_unchecked<u8>
   at C:\Users\icooke\tmp\slice_deque\src\lib.rs:683
8: slice_deque::SliceDeque<u8>::move_head<u8>
   at C:\Users\icooke\tmp\slice_deque\src\lib.rs:735
9: buf_redux::buffer::slice_deque_buf::SliceDequeBuf::consume
   at C:\Users\icooke\tmp\buf_redux\src\buffer\slice_deque_buf.rs:81
10: buf_redux::buffer::BufImpl::consume
    at C:\Users\icooke\tmp\buf_redux\src\buffer\mod.rs:42
11: buf_redux::Buffer::consume
    at C:\Users\icooke\tmp\buf_redux\src\lib.rs:1175
12: buf_redux::{{impl}}::consume<std::fs::File,buf_redux::policy::MinBuffered>
    at C:\Users\icooke\tmp\buf_redux\src\lib.rs:413
13: std::io::read_until<buf_redux::BufReader<std::fs::File, buf_redux::policy::MinBuffered>>
    at /rustc/4560ea788cb760f0a34127156c78e2552949f734\src\libstd\io\mod.rs:1625
14: std::io::BufRead::read_line::{{closure}}<buf_redux::BufReader<std::fs::File, buf_redux::policy::MinBuffered>>
    at /rustc/4560ea788cb760f0a34127156c78e2552949f734\src\libstd\io\mod.rs:1866
```
The buf_redux code is here: https://github.com/abonander/buf_redux/blob/4c59a0f6c423f80bcd099f543c547d449d92b1d3/src/buffer/slice_deque_buf.rs#L80
As you can see, every line I read involves a GetSystemInfo() call, which is killing performance.
If the calls to allocation_granularity() are necessary, I recommend caching the value somehow. I made a local copy of slice_deque and used lazy_static to compute it once, and my performance returned to 0.5 s. If you are willing to add a dependency on lazy_static, I can put together a pull request for that.
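For illustration, here is a minimal sketch of that compute-once caching idea. The `query_allocation_granularity` stand-in and the 64 KiB value are assumptions for the example (the real slice_deque code calls GetSystemInfo() on Windows), and std's `OnceLock` is used in place of lazy_static so the sketch has no external dependencies; the effect is the same, the expensive query runs at most once per process:

```rust
use std::sync::OnceLock;

// Hypothetical stand-in for the expensive system query.
// In slice_deque this would be the GetSystemInfo()-backed call;
// 64 KiB is a typical Windows allocation granularity, assumed here.
fn query_allocation_granularity() -> usize {
    65536
}

/// Cached wrapper: the underlying query runs only on the first call;
/// every later call returns the stored value.
fn allocation_granularity() -> usize {
    static GRANULARITY: OnceLock<usize> = OnceLock::new();
    *GRANULARITY.get_or_init(query_allocation_granularity)
}

fn main() {
    let first = allocation_granularity();  // performs the query
    let second = allocation_granularity(); // served from the cache
    assert_eq!(first, second);
    println!("{}", first);
}
```

With lazy_static the wrapper would instead declare `lazy_static! { static ref GRANULARITY: usize = query_allocation_granularity(); }`; either way, callers like consume() stop paying the GetSystemInfo() cost on every invocation.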
Dear @ian-p-cooke, thank you for the report!
I've pushed a PR with a potential fix: #80. Sadly, I don't have a Windows machine to test whether it solves your issue.
Would you be so kind as to patch your slice_deque dependency to use the branch in my PR and try it out?
Thank you!
@gnzlbg that works well for me! Performance is better than my branch with lazy_static, and even slightly better than slice_deque 0.1. All tests pass for me.
Any update on this? I'm also on Windows.