Memory leak on high number of concurrent connections
prasannavl opened this issue · 57 comments
Hyper version: 0.12.25
Context
Server
- Run any hyper or hyper-derived (warp, tide, etc.) server that just serves a simple hello text response.
use log::info;
use warp::Filter;

fn main() -> Result<(), exitfailure::ExitFailure> {
    let hello = warp::any().map(|| "Hello there!");
    info!("starting");
    let addr: std::net::SocketAddr = "127.0.0.1:9000".parse()?;
    warp::serve(hello)
        .run(addr);
    Ok(())
}
(For brevity, I've used the warp app, but a basic hyper sample works as well. Here are simple hello-world servers built on several hyper-based stacks (tide, warp, raw hyper, etc.) that I used for testing: https://github.com/quixon-labs/playground-rs)
Boom! (Stress tests)
- Start bombarding the server with multiple concurrent connections, increasing them gradually.
for concurrency in 100 500 1000 2000 5000 10000 15000 20000 30000; do
    h2load -D10 -c "$concurrency" http://localhost:9000/hello
    sleep 5s
done
Memory observation
- Observe the memory rise steadily but in sporadic intervals, hitting anywhere between 130MB and 300MB. (In my tests, the release builds had significantly higher usage, since they are able to handle a higher load and fewer connections end up being dropped as a result.)
- Usage keeps increasing reliably as the concurrency increases, and sporadically otherwise. The only deterministic fact I'm able to observe is that it remains constant as long as concurrency is not increased.
- But memory never comes back down once it goes up, as one might expect, even when there is no load and the system has had enough time to catch up: closing all TCP connections, settling I/O, kernel swapping, and so on.
The thread count doesn't go beyond 5. Note that when tokio blocking functions are used, this can rise all the way to 100 threads by default (which don't go back down yet), causing a significant memory bump, so I used a simpler example to rule that out as the cause.
Heap profiles
- Running this with valgrind (massif):
Here's the relevant bit:
(I'm not entirely sure what exactly is causing this. My first guess was some thread_local cache per connection, but I couldn't see any evidence to correlate that yet.) I'm also not sure whether tokio or hyper is the cause, but it is consistently noticeable in all hyper-based servers.
--------------------------------------------------------------------------------
n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
--------------------------------------------------------------------------------
62 83,195,458,515 213,409,040 212,742,984 666,056 0
99.69% (212,742,984B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->41.87% (89,363,456B) 0x958C5A: alloc::alloc::alloc (alloc.rs:75)
| ->41.76% (89,112,576B) 0x958A40: <alloc::alloc::Global as core::alloc::Alloc>::alloc (alloc.rs:151)
| | ->41.76% (89,112,576B) 0x95A490: alloc::raw_vec::RawVec<T,A>::allocate_in (raw_vec.rs:95)
| | ->41.76% (89,112,576B) 0x95A284: alloc::raw_vec::RawVec<T>::with_capacity (raw_vec.rs:139)
| | ->41.76% (89,112,576B) 0x957002: alloc::vec::Vec<T>::with_capacity (vec.rs:355)
| | ->41.76% (89,112,576B) 0x955CF4: bytes::bytes::Inner::reserve (bytes.rs:2261)
| | | ->41.76% (89,112,576B) 0x9542E6: bytes::bytes::BytesMut::reserve (bytes.rs:1444)
| | | ->41.76% (89,112,576B) 0x291A81: hyper::proto::h1::io::Buffered<T,B>::read_from_io (io.rs:175)
| | | ->41.76% (89,112,576B) 0x293D6C: hyper::proto::h1::io::Buffered<T,B>::parse (io.rs:160)
| | | ->41.76% (89,112,576B) 0x29B18F: hyper::proto::h1::conn::Conn<I,B,T>::read_head (conn.rs:137)
| | | ->41.76% (89,112,576B) 0x217ADA: hyper::proto::h1::dispatch::Dispatcher<D,Bs,I,T>::poll_read_head (dispatch.rs:200)
| | | ->41.76% (89,112,576B) 0x218D96: hyper::proto::h1::dispatch::Dispatcher<D,Bs,I,T>::poll_read (dispatch.rs:136)
| | | ->41.76% (89,112,576B) 0x214EEF: hyper::proto::h1::dispatch::Dispatcher<D,Bs,I,T>::poll_inner (dispatch.rs:100)
| | | ->41.76% (89,112,576B) 0x214A67: hyper::proto::h1::dispatch::Dispatcher<D,Bs,I,T>::poll_catch (dispatch.rs:86)
| | | ->41.76% (89,112,576B) 0x21471E: <hyper::proto::h1::dispatch::Dispatcher<D,Bs,I,T> as futures::future::Future>::poll (dispatch.rs:348)
| | | ->41.76% (89,112,576B) 0x249788: <futures::future::either::Either<A,B> as futures::future::Future>::poll (either.rs:35)
| | | ->41.76% (89,112,576B) 0x28AE00: futures::future::option::<impl futures::future::Future for core::option::Option<F>>::poll (option.rs:12)
| | | ->41.76% (89,112,576B) 0x24E556: <hyper::server::conn::upgrades::UpgradeableConnection<I,S,E> as futures::future::Future>::poll (conn.rs:934)
| | | ->41.76% (89,112,576B) 0x241D49: <hyper::server::conn::spawn_all::NewSvcTask<I,N,S,E,W> as futures::future::Future>::poll (conn.rs:874)
| | | ->41.76% (89,112,576B) 0x91EBE3: <alloc::boxed::Box<F> as futures::future::Future>::poll (mod.rs:113)
| | | ->41.76% (89,112,576B) 0x7025E2: futures::task_impl::Spawn<T>::poll_future_notify::{{closure}} (mod.rs:326)
| | | ->41.76% (89,112,576B) 0x70285F: futures::task_impl::Spawn<T>::enter::{{closure}} (mod.rs:396)
| | | ->41.76% (89,112,576B) 0x72E3C5: futures::task_impl::std::set (mod.rs:78)
| | | ->41.76% (89,112,576B) 0x7027FD: futures::task_impl::Spawn<T>::enter (mod.rs:396)
| | | ->41.76% (89,112,576B) 0x702396: futures::task_impl::Spawn<T>::poll_fn_notify (mod.rs:288)
| | | ->41.76% (89,112,576B) 0x702586: futures::task_impl::Spawn<T>::poll_future_notify (mod.rs:326)
| | | ->41.76% (89,112,576B) 0x70C200: tokio_threadpool::task::Task::run::{{closure}} (mod.rs:145)
| | | ->41.76% (89,112,576B) 0x715FE7: core::ops::function::FnOnce::call_once (function.rs:231)
| | | ->41.76% (89,112,576B) 0x704497: <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once (panic.rs:309)
| | | ->41.76% (89,112,576B) 0x7020D1: std::panicking::try::do_call (panicking.rs:293)
| | |
| | ->00.00% (0B) in 1+ places, all below ms_print's threshold (01.00%)
| |
| ->00.12% (250,880B) in 1+ places, all below ms_print's threshold (01.00%)
|
->41.76% (89,112,618B) 0x828E4A: alloc::alloc::alloc (alloc.rs:75)
| ->41.76% (89,112,618B) 0x828C30: <alloc::alloc::Global as core::alloc::Alloc>::alloc (alloc.rs:151)
| ->41.76% (89,112,618B) 0x827A90: alloc::raw_vec::RawVec<T,A>::allocate_in (raw_vec.rs:95)
| ->41.76% (89,112,618B) 0x8278D4: alloc::raw_vec::RawVec<T>::with_capacity (raw_vec.rs:139)
| ->41.76% (89,112,618B) 0x828852: alloc::vec::Vec<T>::with_capacity (vec.rs:355)
| ->41.76% (89,112,576B) 0x290E55: hyper::proto::h1::io::WriteBuf<B>::new (io.rs:413)
| | ->41.76% (89,112,576B) 0x292974: hyper::proto::h1::io::Buffered<T,B>::new (io.rs:65)
| | ->41.76% (89,112,576B) 0x29927E: hyper::proto::h1::conn::Conn<I,B,T>::new (conn.rs:39)
| | ->41.76% (89,112,576B) 0x1DD2A7: hyper::server::conn::Http<E>::serve_connection (conn.rs:374)
| | ->41.76% (89,112,576B) 0x1DFB9A: <hyper::server::conn::Connecting<I,F,E> as futures::future::Future>::poll (conn.rs:732)
| | ->41.76% (89,112,576B) 0x241ADF: <hyper::server::conn::spawn_all::NewSvcTask<I,N,S,E,W> as futures::future::Future>::poll (conn.rs:864)
| | ->41.76% (89,112,576B) 0x91EBE3: <alloc::boxed::Box<F> as futures::future::Future>::poll (mod.rs:113)
| | ->41.76% (89,112,576B) 0x7025E2: futures::task_impl::Spawn<T>::poll_future_notify::{{closure}} (mod.rs:326)
| | ->41.76% (89,112,576B) 0x70285F: futures::task_impl::Spawn<T>::enter::{{closure}} (mod.rs:396)
| | ->41.76% (89,112,576B) 0x72E3C5: futures::task_impl::std::set (mod.rs:78)
| | ->41.76% (89,112,576B) 0x7027FD: futures::task_impl::Spawn<T>::enter (mod.rs:396)
| | ->41.76% (89,112,576B) 0x702396: futures::task_impl::Spawn<T>::poll_fn_notify (mod.rs:288)
| | ->41.76% (89,112,576B) 0x702586: futures::task_impl::Spawn<T>::poll_future_notify (mod.rs:326)
| | ->41.76% (89,112,576B) 0x70C200: tokio_threadpool::task::Task::run::{{closure}} (mod.rs:145)
| | ->41.76% (89,112,576B) 0x715FE7: core::ops::function::FnOnce::call_once (function.rs:231)
| | ->41.76% (89,112,576B) 0x704497: <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once (panic.rs:309)
| | ->41.76% (89,112,576B) 0x7020D1: std::panicking::try::do_call (panicking.rs:293)
| | ->41.76% (89,112,576B) 0x979428: __rust_maybe_catch_panic (lib.rs:87)
| | ->41.76% (89,112,576B) 0x701E0E: std::panicking::try (panicking.rs:272)
| | ->41.76% (89,112,576B) 0x70457F: std::panic::catch_unwind (panic.rs:388)
| | ->41.76% (89,112,576B) 0x70BC84: tokio_threadpool::task::Task::run (mod.rs:130)
| | ->41.76% (89,112,576B) 0x722C0D: tokio_threadpool::worker::Worker::run_task2 (mod.rs:567)
| | ->41.76% (89,112,576B) 0x7224D2: tokio_threadpool::worker::Worker::run_task (mod.rs:459)
| | ->41.44% (88,432,640B) 0x721B51: tokio_threadpool::worker::Worker::try_run_owned_task (mod.rs:390)
| | | ->41.44% (88,432,640B) 0x72158B: tokio_threadpool::worker::Worker::try_run_task (mod.rs:297)
| | |
| | ->00.32% (679,936B) in 1+ places, all below ms_print's threshold (01.00%)
| |
| ->00.00% (42B) in 1+ places, all below ms_print's threshold (01.00%)
Here's the full massif profile: massif.out.zip
Note: The above massif profile is from a cargo build without release optimisations, which caused the memory levels to stay much lower. Running this on release builds at least doubles the usage compared to the debug builds.
Edit:
I'm attaching another massif profile (release build) that's probably more interesting:
--------------------------------------------------------------------------------
n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
--------------------------------------------------------------------------------
79 7,621,402,146 408,363,480 407,058,828 1,304,652 0
99.68% (407,058,828B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->41.75% (170,500,096B) 0x184EC8: hyper::proto::h1::io::Buffered<T,B>::new (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| ->41.75% (170,500,096B) 0x186BAC: <hyper::server::conn::spawn_all::NewSvcTask<I,N,S,E,W> as futures::future::Future>::poll (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| ->41.75% (170,500,096B) 0x2D3CB0: futures::task_impl::std::set (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| ->41.75% (170,500,096B) 0x2D48A4: futures::task_impl::Spawn<T>::poll_future_notify (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| ->41.75% (170,500,096B) 0x2D78DF: _ZN3std9panicking3try7do_call17hf61819960f116aefE.llvm.11685027487046046852 (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| ->41.75% (170,500,096B) 0x347D38: __rust_maybe_catch_panic (lib.rs:87)
| ->41.75% (170,500,096B) 0x2D7E7E: tokio_threadpool::task::Task::run (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| ->41.75% (170,500,096B) 0x2D0463: tokio_threadpool::worker::Worker::run_task (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| ->39.95% (163,135,488B) 0x2CF95C: tokio_threadpool::worker::Worker::run (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| | ->39.95% (163,135,488B) 0x2B3705: tokio_trace_core::dispatcher::with_default (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| | ->39.95% (163,135,488B) 0x2B48F4: std::thread::local::LocalKey<T>::with (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| | ->39.95% (163,135,488B) 0x2B4ABC: std::thread::local::LocalKey<T>::with (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| | ->39.95% (163,135,488B) 0x2B4C42: std::thread::local::LocalKey<T>::with (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| | ->39.95% (163,135,488B) 0x2B8015: tokio::runtime::threadpool::builder::Builder::build::{{closure}} (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| | ->39.95% (163,135,488B) 0x2D8921: std::thread::local::LocalKey<T>::with (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| | ->39.95% (163,135,488B) 0x2D86FC: std::thread::local::LocalKey<T>::with (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| | ->39.95% (163,135,488B) 0x2DA56F: std::sys_common::backtrace::__rust_begin_short_backtrace (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| | ->39.95% (163,135,488B) 0x2D789A: _ZN3std9panicking3try7do_call17h20721f9dd88a1855E.llvm.11685027487046046852 (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| | ->39.95% (163,135,488B) 0x347D38: __rust_maybe_catch_panic (lib.rs:87)
| | ->39.95% (163,135,488B) 0x2D259E: <F as alloc::boxed::FnBox<A>>::call_box (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| | ->39.95% (163,135,488B) 0x34715C: std::sys::unix::thread::Thread::new::thread_start (boxed.rs:759)
| | ->39.95% (163,135,488B) 0x487D162: start_thread (pthread_create.c:486)
| | ->39.95% (163,135,488B) 0x49CADED: clone (clone.S:95)
| |
| ->01.80% (7,364,608B) 0x2CFB67: tokio_threadpool::worker::Worker::run (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| ->01.80% (7,364,608B) 0x2B3705: tokio_trace_core::dispatcher::with_default (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| ->01.80% (7,364,608B) 0x2B48F4: std::thread::local::LocalKey<T>::with (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| ->01.80% (7,364,608B) 0x2B4ABC: std::thread::local::LocalKey<T>::with (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| ->01.80% (7,364,608B) 0x2B4C42: std::thread::local::LocalKey<T>::with (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| ->01.80% (7,364,608B) 0x2B8015: tokio::runtime::threadpool::builder::Builder::build::{{closure}} (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| ->01.80% (7,364,608B) 0x2D8921: std::thread::local::LocalKey<T>::with (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| ->01.80% (7,364,608B) 0x2D86FC: std::thread::local::LocalKey<T>::with (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| ->01.80% (7,364,608B) 0x2DA56F: std::sys_common::backtrace::__rust_begin_short_backtrace (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| ->01.80% (7,364,608B) 0x2D789A: _ZN3std9panicking3try7do_call17h20721f9dd88a1855E.llvm.11685027487046046852 (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| ->01.80% (7,364,608B) 0x347D38: __rust_maybe_catch_panic (lib.rs:87)
| ->01.80% (7,364,608B) 0x2D259E: <F as alloc::boxed::FnBox<A>>::call_box (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| ->01.80% (7,364,608B) 0x34715C: std::sys::unix::thread::Thread::new::thread_start (boxed.rs:759)
| ->01.80% (7,364,608B) 0x487D162: start_thread (pthread_create.c:486)
| ->01.80% (7,364,608B) 0x49CADED: clone (clone.S:95)
|
->41.75% (170,500,096B) 0x333E8A: bytes::bytes::Inner::reserve (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| ->41.75% (170,500,096B) 0x184D6C: hyper::proto::h1::io::Buffered<T,B>::read_from_io (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| ->41.75% (170,500,096B) 0x18585E: hyper::proto::h1::io::Buffered<T,B>::parse (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| ->41.75% (170,500,096B) 0x166312: hyper::proto::h1::conn::Conn<I,B,T>::read_head (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| ->41.75% (170,500,096B) 0x16C6F3: hyper::proto::h1::dispatch::Dispatcher<D,Bs,I,T>::poll_catch (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| ->41.75% (170,500,096B) 0x1A9B8F: <hyper::server::conn::upgrades::UpgradeableConnection<I,S,E> as futures::future::Future>::poll (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| ->41.75% (170,500,096B) 0x1875B1: <hyper::server::conn::spawn_all::NewSvcTask<I,N,S,E,W> as futures::future::Future>::poll (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| ->41.75% (170,500,096B) 0x2D3CB0: futures::task_impl::std::set (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| ->41.75% (170,500,096B) 0x2D48A4: futures::task_impl::Spawn<T>::poll_future_notify (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| ->41.75% (170,500,096B) 0x2D78DF: _ZN3std9panicking3try7do_call17hf61819960f116aefE.llvm.11685027487046046852 (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| ->41.75% (170,500,096B) 0x347D38: __rust_maybe_catch_panic (lib.rs:87)
| ->41.75% (170,500,096B) 0x2D7E7E: tokio_threadpool::task::Task::run (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| ->41.75% (170,500,096B) 0x2D0463: tokio_threadpool::worker::Worker::run_task (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| ->39.95% (163,135,488B) 0x2CF95C: tokio_threadpool::worker::Worker::run (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| | ->39.95% (163,135,488B) 0x2B3705: tokio_trace_core::dispatcher::with_default (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| | ->39.95% (163,135,488B) 0x2B48F4: std::thread::local::LocalKey<T>::with (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| | ->39.95% (163,135,488B) 0x2B4ABC: std::thread::local::LocalKey<T>::with (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| | ->39.95% (163,135,488B) 0x2B4C42: std::thread::local::LocalKey<T>::with (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| | ->39.95% (163,135,488B) 0x2B8015: tokio::runtime::threadpool::builder::Builder::build::{{closure}} (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| | ->39.95% (163,135,488B) 0x2D8921: std::thread::local::LocalKey<T>::with (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| | ->39.95% (163,135,488B) 0x2D86FC: std::thread::local::LocalKey<T>::with (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| | ->39.95% (163,135,488B) 0x2DA56F: std::sys_common::backtrace::__rust_begin_short_backtrace (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| | ->39.95% (163,135,488B) 0x2D789A: _ZN3std9panicking3try7do_call17h20721f9dd88a1855E.llvm.11685027487046046852 (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| | ->39.95% (163,135,488B) 0x347D38: __rust_maybe_catch_panic (lib.rs:87)
| | ->39.95% (163,135,488B) 0x2D259E: <F as alloc::boxed::FnBox<A>>::call_box (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| | ->39.95% (163,135,488B) 0x34715C: std::sys::unix::thread::Thread::new::thread_start (boxed.rs:759)
| | ->39.95% (163,135,488B) 0x487D162: start_thread (pthread_create.c:486)
| | ->39.95% (163,135,488B) 0x49CADED: clone (clone.S:95)
| |
| ->01.80% (7,364,608B) 0x2CFB67: tokio_threadpool::worker::Worker::run (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| ->01.80% (7,364,608B) 0x2B3705: tokio_trace_core::dispatcher::with_default (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| ->01.80% (7,364,608B) 0x2B48F4: std::thread::local::LocalKey<T>::with (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| ->01.80% (7,364,608B) 0x2B4ABC: std::thread::local::LocalKey<T>::with (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| ->01.80% (7,364,608B) 0x2B4C42: std::thread::local::LocalKey<T>::with (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| ->01.80% (7,364,608B) 0x2B8015: tokio::runtime::threadpool::builder::Builder::build::{{closure}} (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| ->01.80% (7,364,608B) 0x2D8921: std::thread::local::LocalKey<T>::with (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| ->01.80% (7,364,608B) 0x2D86FC: std::thread::local::LocalKey<T>::with (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| ->01.80% (7,364,608B) 0x2DA56F: std::sys_common::backtrace::__rust_begin_short_backtrace (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| ->01.80% (7,364,608B) 0x2D789A: _ZN3std9panicking3try7do_call17h20721f9dd88a1855E.llvm.11685027487046046852 (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| ->01.80% (7,364,608B) 0x347D38: __rust_maybe_catch_panic (lib.rs:87)
| ->01.80% (7,364,608B) 0x2D259E: <F as alloc::boxed::FnBox<A>>::call_box (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| ->01.80% (7,364,608B) 0x34715C: std::sys::unix::thread::Thread::new::thread_start (boxed.rs:759)
| ->01.80% (7,364,608B) 0x487D162: start_thread (pthread_create.c:486)
| ->01.80% (7,364,608B) 0x49CADED: clone (clone.S:95)
Full massif profile [release build]:
massif.release.out.zip
Thanks for the detailed writeup! I have some follow-up questions:
- I see the tool's name is h2load, is it testing HTTP2? The stacks look like HTTP1, so I wanted to clarify.
- Did you notice this recently in newer versions? It'd be useful to know if this occurred in older versions of hyper or tokio.
I see the tool's name is h2load, is it testing HTTP2? The stacks look like HTTP1, so I wanted to clarify.
Oh, this specific one is just HTTP1 (since I'm just spinning up a basic server without TLS, defaulting to HTTP1). But it happens with HTTP2 as well - I remember first noticing the problem running it on warp with TLS. However, let me verify this again soon on HTTP2.
Did you notice this recently in newer versions? It'd be useful to know if this occurred in older versions of hyper or tokio.
Unfortunately, this is the only version I've tested so far. I'll update in case I'm able to run some more tests soon.
Thanks for the extra info! I've looked around, and so far nothing jumps out at me. It seems everything is dropped correctly when the connection is closed, which makes me suspect it's just the memory allocator holding onto the blocks instead of eagerly releasing them back to the OS. I'll try some other log configs to see what else I can learn.
@seanmonstar I think those Massif profiles indicate that the memory is held by some Buffered instances (assuming they're taken after everything got dropped).
Which makes me suspect it's just the memory allocator holding onto the blocks instead of eagerly releasing them back to the OS
One quick way to rule this out would be to test this with a raw tokio TCP socket. However..
I think those Massif profiles indicate that the memory is held by some Buffered instances (assuming they're taken after everything got dropped).
I think @lnicola is right. It's a reasonably accurate indication that it is the Buffered instances that have leftovers - I'm just not sure when, and for what reason, they are being left over.
Apologies, I was unfamiliar with Massif; I thought it was just saying where memory had been allocated before. If it's saying that memory isn't being deallocated, then it seems the task owning that memory is just hanging around in the executor. This could either be due to a flaw in the executor (less likely), or a specific case causing hyper's connection task to stop polling even though the connection closed (more likely).
I've been using this patch to monitor if the tasks are dropped, and so far with hyper's hello world example, it always drops back to 0. There may be something where a connection hangs up while hyper is processing a request body or maybe sitting in idle mode, but I've so far not found how to trigger it.
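(Not the patch referenced above, which isn't shown here, but for anyone wanting to run the same kind of check: a hypothetical equivalent is a guard whose Drop decrements a counter, held inside each spawned connection task. All names below are illustrative, not hyper's.)

use std::sync::atomic::{AtomicUsize, Ordering};

// Global count of live connection tasks (illustrative only).
static LIVE_TASKS: AtomicUsize = AtomicUsize::new(0);

// Held for the lifetime of a spawned task; its Drop runs when the task
// (and everything it owns, such as its read/write buffers) is dropped.
struct TaskGuard;

impl TaskGuard {
    fn new() -> Self {
        LIVE_TASKS.fetch_add(1, Ordering::SeqCst);
        TaskGuard
    }
}

impl Drop for TaskGuard {
    fn drop(&mut self) {
        let remaining = LIVE_TASKS.fetch_sub(1, Ordering::SeqCst) - 1;
        eprintln!("live connection tasks: {}", remaining);
    }
}

fn main() {
    // The guard would normally be created inside each spawned task.
    let _guard = TaskGuard::new();
} // `_guard` dropped here, printing the remaining count.

If this count drops back to 0 after load stops while RSS stays high, the task-owned buffers were freed from the program's point of view, and the remaining question is what the allocator does with them.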
Could it be related to half-closed connections? Or does that not apply to HTTP/2?
It could; by default hyper allows half-closed connections from the client. If there's a response to write, though, that should either finish the connection and close, or notice the EPIPE if streaming a big response.
Apologies, I was unfamiliar with Massif; I thought it was just saying where memory had been allocated before.
@seanmonstar I don't think you're completely off from how massif works. AFAIK, it primarily relies on collecting allocations through a time period. The reason I think it's a reasonably accurate indication is correlation: this pathway seems to be the only one that generates remotely enough allocations, and it matches up with the overall memory usage.
What I pasted into the issue is just one small part of the profile that I thought was relevant. If you look at the entire profile and run it with either the massif-visualizer GUI or ms_print on the command line, that should give you a better idea.
I've been using this patch to monitor if the tasks are dropped, and so far with hyper's hello world example, it always drops back to 0. There may be something where a connection hangs up while hyper is processing a request body or maybe sitting in idle mode, but I've so far not found how to trigger it.
I just tried this. Same result. Drops back to 0, but memory usage seems to increase as usual.
Just thinking aloud looking at this - could there be any pooled memory that's being reused, increasing its capacity under load but never trimming itself again? More precisely, a Vec or Bytes that's being reused but growing its capacity without trimming back down?
The buffer inside Buffered doesn't ever shrink itself; it just has a limit on its growth, and it only grows on a heuristic that watches for individual reads filling the whole buffer. But that buffer is owned by the task, and since the logs seem to suggest all the tasks are being dropped, it should be freed.
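To make that growth strategy concrete, here's a minimal sketch (not hyper's actual code; the names and the default/maximum sizes are made up) of a read buffer size that doubles only when a single read fills it completely, caps out at a maximum, and never shrinks:

/// Illustrative read-size strategy: grow only when one read fills the
/// current buffer, cap at `max`, never shrink.
struct ReadStrategy {
    next: usize,
    max: usize,
}

impl ReadStrategy {
    fn new() -> Self {
        // Hypothetical defaults, not hyper's.
        ReadStrategy { next: 8 * 1024, max: 400 * 1024 }
    }

    /// Record how many bytes the last read returned.
    fn record(&mut self, bytes_read: usize) {
        // Only an exactly-full read suggests the buffer is too small.
        if bytes_read >= self.next && self.next < self.max {
            self.next = (self.next * 2).min(self.max);
        }
        // There is intentionally no branch that ever reduces `next`;
        // the capacity is only reclaimed when the owning task is dropped.
    }

    fn next_read_size(&self) -> usize {
        self.next
    }
}

fn main() {
    let mut strategy = ReadStrategy::new();
    strategy.record(8 * 1024); // a full read: next size doubles to 16 KiB
    strategy.record(100);      // a small read: next size stays at 16 KiB
    println!("next read size: {} bytes", strategy.next_read_size());
}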
Since the memory was reaching a large size and not being freed, but also not growing bigger when the benchmark is run again against the same running server, I assumed that something was just reusing memory; and since the tasks are dropped and their buffers freed, I started to suspect the allocator. But if it's only happening in hyper and not with some similar buffered TCP server in Rust, then it is likely a bug somewhere in hyper. I just don't yet know where.
@seanmonstar Any update on this issue?
@ccgogo123 the latest status is all there. Are you experiencing this? It could be helpful to dig more.
If not, I actually was thinking of closing as there doesn't seem to be much more to do (I don't experience a leak myself).
@seanmonstar What I'm experiencing is that the rusoto_s3 client doesn't release the memory under high concurrency. I happened to see this issue and suspect the root cause might be this.
@ccgogo123 - Would be helpful to see if you can eliminate tokio threadpool as the cause. Are you able to provide any insight?
@seanmonstar - Could you hold it off until Jan 2020? I should be able to get some free time mid Jan, and would like to take another stab at digging in, and also check on newer std future based versions.
Hi, I am experiencing this as well in https://github.com/gauteh/dars. This is streaming netCDF files using the DAP protocol. In this case it keeps increasing with sequential requests as well, without ever going down. I am using hyper directly with tokio. I have not been able to isolate the leak in the netcdf reading library, so I suspect I am seeing this issue.
@seanmonstar Any further update on this issue? I'm still facing this.
I have tried this with warp 0.2 and tide 0.8.0, and with high concurrent load the memory keeps rising but is not released even hours after the load drops.
FYI - I used the following for loadtest:
wrk -t12 -c400 -d10s <url>
The tests are pretty basic. For example, this is the hello-world code for warp:
use warp::Filter;

#[tokio::main]
async fn main() {
    // GET /hello/warp => 200 OK with body "Hello, warp!"
    let hello = warp::path!("hello" / String)
        .map(|name| format!("Hello, {}!", name));

    warp::serve(hello)
        .run(([127, 0, 0, 1], 3030))
        .await;
}
Tests were run on macOS Mojave.
Cargo version 1.43.0
As stated before, all updates are in the issue.
I have not been able to reproduce it myself. It's difficult to pinpoint that this is actually leaked memory; it can also be a case of the allocator not releasing memory back to the OS. Does the behavior exist if you use jemalloc instead of the system allocator?
I'm inclined to close this (really this time) unless someone would like to dig more into this themselves, as I can't trigger it.
I also have this problem. I'm concerned because if I put my private app into our production system (our peak is 1M+ simultaneous connections), I think I'm guaranteed to have a memory leak.
I've created a repo to maybe? help with reproducing the problem: https://github.com/brianbruggeman/warp-mem-leak-test
Getting started:
git clone https://github.com/brianbruggeman/warp-mem-leak-test.git warp-mem-leaktest
cd warp-mem-leaktest
./test-me.sh
Environment variables
- HOST - the host name to use (default: 127.0.0.1)
- PORT - the port to use (default: 3030)
- DURATION - the duration of the load test (default: 3 seconds)
- WORKERS - the number of simultaneous workers (default: 100)
Highlights:
- I can see this problem with a single load test (using hey) over 1 second with 1 worker on my machine.
- I'm not using anything particularly special (just warp and full featured tokio).
- I'm not running with any specific release or compiler settings
- I'm using the bash command, ps, and the rss field to capture change in kilobytes
- The first iteration always seems to be the largest
- Subsequent iterations may report 0 KB changes, but I cannot know if that's due to precision error (rss is in KB) or if it's really 0
- I'm not seeing consistent values; each invocation produces a different set of values (though they are similar).
Hi @brianbruggeman, have you read #1790 (comment) and tried with jemalloc instead of the system allocator?
Edit:
Assuming I correctly set the allocator, I still see the same problems.
For my jemalloc test, I added jemallocator-global.
I also created a branch specifically for this test: https://github.com/brianbruggeman/warp-mem-leak-test/tree/jemalloc
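For reference, a minimal sketch of switching to jemalloc explicitly with the jemallocator crate (the jemallocator-global crate mentioned above is intended to do roughly this for you just by being a dependency, so treat this as an equivalent rather than the exact setup used):

// Requires the `jemallocator` crate as a dependency.
use jemallocator::Jemalloc;

// Route every heap allocation in the process through jemalloc instead of
// the system allocator.
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    // Application code is otherwise unchanged; only the allocator differs.
    let v: Vec<u8> = Vec::with_capacity(1024);
    println!("allocated a {}-byte buffer via jemalloc", v.capacity());
}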
Let me jump in here - I can confirm @brianbruggeman's inference on allocators. I had switched allocators in the past and don't recall it making any difference.
(Sorry for not following up before - ended up with other life priorities unfortunately)
@seanmonstar would you mind stating your platform, OS, memory, etc.? I'm curious to see if that can help dig in further, since interestingly you don't seem to be running into this (is that still the case?)
@prasannavl My next thought was to use a docker image, since a container is more portable and if we can provide a docker image that definitely has this leak, then we can all use the same platform.
Maybe the easiest would be to test on an OpenBSD container: https://stackoverflow.com/questions/2215259/will-malloc-implementations-return-free-ed-memory-back-to-the-system?rq=1 ?
Running @brianbruggeman's test case (which wrongly reports RSS increase as memory leaked) multiple times against the same process, the process RSS didn't exceed a certain size:
----------------------------------------
Settings:
----------------------------------------
Iterating 10 times
Loading 127.0.0.1:3030
Running for 3s each iteration
Using 100 workers
----------------------------------------
===============================
Iteration: 1 of 10
-------------------------------
Requests/sec: 48387.9238
Latency (worst case): 0.0072 secs
-------------------------------
Starting memory: 11448 KB
Ending memory: 11236 KB
Memory leaked: -212 KB
===============================
Iteration: 2 of 10
-------------------------------
Requests/sec: 69314.8785
Latency (worst case): 0.0041 secs
-------------------------------
Starting memory: 11236 KB
Ending memory: 11312 KB
Memory leaked: 76 KB
===============================
Iteration: 3 of 10
-------------------------------
Requests/sec: 69045.5330
Latency (worst case): 0.0041 secs
-------------------------------
Starting memory: 11312 KB
Ending memory: 11040 KB
Memory leaked: -272 KB
===============================
Iteration: 4 of 10
-------------------------------
Requests/sec: 65886.4996
Latency (worst case): 0.0042 secs
-------------------------------
Starting memory: 11040 KB
Ending memory: 10864 KB
Memory leaked: -176 KB
===============================
Iteration: 5 of 10
-------------------------------
Requests/sec: 64415.6591
Latency (worst case): 0.0042 secs
-------------------------------
Starting memory: 10864 KB
Ending memory: 11004 KB
Memory leaked: 140 KB
===============================
Iteration: 6 of 10
-------------------------------
Requests/sec: 58950.3058
Latency (worst case): 0.0054 secs
-------------------------------
Starting memory: 11004 KB
Ending memory: 10244 KB
Memory leaked: -760 KB
===============================
Iteration: 7 of 10
-------------------------------
Requests/sec: 43750.1205
Latency (worst case): 0.0095 secs
-------------------------------
Starting memory: 10244 KB
Ending memory: 10420 KB
Memory leaked: 176 KB
===============================
Iteration: 8 of 10
-------------------------------
Requests/sec: 60437.2641
Latency (worst case): 0.0047 secs
-------------------------------
Starting memory: 10420 KB
Ending memory: 10560 KB
Memory leaked: 140 KB
===============================
Iteration: 9 of 10
-------------------------------
Requests/sec: 57798.8734
Latency (worst case): 0.0049 secs
-------------------------------
Starting memory: 10560 KB
Ending memory: 10276 KB
Memory leaked: -284 KB
===============================
Iteration: 10 of 10
-------------------------------
Requests/sec: 46854.6004
Latency (worst case): 0.0106 secs
-------------------------------
Starting memory: 10276 KB
Ending memory: 10612 KB
Memory leaked: 336 KB
===============================
----------------------------------------
Settings:
----------------------------------------
Iterating 10 times
Loading 127.0.0.1:3030
Running for 3s each iteration
Using 100 workers
----------------------------------------
===============================
Iteration: 1 of 10
-------------------------------
Requests/sec: 69239.5438
Latency (worst case): 0.0041 secs
-------------------------------
Starting memory: 10612 KB
Ending memory: 10936 KB
Memory leaked: 324 KB
===============================
Iteration: 2 of 10
-------------------------------
Requests/sec: 68014.9964
Latency (worst case): 0.0042 secs
-------------------------------
Starting memory: 10936 KB
Ending memory: 10812 KB
Memory leaked: -124 KB
===============================
Iteration: 3 of 10
-------------------------------
Requests/sec: 65241.9599
Latency (worst case): 0.0044 secs
-------------------------------
Starting memory: 10812 KB
Ending memory: 11176 KB
Memory leaked: 364 KB
===============================
Iteration: 4 of 10
-------------------------------
Requests/sec: 64965.6894
Latency (worst case): 0.0044 secs
-------------------------------
Starting memory: 11176 KB
Ending memory: 11488 KB
Memory leaked: 312 KB
===============================
Iteration: 5 of 10
-------------------------------
Requests/sec: 64656.9231
Latency (worst case): 0.0044 secs
-------------------------------
Starting memory: 11488 KB
Ending memory: 11156 KB
Memory leaked: -332 KB
===============================
Iteration: 6 of 10
-------------------------------
Requests/sec: 62959.1025
Latency (worst case): 0.0045 secs
-------------------------------
Starting memory: 11156 KB
Ending memory: 11476 KB
Memory leaked: 320 KB
===============================
Iteration: 7 of 10
-------------------------------
Requests/sec: 63990.3273
Latency (worst case): 0.0043 secs
-------------------------------
Starting memory: 11476 KB
Ending memory: 11748 KB
Memory leaked: 272 KB
===============================
Iteration: 8 of 10
-------------------------------
Requests/sec: 61384.1161
Latency (worst case): 0.0055 secs
-------------------------------
Starting memory: 11748 KB
Ending memory: 11764 KB
Memory leaked: 16 KB
===============================
Iteration: 9 of 10
-------------------------------
Requests/sec: 65772.4001
Latency (worst case): 0.0041 secs
-------------------------------
Starting memory: 11764 KB
Ending memory: 11460 KB
Memory leaked: -304 KB
===============================
Iteration: 10 of 10
-------------------------------
Requests/sec: 60818.1767
Latency (worst case): 0.0052 secs
-------------------------------
Starting memory: 11460 KB
Ending memory: 11552 KB
Memory leaked: 92 KB
===============================
----------------------------------------
Settings:
----------------------------------------
Iterating 10 times
Loading 127.0.0.1:3030
Running for 3s each iteration
Using 100 workers
----------------------------------------
===============================
Iteration: 1 of 10
-------------------------------
Requests/sec: 69513.3845
Latency (worst case): 0.0043 secs
-------------------------------
Starting memory: 11552 KB
Ending memory: 11312 KB
Memory leaked: -240 KB
===============================
Iteration: 2 of 10
-------------------------------
Requests/sec: 68651.8908
Latency (worst case): 0.0041 secs
-------------------------------
Starting memory: 11312 KB
Ending memory: 11940 KB
Memory leaked: 628 KB
===============================
Iteration: 3 of 10
-------------------------------
Requests/sec: 66253.8193
Latency (worst case): 0.0043 secs
-------------------------------
Starting memory: 11940 KB
Ending memory: 11496 KB
Memory leaked: -444 KB
===============================
Iteration: 4 of 10
-------------------------------
Requests/sec: 63073.4260
Latency (worst case): 0.0047 secs
-------------------------------
Starting memory: 11496 KB
Ending memory: 11412 KB
Memory leaked: -84 KB
===============================
Iteration: 5 of 10
-------------------------------
Requests/sec: 41630.8405
Latency (worst case): 0.0098 secs
-------------------------------
Starting memory: 11412 KB
Ending memory: 11516 KB
Memory leaked: 104 KB
===============================
Iteration: 6 of 10
-------------------------------
Requests/sec: 64138.8079
Latency (worst case): 0.0044 secs
-------------------------------
Starting memory: 11516 KB
Ending memory: 11800 KB
Memory leaked: 284 KB
===============================
Iteration: 7 of 10
-------------------------------
Requests/sec: 65103.9589
Latency (worst case): 0.0043 secs
-------------------------------
Starting memory: 11800 KB
Ending memory: 11800 KB
Memory leaked: 0 KB
===============================
Iteration: 8 of 10
-------------------------------
Requests/sec: 64948.5499
Latency (worst case): 0.0044 secs
-------------------------------
Starting memory: 11800 KB
Ending memory: 12272 KB
Memory leaked: 472 KB
===============================
Iteration: 9 of 10
-------------------------------
Requests/sec: 64483.5093
Latency (worst case): 0.0043 secs
-------------------------------
Starting memory: 12272 KB
Ending memory: 12400 KB
Memory leaked: 128 KB
===============================
Iteration: 10 of 10
-------------------------------
Requests/sec: 65063.2982
Latency (worst case): 0.0043 secs
-------------------------------
Starting memory: 12400 KB
Ending memory: 11956 KB
Memory leaked: -444 KB
===============================
----------------------------------------
Settings:
----------------------------------------
Iterating 10 times
Loading 127.0.0.1:3030
Running for 3s each iteration
Using 100 workers
----------------------------------------
===============================
Iteration: 1 of 10
-------------------------------
Requests/sec: 71430.5112
Latency (worst case): 0.0039 secs
-------------------------------
Starting memory: 11956 KB
Ending memory: 11356 KB
Memory leaked: -600 KB
===============================
Iteration: 2 of 10
-------------------------------
Requests/sec: 68816.3673
Latency (worst case): 0.0042 secs
-------------------------------
Starting memory: 11356 KB
Ending memory: 11292 KB
Memory leaked: -64 KB
===============================
Iteration: 3 of 10
-------------------------------
Requests/sec: 67381.1404
Latency (worst case): 0.0043 secs
-------------------------------
Starting memory: 11292 KB
Ending memory: 11500 KB
Memory leaked: 208 KB
===============================
Iteration: 4 of 10
-------------------------------
Requests/sec: 65377.7233
Latency (worst case): 0.0045 secs
-------------------------------
Starting memory: 11500 KB
Ending memory: 11492 KB
Memory leaked: -8 KB
===============================
Iteration: 5 of 10
-------------------------------
Requests/sec: 61778.3874
Latency (worst case): 0.0049 secs
-------------------------------
Starting memory: 11492 KB
Ending memory: 11068 KB
Memory leaked: -424 KB
===============================
Iteration: 6 of 10
-------------------------------
Requests/sec: 61428.5839
Latency (worst case): 0.0047 secs
-------------------------------
Starting memory: 11068 KB
Ending memory: 11372 KB
Memory leaked: 304 KB
===============================
Iteration: 7 of 10
-------------------------------
Requests/sec: 61004.0751
Latency (worst case): 0.0048 secs
-------------------------------
Starting memory: 11372 KB
Ending memory: 11972 KB
Memory leaked: 600 KB
===============================
Iteration: 8 of 10
-------------------------------
Requests/sec: 60177.3556
Latency (worst case): 0.0049 secs
-------------------------------
Starting memory: 11972 KB
Ending memory: 11360 KB
Memory leaked: -612 KB
===============================
Iteration: 9 of 10
-------------------------------
Requests/sec: 56961.2634
Latency (worst case): 0.0053 secs
-------------------------------
Starting memory: 11360 KB
Ending memory: 11600 KB
Memory leaked: 240 KB
===============================
Iteration: 10 of 10
-------------------------------
Requests/sec: 61224.8200
Latency (worst case): 0.0049 secs
-------------------------------
Starting memory: 11600 KB
Ending memory: 11540 KB
Memory leaked: -60 KB
===============================
----------------------------------------
Settings:
----------------------------------------
Iterating 10 times
Loading 127.0.0.1:3030
Running for 3s each iteration
Using 100 workers
----------------------------------------
===============================
Iteration: 1 of 10
-------------------------------
Requests/sec: 67056.4463
Latency (worst case): 0.0043 secs
-------------------------------
Starting memory: 11540 KB
Ending memory: 11964 KB
Memory leaked: 424 KB
===============================
Iteration: 2 of 10
-------------------------------
Requests/sec: 66026.3226
Latency (worst case): 0.0042 secs
-------------------------------
Starting memory: 11964 KB
Ending memory: 12140 KB
Memory leaked: 176 KB
===============================
Iteration: 3 of 10
-------------------------------
Requests/sec: 66627.3795
Latency (worst case): 0.0044 secs
-------------------------------
Starting memory: 12140 KB
Ending memory: 12108 KB
Memory leaked: -32 KB
===============================
Iteration: 4 of 10
-------------------------------
Requests/sec: 66494.5235
Latency (worst case): 0.0041 secs
-------------------------------
Starting memory: 12108 KB
Ending memory: 11924 KB
Memory leaked: -184 KB
===============================
Iteration: 5 of 10
-------------------------------
Requests/sec: 55494.9673
Latency (worst case): 0.0061 secs
-------------------------------
Starting memory: 11924 KB
Ending memory: 11860 KB
Memory leaked: -64 KB
===============================
Iteration: 6 of 10
-------------------------------
Requests/sec: 46438.2939
Latency (worst case): 0.0071 secs
-------------------------------
Starting memory: 11860 KB
Ending memory: 12332 KB
Memory leaked: 472 KB
===============================
Iteration: 7 of 10
-------------------------------
Requests/sec: 65939.5318
Latency (worst case): 0.0041 secs
-------------------------------
Starting memory: 12332 KB
Ending memory: 12436 KB
Memory leaked: 104 KB
===============================
Iteration: 8 of 10
-------------------------------
Requests/sec: 64543.1936
Latency (worst case): 0.0044 secs
-------------------------------
Starting memory: 12436 KB
Ending memory: 12412 KB
Memory leaked: -24 KB
===============================
Iteration: 9 of 10
-------------------------------
Requests/sec: 65390.4177
Latency (worst case): 0.0041 secs
-------------------------------
Starting memory: 12412 KB
Ending memory: 12596 KB
Memory leaked: 184 KB
===============================
Iteration: 10 of 10
-------------------------------
Requests/sec: 64624.9731
Latency (worst case): 0.0042 secs
-------------------------------
Starting memory: 12596 KB
Ending memory: 11808 KB
Memory leaked: -788 KB
===============================
----------------------------------------
Settings:
----------------------------------------
Iterating 10 times
Loading 127.0.0.1:3030
Running for 3s each iteration
Using 100 workers
----------------------------------------
===============================
Iteration: 1 of 10
-------------------------------
Requests/sec: 72250.3533
Latency (worst case): 0.0038 secs
-------------------------------
Starting memory: 11808 KB
Ending memory: 11624 KB
Memory leaked: -184 KB
===============================
Iteration: 2 of 10
-------------------------------
Requests/sec: 70754.5880
Latency (worst case): 0.0038 secs
-------------------------------
Starting memory: 11624 KB
Ending memory: 11512 KB
Memory leaked: -112 KB
===============================
Iteration: 3 of 10
-------------------------------
Requests/sec: 65921.8890
Latency (worst case): 0.0046 secs
-------------------------------
Starting memory: 11512 KB
Ending memory: 11608 KB
Memory leaked: 96 KB
===============================
Iteration: 4 of 10
-------------------------------
Requests/sec: 64142.0923
Latency (worst case): 0.0043 secs
-------------------------------
Starting memory: 11608 KB
Ending memory: 11408 KB
Memory leaked: -200 KB
===============================
Iteration: 5 of 10
-------------------------------
Requests/sec: 63555.2797
Latency (worst case): 0.0044 secs
-------------------------------
Starting memory: 11408 KB
Ending memory: 11760 KB
Memory leaked: 352 KB
===============================
Iteration: 6 of 10
-------------------------------
Requests/sec: 57089.7501
Latency (worst case): 0.0058 secs
-------------------------------
Starting memory: 11760 KB
Ending memory: 12104 KB
Memory leaked: 344 KB
===============================
Iteration: 7 of 10
-------------------------------
Requests/sec: 48886.2883
Latency (worst case): 0.0069 secs
-------------------------------
Starting memory: 12104 KB
Ending memory: 11560 KB
Memory leaked: -544 KB
===============================
Iteration: 8 of 10
-------------------------------
Requests/sec: 48431.9411
Latency (worst case): 0.0065 secs
-------------------------------
Starting memory: 11560 KB
Ending memory: 12204 KB
Memory leaked: 644 KB
===============================
Iteration: 9 of 10
-------------------------------
Requests/sec: 56847.6384
Latency (worst case): 0.0058 secs
-------------------------------
Starting memory: 12204 KB
Ending memory: 11816 KB
Memory leaked: -388 KB
===============================
Iteration: 10 of 10
-------------------------------
Requests/sec: 63177.3175
Latency (worst case): 0.0047 secs
-------------------------------
Starting memory: 11816 KB
Ending memory: 11936 KB
Memory leaked: 120 KB
===============================
Furthermore, when attaching a tool like Heaptrack to warp's process, you can see there are no memory leaks:
of 121.25 GB allocated, only 2.6 MB is reported as leaked, and when you check its location, you can see it's the Buffered buffers.
(Tests were run using the default allocator on Linux 5.6.12-arch1-1.)
@jxs This is great. Let's not argue over RSS; it is a measurement, and you can clearly see reductions on your side. It's obviously not the best measurement.
In my testing, I absolutely never see a reduction in memory (I've tested as high as 5000 iterations), whereas your test clearly shows negative values.
Can you please detail your machine testing environment?
As a note, I'm using macOS... heaptrack will only run on Linux...
But if we were to set up a Dockerfile with arch (or likely just the standard rust:latest), I wonder if I would see a similar output.
Hello there!
I have the same problem with async-graphql + warp. On Linux, memory doesn't return to the system, but if I run the program on Windows, memory returns to normal (Docker Desktop on Windows doesn't return memory either). jemalloc doesn't solve the problem.
I'm facing the same problem with tonic.
This can be reproduced on Ubuntu, but not on Windows.
hyperium/tonic#797
Any updates on this one? It's still relevant please
@oronsh My general impression is that the maintainers don't feel there is an issue here. In other words, we (as a community) still have not proven that there is a problem and we have not been able to provide a repeatable environment to show the issue.
And as a note, this issue is closed... not to say it couldn't be reopened, but I think we have to provide clear, repeatable evidence rather than screenshots.
I can reproduce similar behavior with the system allocator as well as mimalloc; but if we're going to fix this, we'll need to get a lot more detailed about the specific behavior that exacerbates the problem.
I'm curious if this can be reproduced in an application that uses Tokio but not Hyper. Hyper is going to spawn a task for every connection (basically). If we write a program that spawns 10s of thousands of tasks concurrently and then sits for some amount of time, does the OS reclaim memory or does the process's RSS remain at its peak?
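A rough sketch of that tokio-only experiment (assumptions: tokio 1.x with the full feature set; the task count, buffer size, and sleep durations are arbitrary):

use std::time::Duration;

#[tokio::main]
async fn main() {
    // Spawn a large number of tasks that each hold a buffer for a while,
    // standing in for per-connection state.
    let handles: Vec<_> = (0..50_000)
        .map(|_| {
            tokio::spawn(async {
                let buf = vec![0u8; 8 * 1024];
                tokio::time::sleep(Duration::from_secs(5)).await;
                buf.len()
            })
        })
        .collect();

    // Wait for every task to finish, dropping all of their buffers.
    for handle in handles {
        handle.await.unwrap();
    }

    // Sit idle and watch the process RSS: does it fall back down, or does it
    // stay at its peak even though everything has been freed?
    tokio::time::sleep(Duration::from_secs(600)).await;
}

Watching the process RSS during the final sleep would show whether the freed task and buffer memory is returned to the OS without hyper in the picture.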
It seems related to the allocator. After I switched to jemalloc, memory decreases when the client side sends fewer requests (the decrease is triggered by sending new but fewer requests; if no new requests are sent, memory won't decrease).
@whfuyn I switched to jemalloc too, but I do see a small problem: if, for example, I send a lot of connections at the same time, the memory increases a bit... It could be 3 MB, it could be 10 MB or more.
@oronsh My general impression is that the maintainers don't feel there is an issue here. In other words, we (as a community) still have not proven that there is a problem and we have not been able to provide a repeatable environment to show the issue.
And as a note, this issue is closed... not to say it couldn't be reopened, but I think we have to provide clear, repeatable evidence rather than screenshots.
Yes maybe it's kind of an edge case and maybe people are okay with restarting their server once in a while...
It's really up to us to prove it...
I can reproduce similar behavior with the system allocator as well as mimalloc; but if we're going to fix this, we'll need to get a lot more detailed about the specific behavior that exacerbates the problem. I'm curious if this can be reproduced in an application that uses Tokio but not Hyper. Hyper is going to spawn a task for every connection (basically). If we write a program that spawns 10s of thousands of tasks concurrently and then sits for some amount of time, does the OS reclaim memory or does the process's RSS remain at its peak?
I also think that it's more of a tokio issue than a hyper one, and it will be very interesting to try what you've suggested. I noticed that it happens to me only when I'm doing some expensive operation, like serializing a huge payload before sending it back to the client.
My case is that on requests, I fetch some huge payload from some other server (with reqwest).
If I don't do anything with the result and just send "hello" back to the client, for example, the memory stays alright.
If I try to deserialize the payload with serde_json, the memory stacks up little by little with every request.
Something like this:
pub async fn get_data_route(Extension(client): Extension<Arc<MyClient>>) -> impl IntoResponse {
    // fetching huge payload
    let res = client
        .body(...)
        .send()
        .await
        .unwrap();

    // without this part the memory seems to be stable
    let data = res.json::<Foo>().await.unwrap();

    (StatusCode::OK, Json("hello world"))
}
Or it could be an issue with hyper here: https://github.com/hyperium/hyper/blob/master/src/body/to_bytes.rs#L46 since it's being called from the json method.
It could also be that there's a general problem with spawning tasks; I'm wondering how it works with tokio_uring.
I'll just share some of the data we have, as it may be helpful in narrowing down a smaller repro. I have some load tests for linkerd-proxy. The proxy itself is pretty complicated and probably doesn't serve as the best repro -- we'll want something a lot smaller.
We have a load test instrumented like so:
[ http/1 load generator ] -> [ client proxy (http/1 hyper server) ] -> [ client proxy (h2 hyper client) ] ->
-> [ server proxy (h2 hyper server) ] -> [ server proxy (http/1 hyper client) ] -> [ http/1 test server ]
When running the client with steadily increasing (and then resetting) load we see memory use grow steadily, basically in lock-step with the number of server connections. Memory is not released when load resets:
On the server-side, however, memory usage basically remains steady.
The server uses far fewer TCP connections, but it also creates far fewer Hyper servers.
I can instrument this test to run with pure TCP connections and no Hyper servers/clients. I'll report back when I have some data points there. At the moment, it appears as if our memory costs are correlated with instantiation of per-connection Hyper servers, though... That might be something to focus on as we try to get a narrower reproduction.
I've changed the test scenario so that we use a single proxy:
[ http/1 load generator ] ->
-> [ server proxy (http/1 hyper server) ] -> [ server proxy (http/1 hyper client) ] -> [ http/1 test server ]
And in this scenario:
[ http/1 load generator ] ->
-> [ server proxy (TCP tokio server) ] -> [ server proxy (TCP tokio client) ] -> [ http/1 test server ]
We can see in both cases that the behavior is roughly similar: memory usage never decreases. The same load profile is used in both cases, so the proxy is transporting the same number of raw bytes. Though, in the HTTP case we use dramatically more memory -- presumably due to more allocation.
In this case, we're using mimalloc, though we've seen similar behavior with the system allocator (and, I believe, jemalloc as well).
Looking at the /proc filesystem confirms that all of this is heap allocated:
:; sudo cat /proc/$pid/smaps | grep heap -A11
56023f5e2000-56026092c000 rw-p 00000000 00:00 0 [heap]
Size: 544040 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Rss: 541300 kB
Pss: 541300 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 541300 kB
Referenced: 541300 kB
Anonymous: 541300 kB
I'll try to get this running through heaptrack, though in the past it has shown wildly different results (seemingly underreporting used memory).
Here's where things get weird. Using the system allocator and using heaptrack we see the 'leaky behavior' as usual:
But heaptrack reports a very different picture:
Heaptrack claims that there are only 5.4MB of 'leaked' allocations remaining at the end of the run, while the procinfo crate (and manually inspecting the status in /proc) reports 535MB of resident memory.
I'm assuming this indicates that this memory is actually freed, but the operating system doesn't recognize that?
I suppose I'll test jemalloc for completeness...
High performance memory allocators are built to allocate memory quickly, not to keep RSS at the minimum possible value. They generally operate under the assumption that if you have allocated a bunch of memory recently, you will want to allocate a bunch of memory again in the future and so hold onto the pages for some period of time.
If you actually want to look for leaks you should be asking the allocator how many bytes it has actually allocated to the program. In jemalloc, this can be done with the jemalloc-ctl crate: https://docs.rs/jemalloc-ctl/0.3.3/jemalloc_ctl/stats/struct.allocated.html.
If it does appear that the application is explicitly holding onto allocated memory for too long, jemalloc also has profiling features you can use to see what sections of code are performing those allocations: https://github.com/jemalloc/jemalloc/wiki/Use-Case%3A-Leak-Checking
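For example, a minimal sketch of reading the allocator's own view with the jemalloc-ctl crate (assuming the jemallocator and jemalloc-ctl crates are dependencies and jemalloc is the global allocator) might look something like this:

use jemalloc_ctl::{epoch, stats};

#[global_allocator]
static GLOBAL: jemallocator::Jemalloc = jemallocator::Jemalloc;

fn main() {
    // Many of jemalloc's statistics are cached; advancing the epoch refreshes them.
    epoch::advance().unwrap();

    // Bytes the program has allocated and not yet freed.
    let allocated = stats::allocated::read().unwrap();
    // Bytes of physical memory jemalloc is holding (roughly what shows up as RSS).
    let resident = stats::resident::read().unwrap();

    println!("allocated: {} bytes, resident: {} bytes", allocated, resident);
}

Comparing "allocated" against "resident" after a load spike would distinguish memory the program still holds from memory the allocator is merely retaining.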
When I tried jemalloc configured with JEMALLOC_SYS_WITH_MALLOC_CONF=background_thread:true,narenas:1,tcache:false,dirty_decay_ms:0,muzzy_decay_ms:0,abort_conf:true, the memory was released just fine, but I can still see a small increase over time with a lot of requests.
As stated before, all updates are in the issue.
I have not been able to reproduce it myself. It's difficult to pinpoint that this is actually leaked memory; it can also be a case of the allocator not releasing memory back to the OS. Does the behavior exist if you use jemalloc instead of the system allocator?
I'm inclined to close this (really this time) unless someone would like to dig more into this themselves, as I can't trigger it.
@seanmonstar I ran into this issue with Tonic 0.7.2 using hyper 0.14.19 under tokio 1.19.2. The behavior was exactly the same with the system allocator and with jemalloc: the more load (requests/second) you put on the server, the more memory it allocates. That memory is not freed. Decreased load is handled without further allocation, but once the load increases above the former level, additional memory gets allocated.
Did anyone ever resolve this?
Was there any mitigation for this?
I encountered the same problem as @hseeberger. Switching to another allocator did not help either.
I think the leak is pretty well established by this point; it's just that I don't think anyone (including myself) has had enough time to dig in and find the exact source, rather than settling for workarounds at the outer layers.
I think @olix0r made an interesting point above: whether this can be reproduced with just tokio. I'm not sure anyone has tried that. It would also be interesting to see the behavior across a fixed set of different worker thread pool sizes; I think that could help narrow down where the problem lies.
I don't think the issue should be closed without a solution.
I was having the same issue and switched to the MiMalloc allocator. It releases memory promptly: from 2GB under load back to 50MB when idle, in contrast to the default allocator, which peaks at 3GB and sits at 1.7GB when idle. MiMalloc not only consumes less memory but also performs better in my case; I benchmarked both on my Linux machine and in a Docker container limited to 1 CPU and 1GB of memory. I think I'm going to stick with MiMalloc for now.
@wonrax this is interesting. Was there any specific config you used for MiMalloc?
I recall seeing these earlier comments: #1790 (comment) by @olix0r, where mimalloc didn't seem to have helped, and many more comments related to jemalloc.
Curious whether this was a specific config, or whether mimalloc itself has had tweaks that produce better behavior.
I didn't set or tweak any config; the code only imported mimalloc and set it as the global allocator.
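For reference, that is just the standard global-allocator declaration (the same pattern appears commented out in the example further down):

```rust
use mimalloc::MiMalloc;

#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;
```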
- Test Preparation:
I tested it in a VMware virtual machine running Ubuntu 24.04 desktop, allocated 2.6GHz with 8 cores and 8GB of RAM.
Rust is the latest nightly, rustc 1.81.0-nightly (b5b13568f 2024-06-10), building in --release mode.
All terminals have the file descriptor limit raised with ulimit -n 65536.
The client accesses the server directly over HTTP in the same environment.
For the server side, I wrote the most basic axum 'hello' handler and tested the three memory allocators mentioned earlier: the default allocator, MiMalloc, and JeMalloc.
First, I benchmarked the default allocator with oha, adjusting the client concurrency; axum reached a maximum of 130,000+ QPS.
The plan is to use reqwest as the client to hit axum at 1000/10000 QPS, using tokio's interval to fire one batch per second, continuously for more than a minute (see the sketch after this section).
The server code is the official example, fully asynchronous tokio with no blocking code; the report prints once a second, which has essentially no effect on concurrency performance.
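A rough sketch of the load generator described above, assuming the reqwest and tokio ("full") crates; the URL, QPS, and duration are placeholders. A tokio interval fires one batch per second and a semaphore bounds the number of in-flight requests:

```rust
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Semaphore;

#[tokio::main]
async fn main() {
    let client = reqwest::Client::new();
    let qps = 1000; // requests dispatched per one-second tick
    let limit = Arc::new(Semaphore::new(qps)); // cap on concurrent in-flight requests
    let mut tick = tokio::time::interval(Duration::from_secs(1));

    for _ in 0..60 {
        tick.tick().await;
        for _ in 0..qps {
            let client = client.clone();
            // Dispatch stalls here if too many requests are still in flight.
            let permit = limit.clone().acquire_owned().await.unwrap();
            tokio::spawn(async move {
                let _ = client.get("http://127.0.0.1:3000/").send().await;
                drop(permit); // release the concurrency slot when the request ends
            });
        }
    }
}
```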
- Test Results:
At startup, MiMalloc uses a bit over 400K of memory, while the default allocator and JeMalloc both use less than 300K.
At a sustained 1000 QPS, after a few seconds MiMalloc sits around 30+M, while the others are around 50~60M.
At a sustained 10000 QPS, after a few seconds MiMalloc sits around 300+M, while the others are around 500~600M.
When the load test ends, all allocator modes keep their high memory usage and do not release memory on their own in the short term if no new requests arrive.
However, sending requests at a low frequency (intermittently, 1~100 QPS) gradually releases memory.
- Approximate status at each stage of one test run, after sustained 1000 QPS, sustained 10000 QPS, and then intermittent low-frequency requests:
| Allocator | 0 QPS | 1000 QPS | 10000 QPS | 10 QPS |
| --- | --- | --- | --- | --- |
| Default | 300K | 60M | 600M | 560M |
| MiMalloc | 500K | 35M | 350M | 9M |
| JeMalloc | 300K | 56M | 560M | 30M |
- Speculation on the cause of memory issues:
First of all, it must be admitted that axum/hyper does have a memory allocation problem: after multiple rounds of requests, it does not return to a memory state close to the initial one.
Under the default allocator, hitting axum with reqwest at 1000 QPS for two hours (about 7,200,000 requests), memory is still around 55~60M.
Memory is not correctly reclaimed; this appears to be related to the number of concurrent requests and has little to do with the total number of requests.
MiMalloc and JeMalloc both release memory better than the default allocator; although their final memory states still differ, both can be considered within a bearable range.
- Some other issues encountered during the test:
Since the client and server run in the same environment, CPU overload and port congestion affect both at the same time.
Although reqwest uses tokio's semaphore to control concurrency, if the instantaneous concurrency is too high the CPU saturates and the client generates a large number of send errors. The specific reason is unknown; my guess is that the TCP ports are exhausted.
In this environment, reqwest can send about 28,000 requests within a few seconds without errors, but after that a large number of send errors appear, which clogs the system's TCP channels; even though the server is not down, it cannot serve requests.
Ultimately, I decided to test at a maximum of 10,000 QPS, which the server can handle within a second, completing all transactions in about 200~300 ms.
As long as the CPU is not overloaded, there is basically no big difference in server-side throughput across the allocators.
But even at 10000 QPS, after running for a while the CPU can still become overloaded and requests start to queue up.
Under sustained 10000 QPS:
The default allocator saturates the CPU after about ten seconds and starts reporting send errors;
MiMalloc saturates the CPU after about thirty seconds and starts reporting send errors;
JeMalloc saturates the CPU after about eighty seconds and starts reporting send errors.
When the CPU is saturated, the axum server becomes unreachable; after the client stops sending requests, the server returns to normal within a few seconds.
- Conclusion:
Based on this test run, even if there is a memory allocation problem, memory does not appear to grow indefinitely.
MiMalloc performed better at memory allocation and release during the test.
JeMalloc performed better on CPU usage while still releasing memory reasonably well.
@prasannavl, modifying the memory allocator isn't complicated in itself, but I'm unsure about the feasibility of integrating it into the hyper framework.
Cargo.toml:
[dependencies]
axum = "0.7.5"
tokio = { version = "1.38.0", features = ["full"] }
mimalloc = "0.1.42"
jemallocator = "0.5.4"
src/main.rs:
...
use axum::Extension;
use axum::{routing::get, Router, extract::State};
use axum::Json;
use tokio;
// use mimalloc::MiMalloc;
// #[global_allocator]
// static GLOBAL: MiMalloc = MiMalloc;
// #[cfg(not(target_env = "msvc"))]
// use jemallocator::Jemalloc;
// #[cfg(not(target_env = "msvc"))]
// #[global_allocator]
// static GLOBAL: Jemalloc = Jemalloc;
#[tokio::main]
async fn main() {
...
}
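For completeness, a runnable version of the elided main, assuming axum 0.7's serve API and the basic "hello" route described above; the bind address and route path are placeholders:

```rust
use axum::{routing::get, Router};

#[tokio::main]
async fn main() {
    // Most basic "hello" handler, matching the test description above.
    let app = Router::new().route("/", get(|| async { "hello" }));
    let listener = tokio::net::TcpListener::bind("0.0.0.0:3000").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}
```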
I have the same issue.
I created the most basic service and left it under load (3000 rps) for a few days.
The service was consuming more and more memory and wouldn't stop.
I tried mimalloc, but it didn't help.
Strictly speaking, this is not a bug or a memory leak; it is the system's memory management strategy. Memory requested with malloc is, after free, merged into larger free blocks and is not necessarily returned to the operating system right away (see the implementation of free for details). malloc_trim hands those free blocks back to the OS:
unsafe {
libc::malloc_trim(0);
}
Switching the allocator also works.
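Expanded into a background task, the malloc_trim approach might look like the sketch below. This only applies to glibc's malloc (malloc_trim is a glibc extension exposed through the libc crate), and the 30-second interval is arbitrary:

```rust
use std::{thread, time::Duration};

// Periodically ask glibc's malloc to return freed pages to the OS.
fn spawn_malloc_trim_task() {
    thread::spawn(|| loop {
        thread::sleep(Duration::from_secs(30));
        unsafe {
            // 0 = keep no extra padding at the top of the heap
            libc::malloc_trim(0);
        }
    });
}

fn main() {
    spawn_malloc_trim_task();
    // ... start the hyper/axum server as usual ...
}
```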