paritytech/substrate

Memory leak after node finishes syncing with relay chain

Closed this issue · 12 comments

NZT48 commented

Is there an existing issue?

  • I have searched the existing issues
    These could potentially be related issues, since they are recent and also involve memory problems, but this is not the same: no transactions were executed, and the memory leak starts only after syncing is finished:
  • #11513
  • #11503

Experiencing problems? Have you tried our Stack Exchange first?

  • This is not a support question.

Description of bug

After syncing with Polkadot finishes, the parachain node's memory usage grows steadily until it reaches the maximum, and then the relay chain part of the node stops working (only parachain logs are visible, no relay chain logs, and no connection to the relay chain).

The image below shows how memory usage starts growing after syncing finishes; the node consumes around 32GB of memory in 24 hours:
[image: memory_usage]

Server specs:

  • Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz 16 cores
  • 32GB RAM
  • Ubuntu 20.04

Additional information:

  • The issue happens when Polkadot 0.9.22 or Polkadot 0.9.23 is used as the dependency for the node part
  • The runtime uses Polkadot 0.9.18, because that was the latest version when the crowdloan was started
  • This issue existed before and after parachain block production started.
  • The node is started with the --execution wasm flag for both the parachain and the relay chain part
  • Using the following flags didn't fix the issue: --db-cache 0 and --pruning=1000
  • Happening to all nodes connected to Polkadot; hard to reproduce, seen only twice on nodes connected to Rococo, and never on nodes connected to a self-hosted Rococo relay chain
  • Happening to all types of nodes: Collator, Archive, Full, RPC

When the node starts, the error logs from the issue on the cumulus repo are logged, but I guess that is not related to this issue.

Steps to reproduce

Compile code from this branch.
Chainspec available here.

But it would be good to get in contact about the details of connecting, because this is a parachain connected to Polkadot.

bkchr commented

Is this reproducible every time, regardless of the machine configuration etc.?

NZT48 commented

Yes, it is happening on all our machines, which have different CPUs; we tried machines with 8-32GB of RAM, using either an SSD or an external volume for storage.

@riusricardo tried to reproduce it on his laptop, but I'm not sure about the results.

We are experiencing the same issue at Composable. All the machines, regardless of configuration. Only happening since v0.9.22 though

I have seen a similar situation after the most recent Substrate upgrade of our chain, on one of several nodes on my machine (it consumed over 65G of RAM before I killed it; after a restart the memory grew again, so I stopped it permanently), but the rest of the nodes on the same machine and on others are somehow fine.

koute commented

I'm trying to reproduce the issue but so far no luck.

If anyone can reproduce this I'd appreciate if you could do the following:

  1. Compile your node binary with debug symbols. (Open Cargo.toml and add debug = true to [profile.release], then run cargo build --release; this might not be necessary, but it's better to have it than not.)
  2. Download and unpack it into the same directory where your node binary is: https://github.com/koute/bytehound/releases/download/0.8.0/bytehound-x86_64-unknown-linux-gnu.tgz
  3. Run this:
export MEMORY_PROFILER_LOG=info
export MEMORY_PROFILER_CULL_TEMPORARY_ALLOCATIONS=1
export MEMORY_PROFILER_LOGFILE=memory-profiling_%e_%t_%p.log
export MEMORY_PROFILER_TEMPORARY_ALLOCATION_LIFETIME_THRESHOLD=60000
export MEMORY_PROFILER_TEMPORARY_ALLOCATION_PENDING_THRESHOLD=655360
export MEMORY_PROFILER_GRAB_BACKTRACES_ON_FREE=1
export MEMORY_PROFILER_OUTPUT=memory-profiling_%e_%t_%p.dat
LD_PRELOAD=./libbytehound.so ./your-node your-args

Replace the ./your-node your-args with the command line invocation you use to start your node.

  4. Check that a memory-profiling_*.dat file was created in the same directory and that it's growing in size over time. If it isn't, you did something wrong.
  5. Wait until it runs out of memory.
  6. Upload the memory-profiling_* files somewhere and send me the link.

I recommend doing this on a machine with a lot of memory and a few spare CPU cores, since the profiling also has extra overhead.

NZT48 commented

Thanks @koute, I started it yesterday and am now waiting for the node to run out of memory. Can you contact me on Element (nzt:matrix.org) so I can provide you with the download link for the files?

koute commented

@NZT48 Thank you for the data!

I have done a preliminary investigation of the problem; there's indeed a memory leak here:

[image: leak]

The leak is in an unbounded channel for the network service events; the other side of the channel stopped receiving these events (either due to a deadlock, or because it was leaked), so they just keep being pushed into the channel until the system runs out of memory.

From what I can see, the problem is the stream returned from here in substrate/client/network/src/service.rs:

        /// Returns a stream containing the events that happen on the network.
        ///
        /// If this method is called multiple times, the events are duplicated.
        ///
        /// The stream never ends (unless the `NetworkWorker` gets shut down).
        ///
        /// The name passed is used to identify the channel in the Prometheus metrics. Note that the
        /// parameter is a `&'static str`, and not a `String`, in order to avoid accidentally having
        /// an unbounded set of Prometheus metrics, which would be quite bad in terms of memory
        pub fn event_stream(&self, name: &'static str) -> impl Stream<Item = Event> {
                let (tx, rx) = out_events::channel(name);
                let _ = self.to_worker.unbounded_send(ServiceToWorkerMsg::EventStream(tx));
                rx
        }

In this case over 10 million events were pushed into this channel before the node crashed. We should probably put a hard cap on this channel, or at least spam with an error if it gets too big. A high enough cap of, say, 1 million should still make it effectively unbounded in normal cases (based on this data it'd have to leak ~2.5GB of data to reach this many items), but it'd make investigating such problems easier in the future.
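As a rough illustration of that idea (not the actual out_events implementation; the names and the threshold below are made up), a sender wrapper could track how many items are queued but not yet received and log an error once a threshold is crossed:

use std::sync::{
    atomic::{AtomicUsize, Ordering},
    Arc,
};

use futures::channel::mpsc;

// Hypothetical threshold: high enough to be effectively unbounded in normal operation.
const QUEUE_WARN_THRESHOLD: usize = 1_000_000;

// Hypothetical wrapper around the unbounded sender used for network events.
struct WatchedSender<T> {
    inner: mpsc::UnboundedSender<T>,
    // Incremented on every send; the receiving side would decrement it after
    // each item it pulls off the channel (omitted here for brevity).
    queued: Arc<AtomicUsize>,
}

impl<T> WatchedSender<T> {
    fn send(&self, item: T) {
        let queued = self.queued.fetch_add(1, Ordering::Relaxed) + 1;
        if queued == QUEUE_WARN_THRESHOLD {
            // In Substrate this would go through the usual logging/metrics;
            // a plain message is enough to show the idea.
            eprintln!(
                "event channel has {} undelivered items; is the receiver still being polled?",
                queued
            );
        }
        let _ = self.inner.unbounded_send(item);
    }
}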

Anyway, I'll continue investigating this.

koute commented

Here's why it's leaking:

  1. In cumulus/client/relay-chain-inprocess-interface/src/lib.rs we initialize the relay chain node like this, with enable_beefy unconditionally set to true:
let relay_chain_full_node = polkadot_service::build_full(
    // ...
    true, // This is `enable_beefy`.
    // ...
)?;
  2. In polkadot/node/service/src/lib.rs, if BEEFY is enabled we spawn the BEEFY gadget:
if enable_beefy {
    // ...
    let gadget = beefy_gadget::start_beefy_gadget::<_, _, _, _, _>(beefy_params);
    // ...
    task_manager.spawn_handle().spawn_blocking("beefy-gadget", None, gadget);
}
  3. In substrate/client/beefy/src/lib.rs the gadget initializes sc_network_gossip::GossipEngine:
let gossip_engine = sc_network_gossip::GossipEngine::new(
    network,
    protocol_name,
    gossip_validator.clone(),
    None,
);
  4. In substrate/client/network-gossip/src/bridge.rs the GossipEngine creates a stream of network events (this is the channel I've shown to be leaking):
let network_event_stream = network.event_stream();

and it only polls that event stream when the GossipEngine itself is polled. This means that if the GossipEngine is never polled, it will leak memory (a standalone sketch of this failure mode follows at the end of this comment).

  5. In substrate/client/beefy/src/worker.rs the BEEFY worker has the following run function:
pub(crate) async fn run(mut self) {
    info!(target: "beefy", "🥩 run BEEFY worker, best grandpa: #{:?}.", self.best_grandpa_block_header.number());
    self.wait_for_runtime_pallet().await;
    // ...
    // <-- The `GossipEngine` is polled here.

...so it waits for the BEEFY runtime pallet, but since the pallet doesn't actually exist, it ends up waiting forever, while the events in network_event_stream inside the GossipEngine keep accumulating.
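To make this failure mode concrete, here is a small standalone sketch (plain futures plus tokio, not Substrate code): the receiver of an unbounded channel sits behind a future that never resolves, standing in for wait_for_runtime_pallet().await, so every item the sender pushes stays buffered and memory grows without bound.

use futures::{channel::mpsc, future, StreamExt};

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::unbounded::<Vec<u8>>();

    // "BEEFY worker" stand-in: parked forever before it ever polls the stream,
    // just like waiting for a runtime pallet that does not exist.
    tokio::spawn(async move {
        future::pending::<()>().await; // never completes
        while let Some(_event) = rx.next().await {} // never reached
    });

    // "Network" stand-in: keeps emitting events; none of them are ever drained,
    // so this loop steadily consumes memory (deliberately, to demonstrate the leak).
    loop {
        let _ = tx.unbounded_send(vec![0u8; 1024]);
        tokio::task::yield_now().await;
    }
}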

koute commented

Fix available here: #11694

bkchr commented
  1. In cumulus/client/relay-chain-inprocess-interface/src/lib.rs we initialize the relay chain node like this, with enable_beefy unconditionally set to true:

@skunert please fix this. BEEFY should always be disabled (at least for now) for the in-process relay chain node.
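Based on the build_full call quoted earlier in this thread, the requested change would roughly amount to passing false for the BEEFY flag; a minimal sketch, assuming the same parameter position as shown above (the real fix may plumb this through a proper option instead):

// Sketch only, mirroring the call from cumulus/client/relay-chain-inprocess-interface/src/lib.rs shown above.
let relay_chain_full_node = polkadot_service::build_full(
    // ...
    false, // `enable_beefy`: keep the BEEFY gadget disabled for now.
    // ...
)?;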

irsal commented

Hey @bkchr - this ended up being a pretty gnarly issue for us (Turing Network on Kusama) when we upgraded to 0.9.23. https://github.com/OAK-Foundation/OAK-blockchain/wiki/2022-07-21:-Node-Memory-Leak-Issue

Was there any way we could've been notified of the implications for this version upgrade? Perhaps I'm just missing a notification channel or tag to track.

bkchr commented

@irsal sadly, there is no good channel for this at the moment.