bitcoin-dev-project/sim-ln

High CPU usage in between payments

Closed this issue · 3 comments

Describe the bug

When sim-ln is waiting until the next payment, the CPU usage jumps to 100%.

I believe this can be traced back to the produce_simulation_results function. It contains a loop with a tokio::select! statement that has 3 branches. The last branch is the following:

track_payment = set.join_next() => {
    if let Some(res) = track_payment {
        match res {
            Ok(track_payment_res) => {
                track_payment_res?
            },
            Err(_) => {
                return Err(SimulationError::TaskError);
            },
        }
    }
}

The set.join_next() function documentation says it "Returns None if the set is empty." When sim-ln is not processing payments the set will be empty, so the join_next() function will continuously return None.

The tokio::select! docs say "Waits on multiple concurrent branches, returning when the first branch completes, cancelling the remaining branches."

Because the set.join_next() branch is immediately returning None, it will be the first branch to complete and the loop will iterate. This leads to a high speed loop always running while sim-ln waits for another payment.

To Reproduce

Run a simple simulation that has extended time between payments.

Config

Running a simple simulation: 2 nodes, 1 channel, random payment activity.

@carlaKC @enigbe Have either of you experienced this issue? I am wondering if produce_simulation_results should not be using join_next() since it returns None immediately most of the time.

I am wondering if produce_simulation_results should not be using join_next() since it returns None immediately most of the time.

Definitely not, nice catch! Would be nice to wait for all the track_payment_res to clean up before we exit this function - in golang-land I'd remove them from the loop and just defer waiting, afaik rust doesn't have something like this (closest is custom impl of drop perhaps?)

I am wondering if produce_simulation_results should not be using join_next() since it returns None immediately most of the time.

Definitely not, nice catch! Would be nice to wait for all the track_payment_res to clean up before we exit this function - in golang-land I'd remove them from the loop and just defer waiting, afaik rust doesn't have something like this (closest is custom impl of drop perhaps?)

I moved the join_next() outside of the tokio::select! statement and this seemed to fix the issue. I opened a PR, but there might be a better way to do it.

I also added some error checks in the run function that help clean up tasks if that function errors out early for some reason. I was running into an issue where activity_executors was throwing an error and returning, but because the data collection tasks had already started they were not getting shutdown cleanly.

Let me know what you think on that PR, open to feedback... this was just my first draft of fixes to get past those issues.