webrtc-rs/webrtc

File descriptor (socket) leak

qbx2 opened this issue · 14 comments

qbx2 commented

Sockets opened in ICEGatherer appear to be leaked: they are never closed, even after the PeerConnections themselves are closed. Eventually my server becomes unavailable with a "too many open files" error.

I've looked into webrtc's source code and found that CandidateBase.close() is a no-op. That is because tokio's UdpSocket does not provide a close() method; instead, the socket is closed when it is dropped. So my guess is that the socket is never dropped.

RTCPeerConnection holds an Arc<PeerConnectionInternal>, and there are other holders too. In v0.2.1 the peer connection itself is fine and is dropped as expected, but the other holders do not seem to release it. I don't know exactly where they all are, but internal_rtcp_writer (peer_connection.rs, line 170) or pci (peer_connection.rs, lines 1163, 1345, 1393) may have created a reference cycle.

If possible, those references should be replaced by std::sync::Weak to break the cycle. Pion and other webrtc libraries written in garbage-collected languages do not suffer from this issue because a tracing GC can detect and free circular references. Rust has no such GC, so we have to use weak references to avoid the problem. This would fix other memory leaks as well.
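To illustrate with a minimal self-contained sketch (plain std types, not the actual webrtc-rs structs): a back-reference held as Weak does not keep its target alive, so the cycle is broken and both nodes can be freed.

    use std::sync::{Arc, Mutex, Weak};

    struct Node {
        // holding this as Arc<Node> would create a cycle that is never
        // freed; Weak breaks it
        other: Mutex<Weak<Node>>,
    }

    fn main() {
        let a = Arc::new(Node { other: Mutex::new(Weak::new()) });
        let b = Arc::new(Node { other: Mutex::new(Arc::downgrade(&a)) });
        *a.other.lock().unwrap() = Arc::downgrade(&b);

        // dropping the last Arc actually frees the node, because the
        // Weak back-reference held by `b` does not keep it alive
        drop(a);
        assert!(b.other.lock().unwrap().upgrade().is_none());
    }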

Thanks for reporting this issue.

Based on your information, I ran Valgrind on the data-channels-flow-control example, and it detected at least two leaks, including:
webrtc_sctp::association::Association::new
webrtc::api::API::new_ice_gatherer

I will fix these leaks ASAP

I was thinking of checking Arc usage too. In another project of mine (using GStreamer for WebRTC), Arc downgrade/upgrade solved the cycle problem.

I think I have made some progress on fixing the reference-cycle-caused memory leaks.

reflect and data-channels-flow-control in the examples v0.3.0 branch should be memory-leak free now.

I have checked using a nightly Rust toolchain with the following build:

    RUSTFLAGS="-Z sanitizer=leak" cargo build --example reflect

It reports no memory leak.

I will keep checking and fixing the other examples to make sure there are no reference-cycle-caused memory leaks.
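For reference, the kind of bug this build flags can be reproduced with a deliberately cycled Arc; a minimal sketch (not webrtc-rs code):

    use std::sync::{Arc, Mutex};

    struct Node {
        next: Mutex<Option<Arc<Node>>>,
    }

    fn main() {
        let a = Arc::new(Node { next: Mutex::new(None) });
        let b = Arc::new(Node { next: Mutex::new(Some(Arc::clone(&a))) });
        // close the cycle: a -> b -> a
        *a.next.lock().unwrap() = Some(Arc::clone(&b));
        // after main's handles drop, each node still keeps the other's
        // strong count at 1, so neither allocation is ever freed; the
        // leak sanitizer reports both
    }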

qbx2 commented

Unfortunately, I don't think the file descriptor leak is fixed yet (on the master branch). I am going to look further into it on Monday.

All examples in the v0.3.0 branch of the examples repo passed the rc-cycle check. Please note that this requires some changes to callback functions:

For example, here we have to use a Weak pointer to peer_connection in its on_track callback; otherwise it causes a reference-cycle leak.

let pc = Arc::downgrade(&peer_connection);
peer_connection
        .on_track(Box::new(
            move |track: Option<Arc<TrackRemote>>, _receiver: Option<Arc<RTCRtpReceiver>>| {
                // inside the callback, use pc.upgrade() instead of
                // capturing peer_connection itself
                ...

Could you check whether your code has any such callbacks?
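The general shape of the fix, independent of the exact webrtc-rs callback signatures (a minimal sketch with a plain closure):

    use std::sync::{Arc, Weak};

    struct Pc; // hypothetical stand-in for RTCPeerConnection

    fn main() {
        let peer_connection = Arc::new(Pc);

        // capture a Weak instead of the Arc, so the callback does not keep
        // the peer connection alive while the peer connection owns the callback
        let pc: Weak<Pc> = Arc::downgrade(&peer_connection);
        let callback = move || {
            if let Some(_pc) = pc.upgrade() {
                // peer connection is still alive: safe to use it here
            }
            // upgrade() returning None means it has already been dropped
        };

        callback();
        drop(peer_connection); // last strong reference: Pc is freed
    }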

qbx2 commented

RTCPeerConnection is dropped correctly, and the file descriptors are closed in the normal case. They seem to be leaked only when the connection fails.

Steps to reproduce:

  1. I added the following code to CandidateBase.close() (line 255):

        if let Some(conn) = &self.conn {
            let _ = conn.close().await;

            use std::sync::Weak;
            use std::time::Duration;

            // watch the socket's Arc: once its strong count reaches zero,
            // the conn has been dropped and the fd is actually closed
            let conn_weak = Arc::downgrade(conn);
            tokio::spawn(async move {
                let mut int = tokio::time::interval(Duration::from_secs(1));
                loop {
                    int.tick().await;
                    dbg!(Weak::weak_count(&conn_weak));
                    if dbg!(Weak::strong_count(&conn_weak)) == 0 {
                        break;
                    }
                }
            });
        }

  2. Signal (offer) from the other peer and forcefully kill the offer-side process.
  3. My program spams this log for a while:
    [2021-11-01T04:17:16Z TRACE webrtc_ice::agent::agent_internal] [controlled]: pinging all candidates
  4. The ICE connection state changes to failed.
  5. conn_weak still has a strong_count of 1:
[/Users/qbx2/ice/src/candidate/candidate_base.rs:263] Weak::weak_count(&conn_weak) = 1
[/Users/qbx2/ice/src/candidate/candidate_base.rs:264] Weak::strong_count(&conn_weak) = 1

I reproduced using ~50 clients.

@qbx2, I tried to reproduce this issue in the ice crate with the ping_pong example. Judging from the new ping_pong example, the ice crate does not look like the root cause of this leak.

Therefore, I think the leak may come from the webrtc crate itself, so I fixed one possible issue in 0944adf, where the endpoint forgot to call the underlying conn.close().
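To make that failure mode concrete, a minimal self-contained sketch (Endpoint here is a hypothetical stand-in, not the actual webrtc-rs type): as long as a wrapper keeps its strong reference, the socket's fd stays open no matter how "closed" the wrapper believes it is.

    use std::sync::Arc;
    use tokio::net::UdpSocket;

    struct Endpoint {
        conn: Option<Arc<UdpSocket>>,
    }

    impl Endpoint {
        fn close(&mut self) {
            // The bug shape: returning from close() without touching
            // self.conn leaves this Arc in place, so the UDP socket
            // (and its fd) outlives the "closed" endpoint.
            self.conn.take(); // the fix: drop our strong reference
        }
    }

    #[tokio::main]
    async fn main() -> std::io::Result<()> {
        let sock = Arc::new(UdpSocket::bind("127.0.0.1:0").await?);
        let mut ep = Endpoint { conn: Some(sock) };
        ep.close(); // last Arc dropped here => fd released to the OS
        Ok(())
    }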

Could you try again?

In addition, to make this issue easier to reproduce: do you think any example in https://github.com/webrtc-rs/examples could be hacked to reproduce it? With the same codebase, it is much easier to debug.

Thank you.

qbx2 commented

I tried webrtc on the master branch, but had no luck. :(
Also, I am sorry about the codebase: my code is company-internal, so I cannot share the details. I am going to put together an example for reproduction to share with you. (I guess it is difficult to reproduce with the existing examples because my code uses a different signaling protocol?)

qbx2 commented

@rainliu Please, check this out: qbx2/examples@5f5c3a5

@qbx2, thanks for this example, I can reproduce this issue.

@qbx2, I think I found the root cause of the bug. e7d72d2 should fix this issue.

The root cause is that the cancel_tx signal must be stored in internal.cancel_tx before calling agent.dial or agent.accept; otherwise agent.close() has no effect on cancelling the in-flight dial/accept.

            let (cancel_tx, cancel_rx) = mpsc::channel(1);
            {
                // store the sender first: this is what agent.close() uses,
                // so it must be in place before dial/accept starts awaiting
                let mut internal = self.internal.lock().await;
                internal.role = role;
                internal.cancel_tx = Some(cancel_tx);
            }

            let conn: Arc<dyn Conn + Send + Sync> = match role {
                RTCIceRole::Controlling => {
                    agent
                        .dial(
                            cancel_rx,
                            params.username_fragment.clone(),
                            params.password.clone(),
                        )
                        .await?
                }
                RTCIceRole::Controlled => {
                    agent
                        .accept(
                            cancel_rx,
                            params.username_fragment.clone(),
                            params.password.clone(),
                        )
                        .await?
                }
                _ => return Err(Error::ErrICERoleUnknown),
            };
In your example, you still need to send the done signal so that peer_connection.close().await gets called; otherwise it never calls self.internal.ice_transport.stop().await, which is what cancels agent.dial/accept.

peer_connection
        .on_peer_connection_state_change(Box::new(move |s: RTCPeerConnectionState| {
            println!("Peer Connection State has changed: {}", s);
            if s == RTCPeerConnectionState::Failed {
                ...
                println!("Peer Connection has gone to failed exiting");
                // let _ = done_tx.try_send(());
            }
...
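For completeness, a minimal self-contained sketch of that plumbing (Pc is a hypothetical stand-in for RTCPeerConnection): the callback only signals, and the task that owns the connection performs the actual close.

    use std::sync::Arc;
    use tokio::sync::mpsc;

    struct Pc; // hypothetical stand-in for RTCPeerConnection
    impl Pc {
        async fn close(&self) {
            println!("closing: ICE transport stops, sockets are released");
        }
    }

    #[tokio::main]
    async fn main() {
        let peer_connection = Arc::new(Pc);
        let (done_tx, mut done_rx) = mpsc::channel::<()>(1);

        // in the real example this try_send sits inside
        // on_peer_connection_state_change when the state hits Failed
        let _ = done_tx.try_send(());

        // the owning task waits for the signal and closes explicitly
        let _ = done_rx.recv().await;
        peer_connection.close().await;
    }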

By the way, in your 50-client case, does each client have its own peer_connection? If so, once the following callbacks report Failed, you may need to close that client's peer_connection.

peer_connection.on_ice_connection_state_change

or

peer_connection.on_peer_connection_state_change

qbx2 commented

In your example, you still need to send the done signal so that peer_connection.close().await gets called; otherwise it never calls self.internal.ice_transport.stop().await, which is what cancels agent.dial/accept.

You're right. The reason I commented that line out was to keep the process alive so that the leak could be confirmed. Thanks a lot for the advice; I am taking care to close the peer_connection.

By the way, in your 50-client case, does each client have its own peer_connection? If so, once the following callbacks report Failed, you may need to close that client's peer_connection.

Yes, each client has one. However, the peer_connections are designed to be dropped when they fail to connect (I use Weak). That should be fine, right?
Also, I've confirmed that e7d72d2 fixes this issue. Thank you so much.

Thanks for confirming. I just released v0.3.1 to include this fix.