jitsi/jitsi-videobridge

The Videbridge needs a long time to fix connection problems to the XMPP server.

freym opened this issue · 8 comments

freym commented

This Issue tracker is only for reporting bugs and tracking code related issues.

Description

We discovered that the videobridge takes a long time to recover and reconnect if there is an short error in the connection to the XMPP Server (= restart of Prosody, changes in our Network configuration).

Current behavior

JVB 2023-09-25 14:43:46.120 WARNUNG: [136] [hostname=REDACTED id=public-jitsi] MucClient$PingFailedListenerImpl.pingFailed#864: Ping failed, the XMPP connection needs to reconnect.
JVB 2023-09-25 14:43:46.120 WARNUNG: [136] [hostname=REDACTED id=public-jitsi] MucClient$PingFailedListenerImpl.pingFailed#869: XMPP connection still connected, will trigger a disconnect.
JVB 2023-09-25 14:43:50.665 SCHWERWIEGEND: [33] [hostname=REDACTED id=public-jitsi] MucClient$MucWrapper.setPresenceExtensions#802: Failed to send stanza:
org.jivesoftware.smack.SmackException$NotConnectedException: The connection XMPPTCPConnection[jvb@auth.meet.jitsi/un6dsi1n475V] (0) is no longer connected. done=true smResumptionPossible=false
at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketWriter.throwNotConnectedExceptionIfDoneAndResumptionNotPossible(XMPPTCPConnection.java:1242)
at org.jivesoftware.smack.tcp.XMPPTCPConnection.throwNotConnectedExceptionIfAppropriate(XMPPTCPConnection.java:351)
at org.jivesoftware.smack.AbstractXMPPConnection.sendStanza(AbstractXMPPConnection.java:858)
at org.jitsi.xmpp.mucclient.MucClient$MucWrapper.setPresenceExtensions(MucClient.java:798)
at org.jitsi.xmpp.mucclient.MucClient.lambda$setPresenceExtensions$5(MucClient.java:470)
at java.base/java.util.concurrent.ConcurrentHashMap$ValuesView.forEach(ConcurrentHashMap.java:4780)
at org.jitsi.xmpp.mucclient.MucClient.setPresenceExtensions(MucClient.java:470)
at org.jitsi.xmpp.mucclient.MucClient.setPresenceExtension(MucClient.java:454)
at org.jitsi.xmpp.mucclient.MucClientManager.lambda$setPresenceExtension$0(MucClientManager.java:164)
at java.base/java.util.concurrent.ConcurrentHashMap$ValuesView.forEach(ConcurrentHashMap.java:4780)
at org.jitsi.xmpp.mucclient.MucClientManager.setPresenceExtension(MucClientManager.java:163)
at org.jitsi.videobridge.xmpp.XmppConnection.setPresenceExtension(XmppConnection.kt:85)
at org.jitsi.videobridge.stats.MucStatsTransport.publishStatistics(MucStatsTransport.java:66)
at org.jitsi.videobridge.stats.StatsCollector$TransportPeriodicRunnable.doRun(StatsCollector.kt:148)
at org.jitsi.utils.concurrent.PeriodicRunnableWithObject.run(PeriodicRunnableWithObject.java:87)
at org.jitsi.utils.concurrent.RecurringRunnableExecutor.run(RecurringRunnableExecutor.java:216)
at org.jitsi.utils.concurrent.RecurringRunnableExecutor.runInThread(RecurringRunnableExecutor.java:292)
at org.jitsi.utils.concurrent.RecurringRunnableExecutor$1.run(RecurringRunnableExecutor.java:328)
JVB 2023-09-25 14:43:51.122 INFORMATION: [137] org.jivesoftware.smack.AbstractXMPPConnection.waitForClosingStreamTagFromServer: Exception while waiting for closing stream element from the server XMPPTCPConnection[jvb@auth.meet.jitsi/un6dsi1n475V] (0)
org.jivesoftware.smack.SmackException$NoResponseException: No response received within reply timeout. Timeout was 5000ms (~5s). While waiting for closing stream tag from the server [XMPPTCPConnection[jvb@auth.meet.jitsi/un6dsi1n475V] (0)]
at org.jivesoftware.smack.SmackException$NoResponseException.newWith(SmackException.java:99)
at org.jivesoftware.smack.AbstractXMPPConnection.waitForConditionOrThrowConnectionException(AbstractXMPPConnection.java:731)
at org.jivesoftware.smack.AbstractXMPPConnection.waitForClosingStreamTagFromServer(AbstractXMPPConnection.java:1017)
at org.jivesoftware.smack.tcp.XMPPTCPConnection.shutdown(XMPPTCPConnection.java:513)
at org.jivesoftware.smack.tcp.XMPPTCPConnection.shutdown(XMPPTCPConnection.java:493)
at org.jivesoftware.smack.AbstractXMPPConnection.disconnect(AbstractXMPPConnection.java:965)
at org.jivesoftware.smack.AbstractXMPPConnection.disconnect(AbstractXMPPConnection.java:937)
at org.jitsi.xmpp.mucclient.MucClient$PingFailedListenerImpl.pingFailed(MucClient.java:874)
at org.jivesoftware.smackx.ping.PingManager$8.processException(PingManager.java:465)
at org.jivesoftware.smackx.ping.PingManager$8.processException(PingManager.java:448)
at org.jivesoftware.smack.SmackFuture$2.run(SmackFuture.java:191)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)

Now every five seconds the following exception is printed to the logs:

JVB 2023-09-25 14:43:55.665 SCHWERWIEGEND: [33] [hostname=REDACTED id=public-jitsi] MucClient$MucWrapper.setPresenceExtensions#802: Failed to send stanza:
org.jivesoftware.smack.SmackException$NotConnectedException: The connection XMPPTCPConnection[jvb@auth.meet.jitsi/un6dsi1n475V] (0) is no longer connected. done=true smResumptionPossible=false
at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketWriter.throwNotConnectedExceptionIfDoneAndResumptionNotPossible(XMPPTCPConnection.java:1242)
at org.jivesoftware.smack.tcp.XMPPTCPConnection.throwNotConnectedExceptionIfAppropriate(XMPPTCPConnection.java:351)
at org.jivesoftware.smack.AbstractXMPPConnection.sendStanza(AbstractXMPPConnection.java:858)
at org.jitsi.xmpp.mucclient.MucClient$MucWrapper.setPresenceExtensions(MucClient.java:798)
at org.jitsi.xmpp.mucclient.MucClient.lambda$setPresenceExtensions$5(MucClient.java:470)
at java.base/java.util.concurrent.ConcurrentHashMap$ValuesView.forEach(ConcurrentHashMap.java:4780)
at org.jitsi.xmpp.mucclient.MucClient.setPresenceExtensions(MucClient.java:470)
at org.jitsi.xmpp.mucclient.MucClient.setPresenceExtension(MucClient.java:454)
at org.jitsi.xmpp.mucclient.MucClientManager.lambda$setPresenceExtension$0(MucClientManager.java:164)
at java.base/java.util.concurrent.ConcurrentHashMap$ValuesView.forEach(ConcurrentHashMap.java:4780)
at org.jitsi.xmpp.mucclient.MucClientManager.setPresenceExtension(MucClientManager.java:163)
at org.jitsi.videobridge.xmpp.XmppConnection.setPresenceExtension(XmppConnection.kt:85)
at org.jitsi.videobridge.stats.MucStatsTransport.publishStatistics(MucStatsTransport.java:66)
at org.jitsi.videobridge.stats.StatsCollector$TransportPeriodicRunnable.doRun(StatsCollector.kt:148)
at org.jitsi.utils.concurrent.PeriodicRunnableWithObject.run(PeriodicRunnableWithObject.java:87)
at org.jitsi.utils.concurrent.RecurringRunnableExecutor.run(RecurringRunnableExecutor.java:216)
at org.jitsi.utils.concurrent.RecurringRunnableExecutor.runInThread(RecurringRunnableExecutor.java:292)
at org.jitsi.utils.concurrent.RecurringRunnableExecutor$1.run(RecurringRunnableExecutor.java:328)

After ~ 15 minutes the connection is closed and the videobridge successfully establishes a connection. The videobridge immediately establishes a connection when we manually restart the service. It ist also possible to connect to the XMPP Server with telnet after the error happens.

JVB 2023-09-25 14:57:13.185 INFORMATION: [136] [hostname=REDACTED id=public-jitsi] MucClient$2.connectionClosed#334: Closed.
JVB 2023-09-25 14:57:15.664 WARNUNG: [33] [hostname=REDACTED id=public-jitsi] MucClient.setPresenceExtensions#467: Cannot set presence extension: not connected.
JVB 2023-09-25 14:57:16.072 WARNUNG: [26] [hostname=REDACTED id=public-jitsi] MucClient.lambda$getConnectAndLoginCallable$9#640: Error connecting:
org.jivesoftware.smack.SmackException$SmackWrappedException: java.net.SocketException: Die Verbindung wurde vom Kommunikationspartner zurückgesetzt
at org.jivesoftware.smack.AbstractXMPPConnection.setCurrentConnectionExceptionAndNotify(AbstractXMPPConnection.java:687)
at org.jivesoftware.smack.AbstractXMPPConnection.notifyConnectionError(AbstractXMPPConnection.java:987)
at org.jivesoftware.smack.tcp.XMPPTCPConnection.access$5400(XMPPTCPConnection.java:131)
at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketWriter.writePackets(XMPPTCPConnection.java:1419)
at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketWriter.access$4200(XMPPTCPConnection.java:1172)
at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketWriter$1.run(XMPPTCPConnection.java:1222)
at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.net.SocketException: Die Verbindung wurde vom Kommunikationspartner zurückgesetzt
at java.base/sun.nio.ch.NioSocketImpl.implWrite(NioSocketImpl.java:425)
at java.base/sun.nio.ch.NioSocketImpl.write(NioSocketImpl.java:445)
at java.base/sun.nio.ch.NioSocketImpl$2.write(NioSocketImpl.java:831)
at java.base/java.net.Socket$SocketOutputStream.write(Socket.java:1035)
at java.base/sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:234)
at java.base/sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:304)
at java.base/sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:282)
at java.base/sun.nio.cs.StreamEncoder.write(StreamEncoder.java:132)
at java.base/sun.nio.cs.StreamEncoder.write(StreamEncoder.java:142)
at java.base/java.io.Writer.write(Writer.java:249)
at java.base/java.io.OutputStreamWriter.append(OutputStreamWriter.java:237)
at org.jivesoftware.smack.util.XmlStringBuilder.lambda$write$0(XmlStringBuilder.java:675)
at org.jivesoftware.smack.util.XmlStringBuilder.appendXmlTo(XmlStringBuilder.java:717)
at org.jivesoftware.smack.util.XmlStringBuilder.appendXmlTo(XmlStringBuilder.java:707)
at org.jivesoftware.smack.util.XmlStringBuilder.write(XmlStringBuilder.java:673)
at org.jivesoftware.smack.tcp.XMPPTCPConnection$PacketWriter.writePackets(XMPPTCPConnection.java:1356)
... 3 more

Expected Behavior

The videobridge should be more should be more tolerant of network errors and recover faster.

Environment details

The videobridge runs on a bare metal server (OS is Debian)
Prosody, Jicofo, Web are running on Kubernetes

Videobridge: 2.3-44-g8983b11f-1
Prosody: stable-8719
Jicofo: stable-8719

What is the value of net.ipv4.tcp_retries2 in sysconf?

freym commented

$ sysctl net.ipv4.tcp_retries2
net.ipv4.tcp_retries2 = 15

This means that the kernel will detect broken TCP connections after the 13-15 min mark.
You can try setting it to 7, this will give you roughly 25 to 50 seconds.

@freym Can you clarify steps to reproduce? I get a "closed on error" log immediately after restarting prosody (and then a successful reconnect).

freym commented

It happened when we restarted Prosody once, but we now think the problem is related to our network setup and Kubernetes. However, reducing net.ipv4.tcp_retries2 to 7 has alleviated the problem somewhat.

jbg commented

In k8s it could also happen if the prosody deployment is restarted and the new pod(s) take a long time to go Ready (eg due to readiness or startup probes); if you're using a service DNS name to reach them from JVB then it won't resolve to the new pod IP(s) until they're Ready.

Can also be related to a too-long TTL being served on those service A records (configured in CoreDNS config on most k8s deployments) so JVB still has the old IP for a while.

The "Ping failed, the XMPP connection needs to reconnect." message you see early on should trigger a reconnect. I'm surprised that in your case it doesn't log anything about trying to reconnect.

I just tested the case where the server stops responding. As expected the XMPP pings timeout in 30-60 seconds and they trigger a re-connect. I suspect it has something to do with your setup (feel free to reopen if you have more info or more questions).