NullPointerException with multiple network interfaces and NAT
debfx opened this issue · 20 comments
I am getting the following NullPointerException in ice4j when running the Jitsi stable-4627-1 Docker stack. As a result no communication through the video bridge is possible. I've verified that the same happens when replacing the bundled ice4j library with the latest ice4j version (commit 7016902).
SEVERE: [54] org.ice4j.stack.NetAccessManager.handleFatalError: Unexpected Error!
java.lang.NullPointerException
at org.ice4j.socket.MergingDatagramSocket.initializeActive(MergingDatagramSocket.java:577)
at org.ice4j.ice.ComponentSocket.propertyChange(ComponentSocket.java:174)
at org.ice4j.ice.IceMediaStream.firePairPropertyChange(IceMediaStream.java:870)
at org.ice4j.ice.CandidatePair.nominate(CandidatePair.java:629)
at org.ice4j.ice.Agent.nominate(Agent.java:1847)
at org.ice4j.ice.DefaultNominator.strategyNominateFirstValid(DefaultNominator.java:144)
at org.ice4j.ice.DefaultNominator.propertyChange(DefaultNominator.java:120)
at org.ice4j.ice.IceMediaStream.firePairPropertyChange(IceMediaStream.java:870)
at org.ice4j.ice.CandidatePair.validate(CandidatePair.java:667)
at org.ice4j.ice.IceMediaStream.addToValidList(IceMediaStream.java:668)
at org.ice4j.ice.Agent.validatePair(Agent.java:1811)
at org.ice4j.ice.ConnectivityCheckClient.processSuccessResponse(ConnectivityCheckClient.java:638)
at org.ice4j.ice.ConnectivityCheckClient.processResponse(ConnectivityCheckClient.java:405)
at org.ice4j.stack.StunClientTransaction.handleResponse(StunClientTransaction.java:314)
at org.ice4j.stack.StunStack.handleMessageEvent(StunStack.java:1040)
at org.ice4j.stack.MessageProcessingTask.run(MessageProcessingTask.java:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Running the jvb Docker container means it runs behind a NAT and it has two network interfaces:
- bridge: where it can communicate with the internet
- meet.jitsi: internal network where it communicates with the other Jitsi components
In this setup I found one configuration where the exception does not happen:
- Disable
org.ice4j.ice.harvest.STUN_MAPPING_HARVESTER_ADDRESSES
- Set
org.ice4j.ice.harvest.NAT_HARVESTER_PUBLIC_ADDRESS
to the public IP - Set
org.ice4j.ice.harvest.NAT_HARVESTER_LOCAL_ADDRESS
to the internal IP of the bridge interface
Notably the exception still occurs when NAT_HARVESTER_LOCAL_ADDRESS
is set to the IP of the meet.jitsi interface.
I've done some debugging with the STUN server enabled and NAT_HARVESTER_PUBLIC_ADDRESS
/ NAT_HARVESTER_LOCAL_ADDRESS
unset.
Before the exception occurs, I noticed the following in ConnectivityCheckClient::processSuccessResponse():
-
checkedPair
contains the correct local IPCandidatePair (State=In-Progress Priority=7961835276064522239): LocalCandidate=candidate:2 1 udp 2130706431 <local IP of bridge interface> 10000 typ host RemoteCandidate=candidate:10000 1 udp 1853759231 <CLIENT-IP> 9952 typ prflx
-
validPair
contains the IP of the wrong (meet.jitsi) interfaceCandidatePair (State=Frozen Priority=7205771497833250302): LocalCandidate=candidate:3 1 udp 1677724415 <PUBLIC-IP> 10000 typ srflx raddr <local IP of meet.jitsi interface> rport 10000 RemoteCandidate=candidate:10000 1 udp 1853759231 <CLIENT-IP> 9952 typ prflx
Later in the stack trace ComponentSocket::propertyChange()
calls localCandidate.getCandidateIceSocketWrapper(remoteAddress)
which is SinglePortUdpHarvester$MyCandidate::getCandidateIceSocketWrapper()
.
The candidateSockets
in that object is empty thus it returns null. This eventually leads to the NullPointerException.
Please let me know if I can help in any way to further debug this.
It's worth noting that the exception doesn't always occur but after a few tries I could always reproduce them after a few attempts.
Complete jvb log leading up to the exception: jvb.log
Does the problem occur every time, or only some of the time?
Are you able to collect full ICE logs?
Are you able to do a search of the history to see when the problem was introduced? In particular, I'm wondering if it might have been introduced by e90c796 , or possibly by one of the commits right before that.
Sorry, race between your comment and mine, ignore my first two questions above.
I'm having a little bit of trouble following your description - you said that validPair contains the wrong interface - is the internal Docker interface?
In the logs, I see both 172.23.0.5 and 172.22.0.6 addresses that haven't been elided - which ones are those?
Are you able to debug the Java process to tell which address is being passed to localCandidate.getCandidateIceSocketWrapper(remoteAddress)
?
I'm getting the same exception with commit 982e782
Do you want me to go back further?
172.22.0.6 is the IP of the bridge interface, 172.23.0.5 of the internal meet.jitsi interface.
So in this example validPair would have something like
LocalCandidate=candidate:3 1 udp 1677724415 <PUBLIC-IP> 10001 typ srflx raddr 172.23.0.5 rport 10001
Since 172.22.0.6 is the IP on the interface with the default route it seems to me that LocalCandidate should have that one.
default via 172.22.0.1 dev eth0
172.22.0.0/16 dev eth0 scope link src 172.22.0.6
I'll check what value remoteAddress
has.
remoteAddress
is <CLIENT-IP>:43295/udp
(i.e. public IP of the client)
I can test with the older commit but I need to backport 982e782 as jvb is using that interface.
The STUN server is set to meet-jit-si-turnrelay.jitsi.net:443
Okay, so the problem at least isn't from my recent patches. 🙂
@bgrozev do you have any insight?
@debfx Can you please include the startup logs or grep them for Discovered public address
?
My guess is that something goes wrong because we have two local candidates with the same address (but different rel-addr):
<candidate component='1' foundation='3' generation='0' id='21461fd35e31ee68091343f' network='0' priority='1677724415' protocol='udp' type='srflx' ip='<PUBLIC-IP>' port='10001' rel-addr='172.23.0.5' rel-port='10001'/>
<candidate component='1' foundation='4' generation='0' id='5fc15e35e31ee68091343f' network='0' priority='1677724415' protocol='udp' type='srflx' ip='<PUBLIC-IP>' port='10001' rel-addr='172.22.0.6' rel-port='10001'/>
@bgrozev You are right, it discovers the public IP with both local addresses: jvb-startup.log
Interestingly both STUN requests are sent on the bridge interface (with source IP 172.22.0.6 and 172.23.0.5). The meet.jitsi interface (where 172.23.0.5 is configured) doesn't have a default route. The reply for 172.23.0.5 is routed through the meet.jitsi interface.
support this issue. Currently, to solve this problem, I need to disable STUN server, and manually config DOCKER_HOST_ADDRESS for each jvb node, which prevents me from building a docker swarm with scalable jvb.
@debfx Thanks for reporting and helping to trace this. Can you please test if the problem occurs with org.ice4j.ice.harvest.BLOCKED_INTERFACES=bridge
? This is to confirm that it has to do with STUN and multiple interfaces, as opposed to something else docker-related.
@bgrozev Setting BLOCKED_INTERFACES indeed fixes the problem.
Another data point: Disabling the STUN server and setting NAT_HARVESTER_LOCAL_ADDRESS to the IP of the internal interface also triggers the exception.
org.ice4j.ice.harvest.NAT_HARVESTER_LOCAL_ADDRESS=172.23.0.5
org.ice4j.ice.harvest.NAT_HARVESTER_PUBLIC_ADDRESS=<PUBLIC-IP>
Setting NAT_HARVESTER_LOCAL_ADDRESS=172.22.0.6
on the other hand works fine.
Another data point: Disabling the STUN server and setting NAT_HARVESTER_LOCAL_ADDRESS to the IP of the internal interface also triggers the exception.
Good find. I can repro it this way it with a dummy virtual interface. I'll try and trace it, but it might have to wait for next week.
Thanks a lot for looking into it!
I get a build error using the PR:
[ERROR] ice4j/src/main/java/org/ice4j/ice/Agent.java:[1480,16] reference to findLocalCandidate is ambiguous
[ERROR] both method findLocalCandidate(org.ice4j.TransportAddress,org.ice4j.ice.LocalCandidate) in org.ice4j.ice.Agent and method findLocalCandidate(org.ice4j.TransportAddress,java.lang.String) in org.ice4j.ice.Agent match
I've changed the function to this and done some testing:
public LocalCandidate findLocalCandidate(TransportAddress localAddress)
{
return findLocalCandidate(localAddress, (LocalCandidate)null);
}
Using the STUN works fine now.
However I still get the NullPointerException when setting org.ice4j.ice.harvest.NAT_HARVESTER_LOCAL_ADDRESS=172.23.0.5
The PR will also need #207 applied first, to remove the (unused) function that's giving the ambiguous reference.
The NPE now only happens if you explicitly lie to the config, i.e. tell it the local address is one that can't actually be used as the source of media, right? If so, I'm not too worried about that causing problems.
Correct, with the PR applied I can only reproduce the NPE when ice4j is misconfigured.