jitsi/ice4j

NullPointerException with multiple network interfaces and NAT

debfx opened this issue · 20 comments

debfx commented

I am getting the following NullPointerException in ice4j when running the Jitsi stable-4627-1 Docker stack. As a result no communication through the video bridge is possible. I've verified that the same happens when replacing the bundled ice4j library with the latest ice4j version (commit 7016902).

SEVERE: [54] org.ice4j.stack.NetAccessManager.handleFatalError: Unexpected Error!
java.lang.NullPointerException
        at org.ice4j.socket.MergingDatagramSocket.initializeActive(MergingDatagramSocket.java:577)
        at org.ice4j.ice.ComponentSocket.propertyChange(ComponentSocket.java:174)
        at org.ice4j.ice.IceMediaStream.firePairPropertyChange(IceMediaStream.java:870)
        at org.ice4j.ice.CandidatePair.nominate(CandidatePair.java:629)
        at org.ice4j.ice.Agent.nominate(Agent.java:1847)
        at org.ice4j.ice.DefaultNominator.strategyNominateFirstValid(DefaultNominator.java:144)
        at org.ice4j.ice.DefaultNominator.propertyChange(DefaultNominator.java:120)
        at org.ice4j.ice.IceMediaStream.firePairPropertyChange(IceMediaStream.java:870)
        at org.ice4j.ice.CandidatePair.validate(CandidatePair.java:667)
        at org.ice4j.ice.IceMediaStream.addToValidList(IceMediaStream.java:668)
        at org.ice4j.ice.Agent.validatePair(Agent.java:1811)
        at org.ice4j.ice.ConnectivityCheckClient.processSuccessResponse(ConnectivityCheckClient.java:638)
        at org.ice4j.ice.ConnectivityCheckClient.processResponse(ConnectivityCheckClient.java:405)
        at org.ice4j.stack.StunClientTransaction.handleResponse(StunClientTransaction.java:314)
        at org.ice4j.stack.StunStack.handleMessageEvent(StunStack.java:1040)
        at org.ice4j.stack.MessageProcessingTask.run(MessageProcessingTask.java:196)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Running the jvb Docker container means it runs behind a NAT and it has two network interfaces:

  • bridge: where it can communicate with the internet
  • meet.jitsi: internal network where it communicates with the other Jitsi components

In this setup I found one configuration where the exception does not happen:

  • Disable org.ice4j.ice.harvest.STUN_MAPPING_HARVESTER_ADDRESSES
  • Set org.ice4j.ice.harvest.NAT_HARVESTER_PUBLIC_ADDRESS to the public IP
  • Set org.ice4j.ice.harvest.NAT_HARVESTER_LOCAL_ADDRESS to the internal IP of the bridge interface

Notably the exception still occurs when NAT_HARVESTER_LOCAL_ADDRESS is set to the IP of the meet.jitsi interface.

I've done some debugging with the STUN server enabled and NAT_HARVESTER_PUBLIC_ADDRESS / NAT_HARVESTER_LOCAL_ADDRESS unset.

Before the exception occurs, I noticed the following in ConnectivityCheckClient::processSuccessResponse():

  • checkedPair contains the correct local IP

    CandidatePair (State=In-Progress Priority=7961835276064522239):
    LocalCandidate=candidate:2 1 udp 2130706431 <local IP of bridge interface> 10000 typ host
    RemoteCandidate=candidate:10000 1 udp 1853759231 <CLIENT-IP> 9952 typ prflx
    
  • validPair contains the IP of the wrong (meet.jitsi) interface

    CandidatePair (State=Frozen Priority=7205771497833250302):
    LocalCandidate=candidate:3 1 udp 1677724415 <PUBLIC-IP> 10000 typ srflx raddr <local IP of meet.jitsi interface> rport 10000
    RemoteCandidate=candidate:10000 1 udp 1853759231 <CLIENT-IP> 9952 typ prflx
    

Later in the stack trace ComponentSocket::propertyChange() calls localCandidate.getCandidateIceSocketWrapper(remoteAddress) which is SinglePortUdpHarvester$MyCandidate::getCandidateIceSocketWrapper().
The candidateSockets in that object is empty thus it returns null. This eventually leads to the NullPointerException.

Please let me know if I can help in any way to further debug this.

debfx commented

It's worth noting that the exception doesn't always occur but after a few tries I could always reproduce them after a few attempts.

Complete jvb log leading up to the exception: jvb.log

Does the problem occur every time, or only some of the time?

Are you able to collect full ICE logs?

Are you able to do a search of the history to see when the problem was introduced? In particular, I'm wondering if it might have been introduced by e90c796 , or possibly by one of the commits right before that.

Sorry, race between your comment and mine, ignore my first two questions above.

I'm having a little bit of trouble following your description - you said that validPair contains the wrong interface - is the internal Docker interface?

In the logs, I see both 172.23.0.5 and 172.22.0.6 addresses that haven't been elided - which ones are those?

Are you able to debug the Java process to tell which address is being passed to localCandidate.getCandidateIceSocketWrapper(remoteAddress)?

debfx commented

I'm getting the same exception with commit 982e782
Do you want me to go back further?

172.22.0.6 is the IP of the bridge interface, 172.23.0.5 of the internal meet.jitsi interface.

So in this example validPair would have something like
LocalCandidate=candidate:3 1 udp 1677724415 <PUBLIC-IP> 10001 typ srflx raddr 172.23.0.5 rport 10001

Since 172.22.0.6 is the IP on the interface with the default route it seems to me that LocalCandidate should have that one.

default via 172.22.0.1 dev eth0 
172.22.0.0/16 dev eth0 scope link  src 172.22.0.6

I'll check what value remoteAddress has.

Okay, if it still happens with 982e782 it's very unlikely to have been introduced by either of the two commits before it, but can you test with 36f732b just to be sure?

What is the STUN server set to? Something genuinely on the public Internet, or something local to the docker?

debfx commented

remoteAddress is <CLIENT-IP>:43295/udp (i.e. public IP of the client)

I can test with the older commit but I need to backport 982e782 as jvb is using that interface.

The STUN server is set to meet-jit-si-turnrelay.jitsi.net:443

debfx commented

Same exception with 36f732b (commit 982e782 backported).

Okay, so the problem at least isn't from my recent patches. 🙂

@bgrozev do you have any insight?

@debfx Can you please include the startup logs or grep them for Discovered public address?

My guess is that something goes wrong because we have two local candidates with the same address (but different rel-addr):

<candidate component='1' foundation='3' generation='0' id='21461fd35e31ee68091343f' network='0' priority='1677724415' protocol='udp' type='srflx' ip='<PUBLIC-IP>' port='10001' rel-addr='172.23.0.5' rel-port='10001'/>
<candidate component='1' foundation='4' generation='0' id='5fc15e35e31ee68091343f' network='0' priority='1677724415' protocol='udp' type='srflx' ip='<PUBLIC-IP>' port='10001' rel-addr='172.22.0.6' rel-port='10001'/>

debfx commented

@bgrozev You are right, it discovers the public IP with both local addresses: jvb-startup.log

Interestingly both STUN requests are sent on the bridge interface (with source IP 172.22.0.6 and 172.23.0.5). The meet.jitsi interface (where 172.23.0.5 is configured) doesn't have a default route. The reply for 172.23.0.5 is routed through the meet.jitsi interface.

tu-nv commented

support this issue. Currently, to solve this problem, I need to disable STUN server, and manually config DOCKER_HOST_ADDRESS for each jvb node, which prevents me from building a docker swarm with scalable jvb.

@debfx Thanks for reporting and helping to trace this. Can you please test if the problem occurs with org.ice4j.ice.harvest.BLOCKED_INTERFACES=bridge? This is to confirm that it has to do with STUN and multiple interfaces, as opposed to something else docker-related.

debfx commented

@bgrozev Setting BLOCKED_INTERFACES indeed fixes the problem.

Another data point: Disabling the STUN server and setting NAT_HARVESTER_LOCAL_ADDRESS to the IP of the internal interface also triggers the exception.

org.ice4j.ice.harvest.NAT_HARVESTER_LOCAL_ADDRESS=172.23.0.5
org.ice4j.ice.harvest.NAT_HARVESTER_PUBLIC_ADDRESS=<PUBLIC-IP>

Setting NAT_HARVESTER_LOCAL_ADDRESS=172.22.0.6 on the other hand works fine.

Another data point: Disabling the STUN server and setting NAT_HARVESTER_LOCAL_ADDRESS to the IP of the internal interface also triggers the exception.

Good find. I can repro it this way it with a dummy virtual interface. I'll try and trace it, but it might have to wait for next week.

debfx commented

Thanks a lot for looking into it!

@debfx can you please see if #208 fixes the problem in your setup?

debfx commented

I get a build error using the PR:

[ERROR] ice4j/src/main/java/org/ice4j/ice/Agent.java:[1480,16] reference to findLocalCandidate is ambiguous
[ERROR]   both method findLocalCandidate(org.ice4j.TransportAddress,org.ice4j.ice.LocalCandidate) in org.ice4j.ice.Agent and method findLocalCandidate(org.ice4j.TransportAddress,java.lang.String) in org.ice4j.ice.Agent match

I've changed the function to this and done some testing:

    public LocalCandidate findLocalCandidate(TransportAddress localAddress)
    {
        return findLocalCandidate(localAddress, (LocalCandidate)null);
    }

Using the STUN works fine now.

However I still get the NullPointerException when setting org.ice4j.ice.harvest.NAT_HARVESTER_LOCAL_ADDRESS=172.23.0.5

The PR will also need #207 applied first, to remove the (unused) function that's giving the ambiguous reference.

The NPE now only happens if you explicitly lie to the config, i.e. tell it the local address is one that can't actually be used as the source of media, right? If so, I'm not too worried about that causing problems.

debfx commented

Correct, with the PR applied I can only reproduce the NPE when ice4j is misconfigured.