NullPointerException with multiple network interfaces and NAT

Question

NullPointerException with multiple network interfaces and NAT

debfx opened this issue 4 years ago · 20 comments

I am getting the following NullPointerException in ice4j when running the Jitsi stable-4627-1 Docker stack. As a result no communication through the video bridge is possible. I've verified that the same happens when replacing the bundled ice4j library with the latest ice4j version (commit 7016902).

SEVERE: [54] org.ice4j.stack.NetAccessManager.handleFatalError: Unexpected Error!
java.lang.NullPointerException
        at org.ice4j.socket.MergingDatagramSocket.initializeActive(MergingDatagramSocket.java:577)
        at org.ice4j.ice.ComponentSocket.propertyChange(ComponentSocket.java:174)
        at org.ice4j.ice.IceMediaStream.firePairPropertyChange(IceMediaStream.java:870)
        at org.ice4j.ice.CandidatePair.nominate(CandidatePair.java:629)
        at org.ice4j.ice.Agent.nominate(Agent.java:1847)
        at org.ice4j.ice.DefaultNominator.strategyNominateFirstValid(DefaultNominator.java:144)
        at org.ice4j.ice.DefaultNominator.propertyChange(DefaultNominator.java:120)
        at org.ice4j.ice.IceMediaStream.firePairPropertyChange(IceMediaStream.java:870)
        at org.ice4j.ice.CandidatePair.validate(CandidatePair.java:667)
        at org.ice4j.ice.IceMediaStream.addToValidList(IceMediaStream.java:668)
        at org.ice4j.ice.Agent.validatePair(Agent.java:1811)
        at org.ice4j.ice.ConnectivityCheckClient.processSuccessResponse(ConnectivityCheckClient.java:638)
        at org.ice4j.ice.ConnectivityCheckClient.processResponse(ConnectivityCheckClient.java:405)
        at org.ice4j.stack.StunClientTransaction.handleResponse(StunClientTransaction.java:314)
        at org.ice4j.stack.StunStack.handleMessageEvent(StunStack.java:1040)
        at org.ice4j.stack.MessageProcessingTask.run(MessageProcessingTask.java:196)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Running the jvb Docker container means it runs behind a NAT and it has two network interfaces:

bridge: where it can communicate with the internet
meet.jitsi: internal network where it communicates with the other Jitsi components

In this setup I found one configuration where the exception does not happen:

Disable org.ice4j.ice.harvest.STUN_MAPPING_HARVESTER_ADDRESSES
Set org.ice4j.ice.harvest.NAT_HARVESTER_PUBLIC_ADDRESS to the public IP
Set org.ice4j.ice.harvest.NAT_HARVESTER_LOCAL_ADDRESS to the internal IP of the bridge interface

Notably the exception still occurs when NAT_HARVESTER_LOCAL_ADDRESS is set to the IP of the meet.jitsi interface.

I've done some debugging with the STUN server enabled and NAT_HARVESTER_PUBLIC_ADDRESS / NAT_HARVESTER_LOCAL_ADDRESS unset.

Before the exception occurs, I noticed the following in ConnectivityCheckClient::processSuccessResponse():

checkedPair contains the correct local IP

CandidatePair (State=In-Progress Priority=7961835276064522239):
LocalCandidate=candidate:2 1 udp 2130706431 <local IP of bridge interface> 10000 typ host
RemoteCandidate=candidate:10000 1 udp 1853759231 <CLIENT-IP> 9952 typ prflx

validPair contains the IP of the wrong (meet.jitsi) interface

CandidatePair (State=Frozen Priority=7205771497833250302):
LocalCandidate=candidate:3 1 udp 1677724415 <PUBLIC-IP> 10000 typ srflx raddr <local IP of meet.jitsi interface> rport 10000
RemoteCandidate=candidate:10000 1 udp 1853759231 <CLIENT-IP> 9952 typ prflx

Later in the stack trace ComponentSocket::propertyChange() calls localCandidate.getCandidateIceSocketWrapper(remoteAddress) which is SinglePortUdpHarvester$MyCandidate::getCandidateIceSocketWrapper().
The candidateSockets in that object is empty thus it returns null. This eventually leads to the NullPointerException.

Please let me know if I can help in any way to further debug this.

Answer 1 · 2020-06-03T20:30:40.000Z

It's worth noting that the exception doesn't always occur but after a few tries I could always reproduce them after a few attempts.

Complete jvb log leading up to the exception: jvb.log

Answer 2 · 2020-06-03T20:32:51.000Z

Does the problem occur every time, or only some of the time?

Are you able to collect full ICE logs?

Are you able to do a search of the history to see when the problem was introduced? In particular, I'm wondering if it might have been introduced by e90c796 , or possibly by one of the commits right before that.

Answer 3 · 2020-06-03T20:35:27.000Z

Sorry, race between your comment and mine, ignore my first two questions above.

Answer 4 · 2020-06-03T20:47:57.000Z

I'm having a little bit of trouble following your description - you said that validPair contains the wrong interface - is the internal Docker interface?

In the logs, I see both 172.23.0.5 and 172.22.0.6 addresses that haven't been elided - which ones are those?

Are you able to debug the Java process to tell which address is being passed to localCandidate.getCandidateIceSocketWrapper(remoteAddress)?

Answer 5 · 2020-06-03T21:06:02.000Z

I'm getting the same exception with commit 982e782
Do you want me to go back further?

172.22.0.6 is the IP of the bridge interface, 172.23.0.5 of the internal meet.jitsi interface.

So in this example validPair would have something like
LocalCandidate=candidate:3 1 udp 1677724415 <PUBLIC-IP> 10001 typ srflx raddr 172.23.0.5 rport 10001

Since 172.22.0.6 is the IP on the interface with the default route it seems to me that LocalCandidate should have that one.

default via 172.22.0.1 dev eth0 
172.22.0.0/16 dev eth0 scope link  src 172.22.0.6

I'll check what value remoteAddress has.

Answer 6 · 2020-06-03T21:12:58.000Z

Okay, if it still happens with 982e782 it's very unlikely to have been introduced by either of the two commits before it, but can you test with 36f732b just to be sure?

What is the STUN server set to? Something genuinely on the public Internet, or something local to the docker?

Answer 7 · 2020-06-03T21:18:58.000Z

remoteAddress is <CLIENT-IP>:43295/udp (i.e. public IP of the client)

I can test with the older commit but I need to backport 982e782 as jvb is using that interface.

The STUN server is set to meet-jit-si-turnrelay.jitsi.net:443

Answer 8 · 2020-06-03T21:24:39.000Z

Same exception with 36f732b (commit 982e782 backported).

Answer 9 · 2020-06-03T21:27:51.000Z

Okay, so the problem at least isn't from my recent patches. 🙂

@bgrozev do you have any insight?

Answer 10 · 2020-06-03T22:25:26.000Z

@debfx Can you please include the startup logs or grep them for Discovered public address?

My guess is that something goes wrong because we have two local candidates with the same address (but different rel-addr):

<candidate component='1' foundation='3' generation='0' id='21461fd35e31ee68091343f' network='0' priority='1677724415' protocol='udp' type='srflx' ip='<PUBLIC-IP>' port='10001' rel-addr='172.23.0.5' rel-port='10001'/>
<candidate component='1' foundation='4' generation='0' id='5fc15e35e31ee68091343f' network='0' priority='1677724415' protocol='udp' type='srflx' ip='<PUBLIC-IP>' port='10001' rel-addr='172.22.0.6' rel-port='10001'/>

Answer 11 · 2020-06-04T05:45:20.000Z

@bgrozev You are right, it discovers the public IP with both local addresses: jvb-startup.log

Interestingly both STUN requests are sent on the bridge interface (with source IP 172.22.0.6 and 172.23.0.5). The meet.jitsi interface (where 172.23.0.5 is configured) doesn't have a default route. The reply for 172.23.0.5 is routed through the meet.jitsi interface.

Answer 12 · 2020-06-04T07:39:37.000Z

support this issue. Currently, to solve this problem, I need to disable STUN server, and manually config DOCKER_HOST_ADDRESS for each jvb node, which prevents me from building a docker swarm with scalable jvb.

Answer 13 · 2020-06-04T13:38:05.000Z

@debfx Thanks for reporting and helping to trace this. Can you please test if the problem occurs with org.ice4j.ice.harvest.BLOCKED_INTERFACES=bridge? This is to confirm that it has to do with STUN and multiple interfaces, as opposed to something else docker-related.

Answer 14 · 2020-06-04T16:43:27.000Z

@bgrozev Setting BLOCKED_INTERFACES indeed fixes the problem.

Another data point: Disabling the STUN server and setting NAT_HARVESTER_LOCAL_ADDRESS to the IP of the internal interface also triggers the exception.

org.ice4j.ice.harvest.NAT_HARVESTER_LOCAL_ADDRESS=172.23.0.5
org.ice4j.ice.harvest.NAT_HARVESTER_PUBLIC_ADDRESS=<PUBLIC-IP>

Setting NAT_HARVESTER_LOCAL_ADDRESS=172.22.0.6 on the other hand works fine.

Answer 15 · 2020-06-04T17:32:52.000Z

Another data point: Disabling the STUN server and setting NAT_HARVESTER_LOCAL_ADDRESS to the IP of the internal interface also triggers the exception.

Good find. I can repro it this way it with a dummy virtual interface. I'll try and trace it, but it might have to wait for next week.

Answer 16 · 2020-06-04T19:14:19.000Z

Thanks a lot for looking into it!

Answer 17 · 2020-06-05T22:45:38.000Z

@debfx can you please see if #208 fixes the problem in your setup?

Answer 18 · 2020-06-06T09:48:26.000Z

I get a build error using the PR:

[ERROR] ice4j/src/main/java/org/ice4j/ice/Agent.java:[1480,16] reference to findLocalCandidate is ambiguous
[ERROR]   both method findLocalCandidate(org.ice4j.TransportAddress,org.ice4j.ice.LocalCandidate) in org.ice4j.ice.Agent and method findLocalCandidate(org.ice4j.TransportAddress,java.lang.String) in org.ice4j.ice.Agent match

I've changed the function to this and done some testing:

    public LocalCandidate findLocalCandidate(TransportAddress localAddress)
    {
        return findLocalCandidate(localAddress, (LocalCandidate)null);
    }

Using the STUN works fine now.

However I still get the NullPointerException when setting org.ice4j.ice.harvest.NAT_HARVESTER_LOCAL_ADDRESS=172.23.0.5

Answer 19 · 2020-06-08T16:53:13.000Z

The PR will also need #207 applied first, to remove the (unused) function that's giving the ambiguous reference.

The NPE now only happens if you explicitly lie to the config, i.e. tell it the local address is one that can't actually be used as the source of media, right? If so, I'm not too worried about that causing problems.

Answer 20 · 2020-06-08T18:57:35.000Z

Correct, with the PR applied I can only reproduce the NPE when ice4j is misconfigured.