ortuman/jackal

s2s lookup error

Closed this issue · 7 comments

jackal can't find a server, even though dig shows it resolves correctly.

2019-03-11 23:51:42 💥 [ERR] s2s/server:94 - lookup _xmpp-server._tcp.riotcat.org on 9.9.9.9:53: no such host
# dig riotcat.org

; <<>> DiG 9.10.3-P4-Debian <<>> riotcat.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 59702
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;riotcat.org.			IN	A

;; ANSWER SECTION:
riotcat.org.		3586	IN	A	87.106.127.220

;; Query time: 0 msec
;; SERVER: 9.9.9.9#53(9.9.9.9)
;; WHEN: Mon Mar 11 23:52:57 UTC 2019
;; MSG SIZE  rcvd: 56

The error is consistent; I can't get any kind of connection to that server at all.

This isn't a global issue: I can successfully connect to other servers and send messages to users on them. According to compliance.conversations.im, the remote server is running Prosody 0.11.

Then, while checking the logs for more errors, I found that a server I can successfully communicate with also returns the same error occasionally (2019-03-12 01:54:52 💥 [ERR] s2s/server:94 - lookup _xmpp-server._tcp.chat.404.city on 9.9.9.9:53: no such host). The remote server with the intermittent error is running ejabberd 18.12.1-2~bpo9+1.

Maybe it's something I can tweak in the config? The s2s section of my jackal.yml is the same as in the example:

s2s:
    dial_timeout: 15
    dialback_secret: s3cr3tf0rd14lb4ck
    max_stanza_size: 131072

    transport:
      bind_addr: 0.0.0.0
      port: 5269
      keep_alive: 600

Also, it looks like I can receive messages from the server that lookup always fails for.

Better dig lookup:

dig SRV _xmpp-server._tcp.riotcat.org

; <<>> DiG 9.11.5-P4-1-Debian <<>> SRV _xmpp-server._tcp.riotcat.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 28650
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;_xmpp-server._tcp.riotcat.org.	IN	SRV

;; AUTHORITY SECTION:
riotcat.org.		300	IN	SOA	ns1042.ui-dns.org. hostmaster.kundenserver.de. 2016020102 28800 7200 604800 300

;; Query time: 239 msec
;; SERVER: 9.9.9.9#53(9.9.9.9)
;; WHEN: Tue Mar 12 23:27:01 -03 2019
;; MSG SIZE  rcvd: 134

OK, I did some more debugging. The error occurs when it runs net.LookupSRV("xmpp-server", "tcp", "riotcat.org"). Possibly an occurrence of this bug.

Other things I tried, without success, to get around this bug (as hinted at by comments in the bug linked above):

  • setting limit of open files to unlimited (/etc/security/limits.conf)
  • changing to possibly faster DNS servers (to decrease the chance of a timeout; tried Google's 8.8.8.8 and Cloudflare's 1.1.1.1)
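
For anyone debugging something similar, here is a rough Go sketch of how the failing net.LookupSRV call maps to the name in the log, and how a "no such host" answer can be told apart from a timeout. It assumes Go 1.13 or later (for the IsNotFound field on net.DNSError). The helpers srvQueryName, isNotFound and loggedError are made up for illustration and are not part of jackal. The error is rebuilt by hand so the example runs without network access.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// srvQueryName builds the DNS name that net.LookupSRV asks for when given
// a non-empty service and proto: _service._proto.name.
func srvQueryName(service, proto, name string) string {
	return "_" + service + "._" + proto + "." + name
}

// isNotFound reports whether err is a DNS answer saying the name or record
// does not exist, as opposed to a timeout or an unreachable resolver.
func isNotFound(err error) bool {
	var dnsErr *net.DNSError
	return errors.As(err, &dnsErr) && dnsErr.IsNotFound
}

// loggedError rebuilds, for illustration only, an error shaped like the one
// in the log line quoted in this thread.
func loggedError() error {
	return &net.DNSError{
		Err:        "no such host",
		Name:       srvQueryName("xmpp-server", "tcp", "riotcat.org"),
		Server:     "9.9.9.9:53",
		IsNotFound: true,
	}
}

func main() {
	err := loggedError()
	fmt.Println(err)
	fmt.Println("not found:", isNotFound(err))

	// A timeout is a different kind of failure, where faster resolvers
	// or a retry could plausibly help.
	timeout := &net.DNSError{Err: "i/o timeout", Name: "example.org", IsTimeout: true}
	fmt.Println("not found:", isNotFound(timeout))
}
```

If the real lookup reports "not found" rather than a timeout, switching resolvers or raising file limits is unlikely to change the result, which would match what was observed above.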

Edit: the server I'm trying to connect to doesn't have a SRV record that's formatted correctly. (Or at all? Not super familiar with reading DNS.) I think the way some other servers/clients handle missing SRV records is trying the standard XMPP port to see if there's an XMPP client/server listening there.

Yes, you are right. Right now, if the SRV lookup fails, we do not fall back to the standard XMPP port...
https://github.com/ortuman/jackal/blob/master/s2s/dial.go#L31-L40

Thanks for reporting. I'll update the issue as soon as the patch is applied.
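
The missing fallback described above could look roughly like the following sketch. It is not the code from jackal's dial.go or from the eventual patch: resolveS2STargets, srvLookupFunc and the two stub lookups are hypothetical names, and the stubs stand in for net.LookupSRV so the example runs offline. The sketch also does not handle an SRV target of ".", which signals that the service is deliberately not offered.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
	"strings"
)

// defaultS2SPort is the standard XMPP server-to-server port, the same
// value as the s2s transport port in the config quoted earlier.
const defaultS2SPort = 5269

// srvLookupFunc has the same signature as net.LookupSRV, so a stub can be
// swapped in for testing.
type srvLookupFunc func(service, proto, name string) (string, []*net.SRV, error)

// resolveS2STargets (hypothetical helper) returns host:port candidates for
// a remote domain. If the SRV lookup fails or returns nothing, it falls
// back to the domain itself on the default port.
func resolveS2STargets(lookup srvLookupFunc, domain string) []string {
	_, addrs, err := lookup("xmpp-server", "tcp", domain)
	if err != nil || len(addrs) == 0 {
		return []string{net.JoinHostPort(domain, strconv.Itoa(defaultS2SPort))}
	}
	targets := make([]string, 0, len(addrs))
	for _, a := range addrs {
		host := strings.TrimSuffix(a.Target, ".") // drop the root dot
		targets = append(targets, net.JoinHostPort(host, strconv.Itoa(int(a.Port))))
	}
	return targets
}

// noSRVLookup simulates the failure from the log: no SRV record exists.
func noSRVLookup(service, proto, name string) (string, []*net.SRV, error) {
	return "", nil, &net.DNSError{
		Err:        "no such host",
		Name:       "_" + service + "._" + proto + "." + name,
		IsNotFound: true,
	}
}

// fixedSRVLookup simulates a domain that publishes a single SRV record.
func fixedSRVLookup(service, proto, name string) (string, []*net.SRV, error) {
	return "", []*net.SRV{{Target: "xmpp.example.org.", Port: 5270}}, nil
}

func main() {
	fmt.Println(resolveS2STargets(noSRVLookup, "riotcat.org"))    // [riotcat.org:5269]
	fmt.Println(resolveS2STargets(fixedSRVLookup, "example.org")) // [xmpp.example.org:5270]
	// In real use, pass net.LookupSRV instead of a stub.
}
```

Passing the lookup function as a parameter is only a convenience for testing; a real dialer would call net.LookupSRV (or its own resolver) directly.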

This commit should fix the issue...

ea74824

Available on the master branch. @waveletlet, could you please confirm it's working now? Thanks! 🙏

I also talked to the person running the server, and they told me they only set xmpps-client/xmpps-server records, to force secure connections as described in XEP-0368. It might be good to also try something like d.srvResolve("xmpps-server", "tcp", remoteDomain) for that case (which is probably the preferred case, anyway).
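
A hedged sketch of that idea: query both the xmpps-server (direct TLS, XEP-0368) and xmpp-server services, merge the results, and only fall back to the bare domain on port 5269 when neither returns records. The names s2sCandidates, candidate and the stub lookups are hypothetical and not jackal's API. Ordering here is simplified to a stable sort by priority, ignores weights across the two record sets, and does not handle a "." target.

```go
package main

import (
	"fmt"
	"net"
	"sort"
	"strconv"
	"strings"
)

// srvLookupFunc has the same signature as net.LookupSRV.
type srvLookupFunc func(service, proto, name string) (string, []*net.SRV, error)

// candidate (hypothetical type) is one address to try, plus whether TLS
// must be negotiated immediately on connect instead of via STARTTLS.
type candidate struct {
	Addr      string
	DirectTLS bool
	priority  uint16
}

// s2sCandidates merges xmpps-server and xmpp-server SRV results.
func s2sCandidates(lookup srvLookupFunc, domain string) []candidate {
	services := []struct {
		name string
		tls  bool
	}{
		{"xmpps-server", true},
		{"xmpp-server", false},
	}
	var out []candidate
	for _, svc := range services {
		_, addrs, err := lookup(svc.name, "tcp", domain)
		if err != nil {
			continue // treat a failed lookup as "no records for this service"
		}
		for _, a := range addrs {
			host := strings.TrimSuffix(a.Target, ".")
			out = append(out, candidate{
				Addr:      net.JoinHostPort(host, strconv.Itoa(int(a.Port))),
				DirectTLS: svc.tls,
				priority:  a.Priority,
			})
		}
	}
	if len(out) == 0 {
		// No SRV records at all: try the domain on the standard s2s port.
		return []candidate{{Addr: net.JoinHostPort(domain, "5269")}}
	}
	// Lower priority values are tried first; ties keep lookup order.
	sort.SliceStable(out, func(i, j int) bool { return out[i].priority < out[j].priority })
	return out
}

// onlyDirectTLS simulates a domain that publishes only an xmpps-server record.
func onlyDirectTLS(service, proto, name string) (string, []*net.SRV, error) {
	if service == "xmpps-server" {
		return "", []*net.SRV{{Target: "xmpp.example.org.", Port: 5270, Priority: 5}}, nil
	}
	return "", nil, &net.DNSError{Err: "no such host", Name: name, IsNotFound: true}
}

// noRecords simulates a domain with no SRV records for either service.
func noRecords(service, proto, name string) (string, []*net.SRV, error) {
	return "", nil, &net.DNSError{Err: "no such host", Name: name, IsNotFound: true}
}

func main() {
	fmt.Printf("%+v\n", s2sCandidates(onlyDirectTLS, "example.org"))
	fmt.Printf("%+v\n", s2sCandidates(noRecords, "riotcat.org"))
}
```

A dialer using this would wrap the connection in TLS right away for candidates with DirectTLS set, and use the usual STARTTLS negotiation for the rest.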

Looks like it works now!

Edit: Never mind the note about not seeing the warn message; I see it now. I think I just missed it somehow when I turned the server back on.