glidernet/ogn-rf

Protect ogn against dns attack

pyrog opened this issue · 19 comments

pyrog commented

After the 2016 Dyn cyberattack, the OGN network was down for several hours, even days.

ogn-decode uses the aprs.glidernet.org address to establish a connection with one of the OGN APRS servers.

If the DNS servers are down, this address cannot be resolved, with the following consequences:

  1. It's not possible to track gliders in real time (minor issue)
  2. glider positions are not received/logged locally, preventing any Search And Rescue (SAR) operations (big issue)

โš ๏ธ This problem is more general in fact: if a network failure occur, ogn-decode can't run.

See also Mirai (malware)

snip commented

OGN was not down.
Some isolated users were not able to view it.
Some isolated receivers were not able to send data to OGN servers.

pyrog commented

OK, that's correct: some 😉 (see the Wikipedia sources).

But the second issue is still there: ogn-decode can't run if there is no connection to an APRS server, whatever the real cause.

Please reopen this issue (and/or rename it if needed) 😄

snip commented

Yes, if there is no network connectivity, it is not possible to communicate information.

pyrog commented

@snip, yes this is a minor issue.

But it's not possible to receive and log any glider positions, which could be a major issue in case of SAR.
It also prevents other applications from running locally (map, flight log…)

pyrog commented

if there is no network connectivity, it is not possible to communicate information.

With this specific DNS issue, the IP connectivity is still working 😉
It can be worked around "temporarily" by editing the receiver configuration file and restarting the receiver:

APRS:
{ #Server = "aprs.glidernet.org:14580";   # choose one of the following IP addresses
  Server = "37.187.40.234:14580";         # glidern1.glidernet.org
# Server = "37.187.244.41:14580";         # glidern2.glidernet.org
# Server = "85.188.1.173:14580";          # glidern3.glidernet.org
} ;

PS: currently, aprs.glidernet.org resolves to one of the 3 OGN APRS servers:

name                     | value                    | TTL (s)
aprs.glidernet.org.      | aprs-pool.glidernet.org. | 86400
aprs-pool.glidernet.org. | 37.187.40.234            | 86400
aprs-pool.glidernet.org. | 37.187.244.41            | 86400
aprs-pool.glidernet.org. | 85.188.1.173             | 86400

As the time to live is 24 hours (86400 seconds), if one of the OGN servers is not reachable, the network stack may not connect to another server for up to one day (unless you reboot your receiver, or flush its DNS cache and restart the OGN receiver).

Good point. Assuming the DNS provider which hosts the domain is reliable, lowering the TTL to something well under an hour would make it a lot easier to get user data flowing to one of the other servers, should the one they were feeding to suffer a hardware failure or other prolonged outage :)

snip commented

So with a three-host round robin where one of the returned IPs is non-functional, any new user attempting to connect via the round-robin domain has a 1-in-3 chance of having it resolved to the non-functioning host...

Whereas any users who were connected to the non-functioning host at the time it went away, or had done so in the last 24 hours, would have the result of their original lookup of the round-robin record cached for the remainder of the TTL... i.e. up to 24 hours.

High TTLs are great for things which don't change (often), like DNS servers, or where there is application-level redundancy, such as MX records for email...

However, unless the TTL on the aprs.glidernet.org CNAME and aprs-pool.glidernet.org A records is lowered, then in the event of failure of one of the three APRS servers, 1/3 of the users are going to find themselves unable to feed data for up to 24 hours without manual intervention.

But perhaps that is desirable? ;)

snip commented

Each time the client tries to connect, it has a one-in-three chance of going to a given server. The 3 IPs are cached at the same time; there is no expiration to wait for before switching to another server.
So if one fails, on the next try it should be able to connect to another one.
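
To illustrate (a minimal sketch, not the actual ogn-decode code; the host and port are the OGN values quoted earlier in this thread): a single getaddrinfo() lookup returns all the pool addresses, and a client can simply try each one until a connection succeeds, with no need to wait for the TTL to expire.

/* sketch: resolve once, then try every returned A record in turn */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>

int connect_any(const char *host, const char *port)
{
    struct addrinfo hints, *res, *ai;
    int sock = -1;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family   = AF_INET;        /* the pool publishes A records */
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host, port, &hints, &res) != 0)
        return -1;                      /* DNS itself failed */

    /* one lookup returns all pool members; if the first is down,
       connect() fails and the loop moves on to the next address */
    for (ai = res; ai != NULL; ai = ai->ai_next) {
        sock = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (sock < 0) continue;
        if (connect(sock, ai->ai_addr, ai->ai_addrlen) == 0) break;
        close(sock);
        sock = -1;
    }
    freeaddrinfo(res);
    return sock;                        /* -1 if every server was unreachable */
}

int main(void)
{
    int s = connect_any("aprs.glidernet.org", "14580");
    printf(s >= 0 ? "connected\n" : "all servers unreachable\n");
    return 0;
}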

pyrog commented

I ran several tests:

  • clear the DNS cache
  • query the DNS server: dig aprs.glidernet.org
;; ANSWER SECTION:
aprs.glidernet.org.	86400	IN	A	85.188.1.173
aprs.glidernet.org.	86400	IN	A	37.187.244.41
aprs.glidernet.org.	86400	IN	A	37.187.40.234
  • query aprs.glidernet.org: curl aprs.glidernet.org:14501/status.json | head
	"server":	{
		"server_id":	"GLIDERN1",

When I repeat the curl command, I always get the response from glidern1.

If I flush the DNS cache, the result is the same…
I discovered that I must disconnect/reconnect the "router" to get the normal behavior (a 1-in-3 chance that the DNS returns the addresses in another order).

snip commented

@pyrog it is known that some STBs (set-top boxes, like some Livebox models) do not follow the RFC and change this behavior. This may be what you are experiencing. Can you confirm which STB you are using?

It all depends on how well the application was coded...

e.g. if it uses the getaddrinfo function, then all of the A records will be returned and a potentially different result can be used each time, which seems to be what is believed to happen in most/all fail-over situations.

However, if the code uses the archaic gethostbyname function, then anything other than the first matching record is typically ignored; hence if this happened to be the server that failed, then any code on that device which attempts to look up the same hostname using this function could be given the same result for as long as the TTL says it should be considered valid.
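
For comparison, this is the legacy pattern being described (a generic sketch, not necessarily what ogn-rf does; see the link below): gethostbyname() does fill in the full h_addr_list, but code written this way only ever reads the first entry, so a cached ordering that happens to put the dead server first pins the client to it for the lifetime of the cached record.

/* sketch of the gethostbyname() first-record-only pattern */
#include <netdb.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <string.h>

int legacy_lookup(const char *host, struct in_addr *out)
{
    struct hostent *he = gethostbyname(host);
    if (he == NULL || he->h_addrtype != AF_INET || he->h_addr_list[0] == NULL)
        return -1;
    memcpy(out, he->h_addr_list[0], sizeof(*out));  /* first A record only */
    return 0;
}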

Now to look at the code:

https://github.com/glidernet/ogn-rf/blob/master/socket.h#L85

So IMHO either the code needs to be fixed or a more pragmatic approach to TTLs must be adopted, perhaps following folks who mostly know what they are doing:

$ dig www.google.com

; <<>> DiG 9.10.1 <<>> www.google.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 52654
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;www.google.com.			IN	A

;; ANSWER SECTION:
www.google.com.		300	IN	A	172.217.26.68

;; Query time: 31 msec
;; SERVER: 127.0.1.1#53(127.0.1.1)
;; WHEN: Tue Feb 14 10:03:49 GMT 2017
;; MSG SIZE  rcvd: 48

$ dig www.github.com

; <<>> DiG 9.10.1 <<>> www.github.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 39105
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;www.github.com.			IN	A

;; ANSWER SECTION:
www.github.com.		3600	IN	CNAME	github.com.
github.com.		300	IN	A	192.30.253.113
github.com.		300	IN	A	192.30.253.112

;; Query time: 30 msec
;; SERVER: 127.0.1.1#53(127.0.1.1)
;; WHEN: Tue Feb 14 10:04:22 GMT 2017
;; MSG SIZE  rcvd: 78

$ dig www.level3.com

; <<>> DiG 9.10.1 <<>> www.level3.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 64256
;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;www.level3.com.			IN	A

;; ANSWER SECTION:
www.level3.com.		300	IN	CNAME	www.level3.com.c.footprint.net.
www.level3.com.c.footprint.net.	230 IN	A	205.128.82.126
www.level3.com.c.footprint.net.	230 IN	A	8.253.116.126
www.level3.com.c.footprint.net.	230 IN	A	208.178.29.254

;; Query time: 87 msec
;; SERVER: 127.0.1.1#53(127.0.1.1)
;; WHEN: Tue Feb 14 10:14:44 GMT 2017
;; MSG SIZE  rcvd: 124


pyrog commented

@snip It should be the latest Orange Livebox 😑.
But until the phone line is repaired, it is an Orange Air Box 😞

So IMHO either the code needs to be fixed

That would be fine if ogn-decode could manage the fallback itself 😄
But this doesn't solve the issue for other APRS-IS clients…

or a more pragmatic approach to TTL's must be adopted

That should work for every client 👍 if the IP stacks follow the standards?
As OGN clients/servers run on various OSes, hardware and networks, the actual fallback behavior is not reliable.

We could write test scripts to check the fallback behavior and/or verify that all FLARM/OGN packets are logged even if the network connection is unavailable.

snip commented

For me, ogn-decode is already working well with the round robin, except when an STB in the middle is not following the RFC.
Reducing the TTL is a bad solution.

pyrog commented

@snip OK, what do you suggest?

pyrog commented

After reading some articles, I understand that the round-robin DNS mechanism relies on a "short" TTL.
Example:

  • the receiver sends a packet to server 1
  • after a timeout (e.g. 1 minute), it retries
  • with a 3-minute DNS TTL, the receiver will retry at most 3 times against the same address
  • once the TTL has expired, the client will query the DNS again and, with luck, will get the IP address of server 2 or 3

If the TTL is set to 24 hours, the receiver will retry for at least one day before getting a chance to connect to another server 😞.

The APRS network uses this mechanism, but the TTL is shorter 😉

$ dig rotate.aprs2.net

;; ANSWER SECTION:
rotate.aprs2.net.	600	IN	A	205.209.228.93
rotate.aprs2.net.	600	IN	A	192.99.231.53
rotate.aprs2.net.	600	IN	A	71.14.221.33
rotate.aprs2.net.	600	IN	A	209.197.177.57
rotate.aprs2.net.	600	IN	A	54.64.6.78
rotate.aprs2.net.	600	IN	A	193.1.12.166
rotate.aprs2.net.	600	IN	A	185.5.97.128
rotate.aprs2.net.	600	IN	A	86.123.190.5

Does ogn-decode have an algorithm to manage a server loss (with a long TTL)?
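
Purely as a hypothetical sketch of such an algorithm (this is not ogn-decode's actual logic, and it reuses the connect_any() helper from the sketch earlier in this thread): re-resolving on every reconnect attempt means that, as long as the resolver rotates the records, a dead server only costs a few retries regardless of the TTL.

/* hypothetical reconnect policy: re-resolve and retry on every loss */
#include <unistd.h>

int connect_any(const char *host, const char *port);    /* from the earlier sketch */
static void feed_aprs(int sock) { (void)sock; /* APRS session would run here */ }

void run_forever(void)
{
    for (;;) {
        int sock = connect_any("aprs.glidernet.org", "14580");
        if (sock >= 0) {
            feed_aprs(sock);    /* returns when the connection drops */
            close(sock);
        }
        sleep(60);              /* back off before the next attempt */
    }
}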

snip commented

DNS round-robin does not rely on the TTL!
ogn-decode is working very well with this round robin, so I don't see the value of changing things.

It may well be 'working very well' when all three servers are up, but as highlighted, there are many configurations where, should one of these servers fail, up to 1/3 of users will be unable to feed data for anything from 0 seconds up to the remaining TTL of their last round-robin lookup...

Although it would be trivial to follow most of the major websites in using a considerably lower value, refusing to consider changing either this or the OGN code means accepting that not being able to feed for up to 24 hours is something which users should be expected to live with.

IMHO this is far from an effective failover strategy and something which is going to cause pain when, not if, one of the three servers has an issue... and I don't appear to be the only one with that sentiment ;)

snip commented

@Romeo-Golf, there is no TTL involved in DNS round robin.
The client knows all the IPs and decides which one to use.
You can try this to check: https://gist.github.com/snip/d87638bd4c3aaf2ca25d518b3986cf6a

Here is what I get when launching the command multiple times within a few seconds:

$ ./gethostbyname aprs.glidernet.org
aprs-pool.glidernet.org = 85.188.1.173 37.187.40.234 37.187.244.41
$ ./gethostbyname aprs.glidernet.org
aprs-pool.glidernet.org = 37.187.244.41 85.188.1.173 37.187.40.234
$ ./gethostbyname aprs.glidernet.org
aprs-pool.glidernet.org = 37.187.40.234 85.188.1.173 37.187.244.41
$ ./gethostbyname aprs.glidernet.org
aprs-pool.glidernet.org = 37.187.244.41 85.188.1.173 37.187.40.234
$ ./gethostbyname aprs.glidernet.org
aprs-pool.glidernet.org = 85.188.1.173 37.187.244.41 37.187.40.234
$

So we already have a good failover solution, which works well when we lose an APRS server.
I don't see any reason to add a more complex process.