subspacecommunity/subspace

DNS Issue, requires container restart every hour

BrokenRunner opened this issue · 13 comments

Host: Ubuntu 20.04
Reverse SSL Proxy: Apache

Describe the bug:
At random intervals, DNS stops resolving for clients over wireguard tunnel. Non-Domain-Joined Windows based clients also appear to not get the search domain from the host. Unix-based clients search domain and DNS work fine. All Windows Clients are using Wireguard application 0.1.1 (latest as at time of writing).
Using a local nameserver in my docker-compose.

Docker Compose:

version: "3.3"
services:
  subspace:
   image: subspacecommunity/subspace:latest
   container_name: subspace
   volumes:
    - /usr/bin/wg:/usr/bin/wg
    - /opt/subspace/data:/data
    - /lib/x86_64-linux-gnu/libc.so.6:/lib/x86_64-linux-gnu/libc.so.6:ro
    - /lib64/ld-linux-x86-64.so.2:/lib64/ld-linux-x86-64.so.2:ro
   restart: always
   environment:
    - SUBSPACE_HTTP_HOST=xxxxx.xxxxxxx.com
    - SUBSPACE_LETSENCRYPT=false
    - SUBSPACE_HTTP_INSECURE=true
    - SUBSPACE_HTTP_ADDR=":8333"
    - SUBSPACE_NAMESERVER=172.16.10.11
    - SUBSPACE_LISTENPORT=51820
    - SUBSPACE_IPV4_POOL=10.99.97.0/24
    - SUBSPACE_IPV6_POOL=fd00::10:97:0/64
    - SUBSPACE_IPV4_GW=10.99.97.1
    - SUBSPACE_IPV6_GW=fd00::10:97:1
    - SUBSPACE_IPV6_NAT_ENABLED=0
   cap_add:
    - NET_ADMIN
   network_mode: "host"

To Reproduce:
Steps to reproduce the behavior:
Connect a client, kill switch disabled, otherwise all standard settings.
Client will connect and work for a period of time without issue.
Non-Domain-Joined Windows clients are only able to resolve hosts by FQDN; the search domain doesn't work.
After a somewhat random period of time, DNS appears to stop resolving and clients lose access to internal resources. Pinging an internal IP continues to work, indicating the tunnel hasn't been dropped.
Some have complained of losing all local Internet access as well, despite Kill-switch being unticked in the client GUI. I haven't been able to confirm this.

To work around this issue, I have set up a cron script that runs hourly to restart the subspace container on the host. This has so far resolved the issue, but is not desirable in the long term (or really even in the short term).
This makes me feel like the issue is NOT related to the client GUI as I thought it might have been initially.
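As a possible refinement of the hourly restart, here is a minimal sketch that restarts the container only when a DNS probe actually goes unanswered. The function names, the `--run` switch and the probe name `example.com` are hypothetical, and it assumes dnsmasq answers on the gateway address from SUBSPACE_IPV4_GW (10.99.97.1) and that `dig` is installed on the host. It is a workaround, not a fix for the underlying problem.

```shell
#!/bin/bash
# Sketch only: restart the container when DNS stops answering, not every hour.

# Return 0 if the server at $1 sends any reply to a query for $2 within 2 seconds.
# dig exits non-zero when it gets no reply from the server; an NXDOMAIN answer
# still counts as a reply, so this tests the server, not the name.
dns_answers() {
    dig +time=2 +tries=1 "@$1" "$2" >/dev/null 2>&1
}

# Restart container $3 only if the probe against server $1 fails.
restart_if_dns_down() {
    local server="$1" name="$2" container="$3"
    if dns_answers "$server" "$name"; then
        return 0
    fi
    echo "$(date '+%F %T') no DNS reply from $server, restarting $container" >&2
    docker restart "$container"
}

# Only act when called with --run, e.g. from a per-minute cron entry.
if [ "${1:-}" = "--run" ]; then
    # Assumed values: SUBSPACE_IPV4_GW and container_name from the compose file above.
    restart_if_dns_down 10.99.97.1 example.com subspace
fi
```

Run from cron every minute or so, this would also leave a timestamp on stderr each time DNS dies, which might help narrow down what triggers it.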

Has anyone else experienced anything like this before?
All instructions have been followed as per install document, systemd-resolved has been removed, resolv.conf has been hard coded with local information due to Ubuntu constantly emptying it.

I should add that I can't see anything in the system logs or docker logs on the host.

Seeing the same issue. Are you using the existing dockerfile or did you update to latest alpine?

Any ideas appreciated as restarting is far from ideal.

I haven't changed anything; whatever subspacecommunity/subspace:latest pulls is what I'm using.

What I can say is that further testing shows MacOS also has the DNS search domain issue (it appears this works ONLY for Linux-based peers).
Enabling the "Exclude Private IP's" tickbox in the MacOS Wireguard client (v0.0.20191105) kills everything. No local DNS resolution, no access to remote resources, nothing. This likely isn't Subspace but the MacOS client instead.

How Can I test the Alpine version? Sorry noob question.

Try killing systemd-resolved on the host, you may have multiple DNS services listening on same interface :
sudo netstat -tanpu | grep LISTEN|grep 53 would show you listening services.

Any other service than dnsmasq from docker instance should be killed (disabled) with sudo systemctl stop <service> and then sudo systemctl disable <service>
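One caveat with the check above, as far as I can tell: UDP sockets never show a LISTEN state, so `grep LISTEN` hides them, and a bare `grep 53` also matches ports like 5353 or PIDs containing 53. A rough alternative using `ss` is sketched below; the helper name `port53_listeners` is made up for illustration, and it assumes the usual `ss -H -tulpn` column layout where the local address is the fifth field.

```shell
#!/bin/bash
# Sketch only: keep sockets whose local address ends in port 53 exactly.
# Reads `ss -H -tulpn` output on stdin (field 5 = Local Address:Port).
port53_listeners() {
    awk '$5 ~ /:53$/'
}

# Example (needs root to see process names):
#   sudo ss -H -tulpn | port53_listeners
```

Anything other than dnsmasq in that output, on either tcp or udp, would be a candidate for the stop/disable treatment described above.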

Hope it helps

Systemd-resolved was disabled as per setup instructions. DNS actually works for a while and then stops. Dnsmasq is still running but no responses come through.

Yes I can confirm the same, systemd-resolved is disabled and stopped.

This morning I started systemd-resolved, rebooted the server,
Stopped systemd-resolved, rebooted the server.
Disabled my Cron job rebooting the subspace docker container every hour.

Asked several staff to test for me (~10am). The first came back to me @ 12:45pm stating they had lost internet access (DNS stopped working is what they mean to say..).

Ran my restart-script the cron job calls:

#!/bin/bash
/usr/bin/docker restart subspace | /opt/timestamp.sh >>/opt/subspace/data/logs/cron.log

Checked with the staff member = they were working again instantly.
It's not the client, it's the container/image. Or should I say, it appears to be dnsmasq INSIDE the container. It has been this way since day one.

I think having the ability to remove dnsmasq and utilise something external to the container would be the best option, not sure how easy that is.

For completeness:

sudo netstat -tanpu | grep LISTEN|grep 53
tcp        0      0 0.0.0.0:53              0.0.0.0:*               LISTEN      505201/dnsmasq
tcp6       0      0 :::53                   :::*                    LISTEN      505201/dnsmasq

Just to be sure, your ubuntu is running on bare metal install ? Inside VM ?

I had issues with network settings behind a correctly configured NAT in the vbox hypervisor.
This led to exactly same symptoms : DNS stopped working after a while, with no obvious reasons.

Switching to bare metal solved multiple issues (i know it may seem obvious but i had to give it a try :) ) and my setup is now working like a charm. Modulo some rare DNS issues caused by unattended-upgrades waking up some unwanted services. My bad, though, nothing related to subspace or wireguard.

BTW, it could be worth trying to disable (temporarily) unattended-upgrades on your host ?
This is quite debian / ubuntu specific, and could be the cause of some ... weirdnesses

Apologies for the late reply,

Ubuntu is inside VMware ESXi 6.7, behind a Fortigate firewall. I don't really have the desire to move it to bare metal, nor do I think it should be necessary. I think having the ability to utilise DNS outside the container would provide a solution.

I have disabled unattended-upgrades.
The interesting thing is that after enabling and disabling systemd-resolved as above, I'm now able to utilise the search domain across clients.

Please see #144 - This appears to be the solution after preliminary testing.

Spoke too soon, issue remains.

Hi all, I've fixed this issue here
#144 (comment)

I recently set up a subspace server for a smaller group of people (~12) and hit this same issue. I can share what I found and changed to resolve it for myself, in the hope that it may help someone else. I set this up on a very small VPS since it uses practically no resources and, depending on our usage / traffic on the VPN, we would lose DNS resolution every couple of hours. I set up nodeexporter and found the following:

[Graph: udprmemerrs]

As the metrics show, I was out of RAM for UDP buffers and all UDP traffic was being dropped. Restarting the container clears all this up, but only temporarily. The default rmem is quite small (typically 212992 bytes, about 208 KiB, on Linux), so I raised it to 25 MiB and haven't had the issue return since.

net.core.rmem_max=26214400
net.core.rmem_default=26214400
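For anyone who wants to check whether they are hitting the same thing without installing an exporter, a rough sketch follows. It reads the UDP `RcvbufErrors` counter from `/proc/net/snmp`; the function name `udp_rcvbuf_errors` and the sysctl.d file name are only illustrative. If the counter keeps climbing while DNS is dead, the receive buffers are a likely suspect.

```shell
#!/bin/bash
# Sketch only: print the UDP RcvbufErrors counter from a /proc/net/snmp style file.
# The first "Udp:" line holds the column names, the second holds the values.
udp_rcvbuf_errors() {
    local file="${1:-/proc/net/snmp}"
    awk '$1 == "Udp:" {
        if (!col) { for (i = 2; i <= NF; i++) if ($i == "RcvbufErrors") col = i }
        else      { print $col }
    }' "$file"
}

# Applying the settings above at runtime (lost on reboot):
#   sudo sysctl -w net.core.rmem_max=26214400
#   sudo sysctl -w net.core.rmem_default=26214400
# To persist them, put the two lines in a file such as
# /etc/sysctl.d/99-udp-buffers.conf (hypothetical name) and run:
#   sudo sysctl --system
```

Calling `udp_rcvbuf_errors` twice a few minutes apart and comparing the numbers should be enough to see whether datagrams are being dropped.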

Hopefully this helps someone