wg-mesh

Primary language: Python. License: MIT.

Work in Progress

Idea

Software

  • python3
  • wireguard (VPN)
  • bird2 (Routing, OSPF)

Network

  • By default 10.0.0.0/16 is used.
  • 10.0.id.1 Node /30
  • 10.0.id.4-255 peers /31
  • 10.0.251.1-255 vxlan /32
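
The addressing scheme above can be illustrated with a short Python sketch. It only restates the ranges listed in this README; the function names are hypothetical, and how wg-mesh itself allocates the /31 peer subnets is not described here, so treat this as an assumption rather than the actual implementation.

```python
import ipaddress

# IDs 200 and higher are reserved for clients (stated in this README).
CLIENT_ID_START = 200


def node_interface(node_id: int) -> ipaddress.IPv4Interface:
    """Return the node's own address, 10.0.<id>.1/30."""
    # Assumption: the ID has to fit into the third octet.
    if not 0 <= node_id <= 255:
        raise ValueError("node ID must fit into one octet")
    return ipaddress.ip_interface(f"10.0.{node_id}.1/30")


def peer_subnets(node_id: int) -> list:
    """List every /31 subnet inside 10.0.<id>.4-255."""
    block = ipaddress.ip_network(f"10.0.{node_id}.0/24")
    return [
        net
        for net in block.subnets(new_prefix=31)
        # Skip .0-.3, which belong to the node's own /30.
        if int(net.network_address) & 0xFF >= 4
    ]


def is_client_id(node_id: int) -> bool:
    """Clients (ID 200 and higher) do not get meshed."""
    return node_id >= CLIENT_ID_START


iface = node_interface(1)
print(iface, iface.network)   # 10.0.1.1/30 10.0.1.0/30
print(peer_subnets(1)[0])     # 10.0.1.4/31
print(len(peer_subnets(1)))   # 126
```

With this layout a single node ID leaves room for at most 126 point-to-point /31 links.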

Features

  • automatic mesh buildup when node has joined
  • join nodes via cli
  • disconnect nodes via cli
  • VXLAN
  • Dualstack and/or Singlestack (Transport)
  • Dualstack (within the VPN Network)
  • Autostart Wireguard links on boot
  • Active Latency optimisation
  • Packet loss detection & rerouting
  • High Jitter detection & rerouting
  • Support for wgobfs (obfuscated wireguard)

Requirements

  • Debian or Ubuntu
  • Python 3.9 or higher
  • Kernel 5.4+ (wg kernel module, no user space support yet)

Keep in mind that some containers such as OVZ or LXC can have issues with bird and/or wireguard, depending on kernel version and host configuration.

Example 2 nodes
The ID needs to be unique, otherwise it will result in collisions.
Keep in mind that IDs 200 and higher are reserved for clients; they won't get meshed.
"public" is used to expose the API on all interfaces; by default it only listens locally on 10.0.id.1.

#Install wg-mesh and initialize the first node
curl -so- https://raw.githubusercontent.com/Ne00n/wg-mesh/master/install.sh | bash -s -- init 1 public
#Install wg-mesh and initialize the second node
curl -so- https://raw.githubusercontent.com/Ne00n/wg-mesh/master/install.sh | bash -s -- init 2

Grab the Token from Node1

wgmesh token

Connect Node2 to Node1

wgmesh connect http://<node1IP>:8080 <token>

After connecting successfully, a dummy.sh will be created, which assigns 10.0.nodeID.1/30 to lo.
This will be picked up by bird, so 10.0.1.1 and 10.0.2.1 should be reachable from both nodes once bird has run.
Regarding NAT, or firewalls in general: the "connector" is always the client, the endpoint is always the server.

Wireguard Port
If you would like to change the default wireguard port:

wgmesh set basePort 4000 && systemctl restart wgmesh
#or 0 for random
wgmesh set basePort 0 && systemctl restart wgmesh

Prevent meshing
In case you want to stop a client/server from automatically meshing into the network,
you can simply block it by creating an empty state.json.

wgmesh disable mesh && systemctl restart wgmesh

This needs to be done before you connect to the network.

Example 2+ nodes

#Install wg-mesh and initialize the first node
curl -so- https://raw.githubusercontent.com/Ne00n/wg-mesh/master/install.sh | bash -s -- init 1 public
#Install wg-mesh and initialize the second node
curl -so- https://raw.githubusercontent.com/Ne00n/wg-mesh/master/install.sh | bash -s -- init 2
#Install wg-mesh and initialize the third node
curl -so- https://raw.githubusercontent.com/Ne00n/wg-mesh/master/install.sh | bash -s -- init 3

Grab the Token from Node 1 with

wgmesh token

Connect Node 2 to Node 1

wgmesh connect http://<node1IP>:8080 <token>

Before you connect the 3rd node, make sure Node 2 has already fully connected.
Connect Node 3 to Node 1

wgmesh connect http://<node1IP>:8080 <token>

Wait for bird to pick up all routes and for the mesh to build up.
You can check it with

birdc show route
#and/or
cat /opt/wg-mesh/configs/state.json

All 3 nodes should be reachable under 10.0.nodeID.1
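
To check reachability from a script instead of by hand, a minimal sketch could look like the following. It assumes the Linux iputils `ping` is installed and that the node IDs are 1, 2 and 3 as in the example above; the helper names are made up for illustration and are not part of wg-mesh.

```python
import subprocess


def loopback_address(node_id: int) -> str:
    """Address each node assigns to lo, 10.0.<id>.1."""
    return f"10.0.{node_id}.1"


def is_reachable(address: str, timeout_s: int = 1) -> bool:
    """Send a single ICMP echo request via the system ping (iputils flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def check_mesh(node_ids, probe=is_reachable) -> dict:
    """Map every node address to True (reachable) or False."""
    return {loopback_address(n): probe(loopback_address(n)) for n in node_ids}


# Usage on any node of the mesh:
#   for address, ok in check_mesh([1, 2, 3]).items():
#       print(address, "reachable" if ok else "unreachable")
```

The `probe` parameter only exists so the check can be swapped out, for example in tests or on systems without ping.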

API
Currently the web service / API is exposed at ::8080 without TLS; use a reverse proxy if you need TLS.
Internal requests from 10.0.0.0/8 don't need a token (connectivity, connect and update).

  • /connectivity needs a valid token, otherwise the service will refuse to provide connectivity info
  • /connect needs a valid token, otherwise the service will refuse to set up a wg link
  • /update needs a valid wg public key and link name, otherwise the service will not update the wg link
  • /disconnect needs a valid wg public key and link name, otherwise the service will refuse to disconnect a specific link
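
The token rule described above can be sketched roughly as follows. This is not the wg-mesh code, just an illustration of the stated behaviour: the function and constant names are hypothetical, and only the two endpoints that the list above names as token-protected are modelled.

```python
import ipaddress

# Requests from this range count as internal (stated in this README).
INTERNAL_NET = ipaddress.ip_network("10.0.0.0/8")

# Endpoints the list above describes as token-protected.
TOKEN_ENDPOINTS = {"/connectivity", "/connect"}


def token_required(endpoint: str, source_ip: str) -> bool:
    """Return True if a request to endpoint from source_ip needs a token."""
    if endpoint not in TOKEN_ENDPOINTS:
        return False
    # Internal mesh addresses are exempt; everything else must send a token.
    return ipaddress.ip_address(source_ip) not in INTERNAL_NET


print(token_required("/connect", "203.0.113.7"))  # True
print(token_required("/connect", "10.0.2.1"))     # False
```

Requests to /update and /disconnect are identified by wg public key and link name instead, so they are left out of this sketch.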

Shutdown/Startup

wgmesh down
wgmesh up && systemctl restart wgmesh

Disconnect
To disconnect all links on a Node

wgmesh disconnect
#disconnect all links even if the API endpoint is unreachable
wgmesh disconnect force
#disconnect a specific link e.g. pipe250, pipe250v6
wgmesh disconnect pipe250
#disconnect a specific link with force
wgmesh disconnect pipe250 force

Removal

wgmesh down && bash /opt/wg-mesh/deinstall.sh

Updating

wgmesh update && wgmesh migrate && systemctl restart wgmesh && systemctl restart wgmesh-bird

wgobfs
Install wgobfs with

bash /opt/wg-mesh/tools/wgobfs.sh

To enable wgobfs connections, run

#add wgobfs to linkTypes
wgmesh enable wgobfs
#override the defaultLinkType if you want to prefer wgobfs over normal wg
wgmesh set defaultLinkType wgobfs
systemctl restart wgmesh

If the remote does not have wgobfs in linkTypes, the default will be used.

Limitations
Connecting multiple nodes at once, without waiting for the other node to finish, will result in double links.
By default, when a new node joins, it checks which connections it does not have, which for a new node would be everything.

Additionally, bird2 takes 30s by default to distribute the routes, so there will also be a delay.
To avoid this issue, wait roughly 60s in total between joins, depending on the network size.

Depending on network conditions, bird will be reloaded every 5 minutes, or as often as every 20 seconds.
This will drop long-lived TCP connections.

Known Issues

  • A client that does not have a direct connection to a newly added server is stuck with an old, outdated vxlan configuration.
    This can be "fixed" by reloading wgmesh-bird.

Troubleshooting

  • You can check the logs/
  • wg-mesh is very slow
    sudo requires a resolvable hostname
  • wg-mesh is not meshing
    bird2 needs to be running / hidepid can block said access to check if bird is running.
  • sudo is asking for authentication
    reinstall sudo, likely old config file (debian 10)
  • RTNETLINK answers: Address already in use
    Can also mean that the port wg tries to listen on is already in use. Check your existing wg links.
  • packet loss and/or higher latency inside the wg-mesh network but not on the uplink/network itself
    wireguard needs CPU time, check the load on the machine and check if you see any CPU steal.
    This will likely explain what you see, for example, on Smokeping. You can try to reduce the number of links to lower the CPU usage.
  • duplicate vxlan mac address / vxlan mac flapping
    If you are using virtual machines, check whether their machine-ids are the same.
    You can check it with tools/machine-id.py or with
cat /etc/machine-id

This can be easily fixed by running

rm -f /etc/machine-id && rm -f /var/lib/dbus/machine-id
dbus-uuidgen --ensure && systemd-machine-id-setup
reboot
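
If you have collected the machine-id of every node (for example by running the cat command above on each of them), a small sketch like this can point out the duplicates. It is independent of tools/machine-id.py, whose behaviour is not described here; the function name and the sample hostnames and IDs are made up for illustration.

```python
from collections import defaultdict


def find_duplicate_machine_ids(machine_ids: dict) -> dict:
    """Group hosts by machine-id and keep only IDs used by more than one host.

    machine_ids maps a hostname to the content of its /etc/machine-id.
    """
    hosts_by_id = defaultdict(list)
    for host, machine_id in machine_ids.items():
        # /etc/machine-id ends with a newline, so strip whitespace first.
        hosts_by_id[machine_id.strip()].append(host)
    return {
        machine_id: sorted(hosts)
        for machine_id, hosts in hosts_by_id.items()
        if len(hosts) > 1
    }


# Hypothetical sample data: node1 and node3 were cloned from the same image.
collected = {
    "node1": "0123456789abcdef0123456789abcdef\n",
    "node2": "fedcba9876543210fedcba9876543210\n",
    "node3": "0123456789abcdef0123456789abcdef\n",
}
print(find_duplicate_machine_ids(collected))
# {'0123456789abcdef0123456789abcdef': ['node1', 'node3']}
```

Every host listed under a duplicated ID, except one, needs its machine-id regenerated with the commands above.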