ehazlett/stellar

invalid CIDR

mabunixda opened this issue · 7 comments

Hi,

I tested stellar on 2 VMs before moving to bare metal. Everything was fine there, but now on bare metal I get the following error log:

$ stellar config --nic enp3s0 > stellar.config
$ stellar -D server --config ./stellar.config
DEBU[0000] seed peers seedPeers="[]"
DEBU[0000] getPeersFromCache: []
DEBU[0000] cluster peers peers="[]" seed_peers="[]"
INFO[0000] registered service id=stellar.services.version.v1
INFO[0000] registered service id=stellar.services.node.v1
INFO[0000] registered service id=stellar.services.health.v1
INFO[0000] registered service id=stellar.services.cluster.v1
INFO[0000] registered service id=stellar.services.datastore.v1
INFO[0000] registered service id=stellar.services.gateway.v1
INFO[0000] registered service id=stellar.services.network.v1
INFO[0000] registered service id=stellar.services.application.v1
INFO[0000] registered service id=stellar.services.nameserver.v1
INFO[0000] registered service id=stellar.services.proxy.v1
INFO[0000] registered service id=stellar.services.events.v1
DEBU[0000] starting server agent
DEBU[0000] starting grpc server addr="172.16.0.6:9000"
DEBU[0000] initializing server
DEBU[0000] network init
DEBU[0000] allocating network subnet for node y
DEBU[0000] service.network allocating subnet
FATA[0000] invalid CIDR address:

The config file looks like this:

{
"NodeID": "y",
"GRPCAddress": "172.16.0.6:9000",
"TLSServerCertificate": "",
"TLSServerKey": "",
"TLSClientCertificate": "",
"TLSClientKey": "",
"TLSInsecureSkipVerify": false,
"ContainerdAddr": "/run/containerd/containerd.sock",
"Namespace": "default",
"DataDir": "/var/lib/stellar",
"StateDir": "/run/stellar",
"Bridge": "stellar0",
"UpstreamDNSAddr": "8.8.8.8:53",
"ProxyHTTPPort": 80,
"ProxyHTTPSPort": 443,
"ProxyTLSEmail": "",
"GatewayAddress": "172.16.0.6:9001",
"EventsAddress": "172.16.0.6:4222",
"EventsClusterAddress": "172.16.0.6:5222",
"EventsHTTPAddress": "172.16.0.6:4322",
"CNIBinPaths": [
"/opt/containerd/bin",
"/opt/cni/bin"
],
"ConnectionType": "local",
"ClusterAddress": "172.16.0.6:7946",
"AdvertiseAddress": "172.16.0.6:7946",
"Debug": false,
"Peers": [],
"Subnet": "172.16.0.0/12",
"ProxyHealthcheckInterval": "5s"
}

I tried modifying the subnet and reviewed the code where the output comes from, but I cannot find the cause of this failure :-(

Thanks, Martin

Hmm ya everything looks OK in the config. What does ip a s show for your network devices?

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast master br0 state DOWN group default qlen 1000
link/ether 00:01:2e:78:2b:e2 brd ff:ff:ff:ff:ff:ff
3: enp3s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 00:01:2e:78:2b:e3 brd ff:ff:ff:ff:ff:ff
inet 172.16.0.6/24 brd 172.16.0.255 scope global enp3s0
valid_lft forever preferred_lft forever
inet6 fe80::201:2eff:fe78:2be3/64 scope link
valid_lft forever preferred_lft forever
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
link/ether 02:42:bc:c6:52:c8 brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
valid_lft forever preferred_lft forever

I'm trying to re-create this. In the meantime I've pushed some more debug logging around the subnet allocation. Can you try the latest master to see if it tells us any more? Thanks!

The new output is:

DEBU[0000] service.network allocating subnet
DEBU[0000] local subnet from datastore subnet="[60 110 105 108 62]"
FATA[0000] error parsing subnet "" (): invalid CIDR address:

OK, for some reason the subnet is <nil> in the db. I'm not sure how it received that value. I'm going to do some debugging and see if we can add checks to prevent erroneous routes from being assigned.
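
The kind of guard I have in mind would look roughly like this (a hypothetical helper sketching the idea, not the actual Stellar code): fall back to the configured subnet whenever the stored value is empty or unparseable, instead of handing it straight to `net.ParseCIDR`:

```go
package main

import (
	"fmt"
	"net"
)

// loadSubnet prefers a valid subnet from the datastore but falls back to
// the configured one when the stored value is missing, "<nil>", or fails
// to parse, so a bad db entry cannot crash the agent.
func loadSubnet(stored, configured string) (*net.IPNet, error) {
	if stored != "" && stored != "<nil>" {
		if _, subnet, err := net.ParseCIDR(stored); err == nil {
			return subnet, nil
		}
	}
	_, subnet, err := net.ParseCIDR(configured)
	return subnet, err
}

func main() {
	// A "<nil>" value from the datastore should not be fatal.
	subnet, err := loadSubnet("<nil>", "172.16.0.0/12")
	fmt.Println(subnet, err) // 172.16.0.0/12 <nil>
}
```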

With the latest debug information I found the root cause: a misconfigured datastore from an initial startup where I used subnet 192.168.0.0/24, because my local LAN is within 172.16.0.0/16. I removed the local datastore in /var/lib/stellar, and startup worked with the default setup but killed the routing on the box :-D
After switching to 192.168.0.0/16, stellar started and I was able to use sctl and also deploy the sample.

Must the subnet be a /16 range?

Thanks for the update! I was looking at a way to clean up the boltdb. This is a bug :)

First, we need to detect when the Subnet changes in the config; right now, if there is a subnet (good or bad) in the db, it will be used.

Second, there appears to be an issue with the subnet division: if a /24 is used, it does not calculate a valid route for some reason. I'm going to debug this and add tests for various subnets.

Thanks for the debug! It helps tremendously 👍