Cgroups not mounting
rboarman-sc opened this issue · 6 comments
Using your example, I was able to launch a Consul cluster (working fine) and a Nomad cluster which successfully connects to Consul.
However, two of the drivers, java and exec, fail to load with the error "Cgroup mount point unavailable".
Nomad client log file:
==> Loaded configuration from /opt/nomad/config/default.hcl
==> Starting Nomad agent...
==> Nomad agent configuration:
Advertise Addrs: HTTP: 172.31.21.117:4646
Bind Addrs: HTTP: 0.0.0.0:4646
Client: true
Log Level: DEBUG
Region: us-west-2 (DC: us-west-2b)
Server: false
Version: 0.9.4
==> Nomad agent started! Log data will stream in below:
2019-08-07T17:51:10.239Z [WARN ] agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/opt/nomad/data/plugins
2019-08-07T17:51:10.305Z [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/opt/nomad/data/plugins
2019-08-07T17:51:10.305Z [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/opt/nomad/data/plugins
2019-08-07T17:51:10.305Z [INFO ] agent: detected plugin: name=java type=driver plugin_version=0.1.0
2019-08-07T17:51:10.305Z [INFO ] agent: detected plugin: name=docker type=driver plugin_version=0.1.0
2019-08-07T17:51:10.305Z [INFO ] agent: detected plugin: name=rkt type=driver plugin_version=0.1.0
2019-08-07T17:51:10.305Z [INFO ] agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
2019-08-07T17:51:10.305Z [INFO ] agent: detected plugin: name=exec type=driver plugin_version=0.1.0
2019-08-07T17:51:10.305Z [INFO ] agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
2019-08-07T17:51:10.305Z [INFO ] agent: detected plugin: name=nvidia-gpu type=device plugin_version=0.1.0
2019-08-07T17:51:10.307Z [INFO ] client: using state directory: state_dir=/opt/nomad/data/client
2019-08-07T17:51:10.327Z [INFO ] client: using alloc directory: alloc_dir=/opt/nomad/data/alloc
2019-08-07T17:51:10.331Z [DEBUG] client.fingerprint_mgr: built-in fingerprints: fingerprinters="[arch cgroup consul cpu host memory network nomad signal storage vault env_gce env_aws]"
2019-08-07T17:51:10.333Z [DEBUG] client.fingerprint_mgr: fingerprinting periodically: fingerprinter=cgroup period=15s
2019-08-07T17:51:10.335Z [DEBUG] client.fingerprint_mgr.cpu: detected cpu frequency: MHz=2400
2019-08-07T17:51:10.335Z [DEBUG] client.fingerprint_mgr.cpu: detected core count: cores=1
2019-08-07T17:51:10.337Z [DEBUG] client.fingerprint_mgr: fingerprinting periodically: fingerprinter=consul period=15s
2019-08-07T17:51:10.348Z [WARN ] client.fingerprint_mgr.network: unable to parse speed: path=/sbin/ethtool device=eth0
2019-08-07T17:51:10.348Z [DEBUG] client.fingerprint_mgr.network: unable to read link speed: path=/sys/class/net/eth0/speed
2019-08-07T17:51:10.348Z [DEBUG] client.fingerprint_mgr.network: link speed could not be detected and no speed specified by user, falling back to default speed: mbits=1000
2019-08-07T17:51:10.348Z [DEBUG] client.fingerprint_mgr.network: detected interface IP: interface=eth0 IP=172.31.21.117
2019-08-07T17:51:10.355Z [DEBUG] client.fingerprint_mgr: fingerprinting periodically: fingerprinter=vault period=15s
2019-08-07T17:51:10.373Z [DEBUG] client.fingerprint_mgr.env_gce: could not read value for attribute: attribute=machine-type resp_code=404
2019-08-07T17:51:10.373Z [DEBUG] client.fingerprint_mgr: detected fingerprints: node_attrs="[arch cpu host network nomad signal storage env_aws]"
2019-08-07T17:51:10.373Z [INFO ] client.plugin: starting plugin manager: plugin-type=driver
2019-08-07T17:51:10.373Z [INFO ] client.plugin: starting plugin manager: plugin-type=device
2019-08-07T17:51:10.400Z [ERROR] client: error discovering nomad servers: error="client.consul: unable to query Consul datacenters: Get http://127.0.0.1:8500/v1/catalog/datacenters: dial tcp 127.0.0.1:8500: connect: connection refused"
2019-08-07T17:51:10.400Z [DEBUG] client.plugin: waiting on plugin manager initial fingerprint: plugin-type=driver
2019-08-07T17:51:10.400Z [DEBUG] client.plugin: waiting on plugin manager initial fingerprint: plugin-type=device
2019-08-07T17:51:10.400Z [DEBUG] client.plugin: finished plugin manager initial fingerprint: plugin-type=device
2019-08-07T17:51:10.400Z [DEBUG] client.driver_mgr: initial driver fingerprint: driver=raw_exec health=healthy description=Healthy
2019-08-07T17:51:10.407Z [DEBUG] client.driver_mgr: initial driver fingerprint: driver=exec health=unhealthy description="Cgroup mount point unavailable"
2019-08-07T17:51:10.407Z [DEBUG] client.driver_mgr: initial driver fingerprint: driver=qemu health=undetected description=
2019-08-07T17:51:10.407Z [DEBUG] client.driver_mgr: initial driver fingerprint: driver=java health=unhealthy description="Cgroup mount point unavailable"
2019-08-07T17:51:10.411Z [DEBUG] client.driver_mgr.docker: could not connect to docker daemon: driver=docker endpoint=unix:///var/run/docker.sock error="Get http://unix.sock/version: dial unix /var/run/docker.sock: connect: no such file or directory"
2019-08-07T17:51:10.411Z [DEBUG] client.driver_mgr: initial driver fingerprint: driver=docker health=undetected description="Failed to connect to docker daemon"
2019-08-07T17:51:10.411Z [DEBUG] client.driver_mgr: initial driver fingerprint: driver=rkt health=undetected description="Failed to execute rkt version: exec: "rkt": executable file not found in $PATH"
2019-08-07T17:51:10.411Z [DEBUG] client.driver_mgr: detected drivers: drivers="map[undetected:[qemu docker rkt] healthy:[raw_exec] unhealthy:[exec java]]"
2019-08-07T17:51:10.411Z [DEBUG] client.plugin: finished plugin manager initial fingerprint: plugin-type=driver
2019-08-07T17:51:10.411Z [INFO ] client: started client: node_id=7b3d2591-71fa-9d92-d949-2a748099420b
2019-08-07T17:51:10.414Z [WARN ] client.server_mgr: no servers available
2019-08-07T17:51:10.414Z [DEBUG] client: registration waiting on servers
2019-08-07T17:51:10.414Z [WARN ] client.server_mgr: no servers available
2019-08-07T17:51:10.415Z [ERROR] client: error discovering nomad servers: error="client.consul: unable to query Consul datacenters: Get http://127.0.0.1:8500/v1/catalog/datacenters: dial tcp 127.0.0.1:8500: connect: connection refused"
2019-08-07T17:51:13.468Z [ERROR] http: request failed: method=GET path=/v1/agent/health?type=client error="{"client":{"ok":false,"message":"no known servers"}}" code=500
2019-08-07T17:51:13.468Z [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=568.093µs
2019-08-07T17:51:23.755Z [ERROR] http: request failed: method=GET path=/v1/agent/health?type=client error="{"client":{"ok":false,"message":"no known servers"}}" code=500
2019-08-07T17:51:23.755Z [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=140.767µs
2019-08-07T17:51:25.625Z [INFO ] client.fingerprint_mgr.consul: consul agent is available
2019-08-07T17:51:30.745Z [WARN ] client.server_mgr: no servers available
2019-08-07T17:51:30.745Z [DEBUG] client: registration waiting on servers
2019-08-07T17:51:30.747Z [DEBUG] client.consul: bootstrap contacting Consul DCs: consul_dcs=[us-west-2]
2019-08-07T17:51:30.765Z [INFO ] client.consul: discovered following servers: servers=172.31.13.97:4647
2019-08-07T17:51:30.765Z [DEBUG] client.server_mgr: new server list: new_servers=172.31.13.97:4647 old_servers=
2019-08-07T17:51:30.777Z [DEBUG] client: updated allocations: index=1 total=0 pulled=0 filtered=0
2019-08-07T17:51:30.778Z [DEBUG] client: allocation updates: added=0 removed=0 updated=0 ignored=0
2019-08-07T17:51:30.778Z [DEBUG] client: allocation updates applied: added=0 removed=0 updated=0 ignored=0 errors=0
2019-08-07T17:51:30.781Z [INFO ] client: node registration complete
2019-08-07T17:51:33.756Z [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=270.264µs
2019-08-07T17:51:35.753Z [DEBUG] client: state updated: node_status=ready
2019-08-07T17:51:38.116Z [DEBUG] client: state changed, updating node and re-registering
2019-08-07T17:51:38.121Z [INFO ] client: node registration complete
2019-08-07T17:51:43.757Z [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=159.506µs
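For anyone skimming a log like the one above, the driver status can be pulled out of the "initial driver fingerprint" lines with a few lines of Python. This is only a sketch that assumes the log format shown here (Nomad 0.9.4 at DEBUG level); the function names `driver_health` and `unhealthy_drivers` are made up for illustration and are not part of Nomad.

```python
import re

# Matches lines such as:
#   ... initial driver fingerprint: driver=exec health=unhealthy description="Cgroup mount point unavailable"
# The pattern is an assumption based on the log lines in this issue.
FINGERPRINT_RE = re.compile(
    r"initial driver fingerprint: "
    r"driver=(?P<driver>\S+) health=(?P<health>\S+) description=(?P<description>.*)$"
)


def driver_health(log_text):
    """Return {driver: (health, description)} parsed from Nomad client log text."""
    result = {}
    for line in log_text.splitlines():
        match = FINGERPRINT_RE.search(line)
        if not match:
            continue
        description = match.group("description").strip()
        # Drop one pair of surrounding double quotes, if present.
        if len(description) >= 2 and description[0] == '"' and description[-1] == '"':
            description = description[1:-1]
        result[match.group("driver")] = (match.group("health"), description)
    return result


def unhealthy_drivers(log_text):
    """Return the sorted names of drivers whose health is reported as 'unhealthy'."""
    return sorted(
        name
        for name, (health, _) in driver_health(log_text).items()
        if health == "unhealthy"
    )
```

Fed the log above, `unhealthy_drivers` should return exec and java, which is what the "detected drivers" line also reports.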
The configuration is directly from your example code except I set the number of servers and clients to one.
Please advise.
What OS? What version of Nomad?
Sorry, I should have included that.
Amazon Linux 2: Linux 4.14.133-88.112.amzn1.x86_64 x86_64
Nomad: 0.9.4
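Since the error is about a missing cgroup mount point, one quick manual check on the client instance is to see whether any cgroup filesystem is mounted at all. The sketch below parses `/proc/mounts`-style text; it is not Nomad's own detection logic, and the function names are hypothetical.

```python
def cgroup_mounts(mounts_text):
    """Return (mount_point, fs_type) pairs for cgroup filesystems.

    Expects text in the /proc/mounts layout:
    device, mount point, filesystem type, options, dump, pass.
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] in ("cgroup", "cgroup2"):
            found.append((fields[1], fields[2]))
    return found


def read_cgroup_mounts(path="/proc/mounts"):
    """Read the given mounts file (Linux only) and return its cgroup mounts."""
    with open(path) as handle:
        return cgroup_mounts(handle.read())
```

Calling `read_cgroup_mounts()` on the Nomad client and getting an empty list would be consistent with the "Cgroup mount point unavailable" message, though it would not by itself explain why the mounts are missing.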
I'll have a look at that now! Sorry for the delay :)
To make this easier to debug, can you confirm which example you followed: the root example, where the Consul servers and the Nomad servers are co-located, or the one with three separate clusters?
2019-08-07T17:51:10.400Z [ERROR] client: error discovering nomad servers: error="client.consul: unable to query Consul datacenters: Get http://127.0.0.1:8500/v1/catalog/datacenters: dial tcp 127.0.0.1:8500: connect: connection refused"
This line is interesting. It looks like the Nomad client cannot reach the local Consul agent at 127.0.0.1:8500, which it queries to find out where the Nomad servers are located.
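A rough way to test that from the client instance is to check whether anything accepts TCP connections on 127.0.0.1:8500, retrying a few times in case the Consul agent is still starting (the log above does report "consul agent is available" about 15 seconds after the first failure). This is a minimal sketch; `port_open` and `wait_for_port` are illustrative names, and the retry counts are arbitrary.

```python
import socket
import time


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts, and unreachable hosts.
        return False


def wait_for_port(host, port, attempts=5, delay=1.0):
    """Try up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if port_open(host, port):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

For example, `wait_for_port("127.0.0.1", 8500, attempts=30)` returning True only after several tries would suggest the Nomad agent simply started before the Consul agent was listening.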