hashicorp/terraform-aws-nomad

Cgroups not mounting

rboarman-sc opened this issue · 6 comments

Using your example, I was able to launch a Consul cluster (working fine) and a Nomad cluster which successfully connects to Consul.

However, two of the drivers, java and exec, fail to load with the error "Cgroup mount point unavailable".
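One way to confirm the symptom on the client instance is to list the cgroup filesystems in the kernel mount table. The sketch below is only an illustration. It assumes the error corresponds to no `cgroup` or `cgroup2` entries appearing in `/proc/mounts`, which this thread does not confirm. `cgroup_mounts` and `read_host_mounts` are hypothetical helper names.

```python
from pathlib import Path
from typing import List, Tuple


def cgroup_mounts(mounts_text: str) -> List[Tuple[str, str]]:
    """Return (mount_point, fs_type) pairs for cgroup filesystems.

    Expects text in /proc/mounts format, one mount per line:
    device mount_point fs_type options dump pass
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # The third field is the filesystem type.
        if len(fields) >= 3 and fields[2] in ("cgroup", "cgroup2"):
            found.append((fields[1], fields[2]))
    return found


def read_host_mounts() -> str:
    """Read the live mount table (Linux only; hypothetical helper)."""
    return Path("/proc/mounts").read_text()
```

On the affected instance, `cgroup_mounts(read_host_mounts())` returning an empty list would suggest that no cgroup hierarchy is mounted at all, which narrows the problem to the AMI or boot configuration rather than Nomad itself.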

Nomad client log file:

==> Loaded configuration from /opt/nomad/config/default.hcl
==> Starting Nomad agent...
==> Nomad agent configuration:

       Advertise Addrs: HTTP: 172.31.21.117:4646
            Bind Addrs: HTTP: 0.0.0.0:4646
                Client: true
             Log Level: DEBUG
                Region: us-west-2 (DC: us-west-2b)
                Server: false
               Version: 0.9.4

==> Nomad agent started! Log data will stream in below:

    2019-08-07T17:51:10.239Z [WARN ] agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/opt/nomad/data/plugins
    2019-08-07T17:51:10.305Z [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/opt/nomad/data/plugins
    2019-08-07T17:51:10.305Z [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/opt/nomad/data/plugins
    2019-08-07T17:51:10.305Z [INFO ] agent: detected plugin: name=java type=driver plugin_version=0.1.0
    2019-08-07T17:51:10.305Z [INFO ] agent: detected plugin: name=docker type=driver plugin_version=0.1.0
    2019-08-07T17:51:10.305Z [INFO ] agent: detected plugin: name=rkt type=driver plugin_version=0.1.0
    2019-08-07T17:51:10.305Z [INFO ] agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
    2019-08-07T17:51:10.305Z [INFO ] agent: detected plugin: name=exec type=driver plugin_version=0.1.0
    2019-08-07T17:51:10.305Z [INFO ] agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
    2019-08-07T17:51:10.305Z [INFO ] agent: detected plugin: name=nvidia-gpu type=device plugin_version=0.1.0
    2019-08-07T17:51:10.307Z [INFO ] client: using state directory: state_dir=/opt/nomad/data/client
    2019-08-07T17:51:10.327Z [INFO ] client: using alloc directory: alloc_dir=/opt/nomad/data/alloc
    2019-08-07T17:51:10.331Z [DEBUG] client.fingerprint_mgr: built-in fingerprints: fingerprinters="[arch cgroup consul cpu host memory network nomad signal storage vault env_gce env_aws]"
    2019-08-07T17:51:10.333Z [DEBUG] client.fingerprint_mgr: fingerprinting periodically: fingerprinter=cgroup period=15s
    2019-08-07T17:51:10.335Z [DEBUG] client.fingerprint_mgr.cpu: detected cpu frequency: MHz=2400
    2019-08-07T17:51:10.335Z [DEBUG] client.fingerprint_mgr.cpu: detected core count: cores=1
    2019-08-07T17:51:10.337Z [DEBUG] client.fingerprint_mgr: fingerprinting periodically: fingerprinter=consul period=15s
    2019-08-07T17:51:10.348Z [WARN ] client.fingerprint_mgr.network: unable to parse speed: path=/sbin/ethtool device=eth0
    2019-08-07T17:51:10.348Z [DEBUG] client.fingerprint_mgr.network: unable to read link speed: path=/sys/class/net/eth0/speed
    2019-08-07T17:51:10.348Z [DEBUG] client.fingerprint_mgr.network: link speed could not be detected and no speed specified by user, falling back to default speed: mbits=1000
    2019-08-07T17:51:10.348Z [DEBUG] client.fingerprint_mgr.network: detected interface IP: interface=eth0 IP=172.31.21.117
    2019-08-07T17:51:10.355Z [DEBUG] client.fingerprint_mgr: fingerprinting periodically: fingerprinter=vault period=15s
    2019-08-07T17:51:10.373Z [DEBUG] client.fingerprint_mgr.env_gce: could not read value for attribute: attribute=machine-type resp_code=404
    2019-08-07T17:51:10.373Z [DEBUG] client.fingerprint_mgr: detected fingerprints: node_attrs="[arch cpu host network nomad signal storage env_aws]"
    2019-08-07T17:51:10.373Z [INFO ] client.plugin: starting plugin manager: plugin-type=driver
    2019-08-07T17:51:10.373Z [INFO ] client.plugin: starting plugin manager: plugin-type=device
    2019-08-07T17:51:10.400Z [ERROR] client: error discovering nomad servers: error="client.consul: unable to query Consul datacenters: Get http://127.0.0.1:8500/v1/catalog/datacenters: dial tcp 127.0.0.1:8500: connect: connection refused"
    2019-08-07T17:51:10.400Z [DEBUG] client.plugin: waiting on plugin manager initial fingerprint: plugin-type=driver
    2019-08-07T17:51:10.400Z [DEBUG] client.plugin: waiting on plugin manager initial fingerprint: plugin-type=device
    2019-08-07T17:51:10.400Z [DEBUG] client.plugin: finished plugin manager initial fingerprint: plugin-type=device
    2019-08-07T17:51:10.400Z [DEBUG] client.driver_mgr: initial driver fingerprint: driver=raw_exec health=healthy description=Healthy
    2019-08-07T17:51:10.407Z [DEBUG] client.driver_mgr: initial driver fingerprint: driver=exec health=unhealthy description="Cgroup mount point unavailable"
    2019-08-07T17:51:10.407Z [DEBUG] client.driver_mgr: initial driver fingerprint: driver=qemu health=undetected description=
    2019-08-07T17:51:10.407Z [DEBUG] client.driver_mgr: initial driver fingerprint: driver=java health=unhealthy description="Cgroup mount point unavailable"
    2019-08-07T17:51:10.411Z [DEBUG] client.driver_mgr.docker: could not connect to docker daemon: driver=docker endpoint=unix:///var/run/docker.sock error="Get http://unix.sock/version: dial unix /var/run/docker.sock: connect: no such file or directory"
    2019-08-07T17:51:10.411Z [DEBUG] client.driver_mgr: initial driver fingerprint: driver=docker health=undetected description="Failed to connect to docker daemon"
    2019-08-07T17:51:10.411Z [DEBUG] client.driver_mgr: initial driver fingerprint: driver=rkt health=undetected description="Failed to execute rkt version: exec: "rkt": executable file not found in $PATH"
    2019-08-07T17:51:10.411Z [DEBUG] client.driver_mgr: detected drivers: drivers="map[undetected:[qemu docker rkt] healthy:[raw_exec] unhealthy:[exec java]]"
    2019-08-07T17:51:10.411Z [DEBUG] client.plugin: finished plugin manager initial fingerprint: plugin-type=driver
    2019-08-07T17:51:10.411Z [INFO ] client: started client: node_id=7b3d2591-71fa-9d92-d949-2a748099420b
    2019-08-07T17:51:10.414Z [WARN ] client.server_mgr: no servers available
    2019-08-07T17:51:10.414Z [DEBUG] client: registration waiting on servers
    2019-08-07T17:51:10.414Z [WARN ] client.server_mgr: no servers available
    2019-08-07T17:51:10.415Z [ERROR] client: error discovering nomad servers: error="client.consul: unable to query Consul datacenters: Get http://127.0.0.1:8500/v1/catalog/datacenters: dial tcp 127.0.0.1:8500: connect: connection refused"
    2019-08-07T17:51:13.468Z [ERROR] http: request failed: method=GET path=/v1/agent/health?type=client error="{"client":{"ok":false,"message":"no known servers"}}" code=500
    2019-08-07T17:51:13.468Z [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=568.093µs
    2019-08-07T17:51:23.755Z [ERROR] http: request failed: method=GET path=/v1/agent/health?type=client error="{"client":{"ok":false,"message":"no known servers"}}" code=500
    2019-08-07T17:51:23.755Z [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=140.767µs
    2019-08-07T17:51:25.625Z [INFO ] client.fingerprint_mgr.consul: consul agent is available
    2019-08-07T17:51:30.745Z [WARN ] client.server_mgr: no servers available
    2019-08-07T17:51:30.745Z [DEBUG] client: registration waiting on servers
    2019-08-07T17:51:30.747Z [DEBUG] client.consul: bootstrap contacting Consul DCs: consul_dcs=[us-west-2]
    2019-08-07T17:51:30.765Z [INFO ] client.consul: discovered following servers: servers=172.31.13.97:4647
    2019-08-07T17:51:30.765Z [DEBUG] client.server_mgr: new server list: new_servers=172.31.13.97:4647 old_servers=
    2019-08-07T17:51:30.777Z [DEBUG] client: updated allocations: index=1 total=0 pulled=0 filtered=0
    2019-08-07T17:51:30.778Z [DEBUG] client: allocation updates: added=0 removed=0 updated=0 ignored=0
    2019-08-07T17:51:30.778Z [DEBUG] client: allocation updates applied: added=0 removed=0 updated=0 ignored=0 errors=0
    2019-08-07T17:51:30.781Z [INFO ] client: node registration complete
    2019-08-07T17:51:33.756Z [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=270.264µs
    2019-08-07T17:51:35.753Z [DEBUG] client: state updated: node_status=ready
    2019-08-07T17:51:38.116Z [DEBUG] client: state changed, updating node and re-registering
    2019-08-07T17:51:38.121Z [INFO ] client: node registration complete
    2019-08-07T17:51:43.757Z [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=159.506µs
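
The driver states are spread across many DEBUG lines in the log above. As a rough aid for reading it, the following sketch pulls out each `initial driver fingerprint` entry. It assumes the `driver=... health=... description=...` layout seen in this particular log, and `driver_health` is a hypothetical helper, not part of Nomad.

```python
import re
from typing import Dict, Tuple

# Matches lines such as:
#   ... initial driver fingerprint: driver=exec health=unhealthy description="..."
FINGERPRINT_RE = re.compile(
    r"initial driver fingerprint: "
    r"driver=(?P<driver>\S+) health=(?P<health>\S+) description=(?P<desc>.*)$"
)


def driver_health(log_text: str) -> Dict[str, Tuple[str, str]]:
    """Map each driver name to its (health, description) from the log text."""
    result = {}
    for line in log_text.splitlines():
        match = FINGERPRINT_RE.search(line)
        if match:
            # Drop surrounding whitespace and the outer quotes, if any.
            description = match.group("desc").strip().strip('"')
            result[match.group("driver")] = (match.group("health"), description)
    return result
```

Applied to the log above, this would group exec and java under the same "Cgroup mount point unavailable" description, separate from the drivers that are simply undetected.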

The configuration is taken directly from your example code, except that I set the number of servers and clients to one.

Please advise.

What OS? What version of Nomad?

Sorry, I should have included that.

Amazon Linux 2: Linux 4.14.133-88.112.amzn1.x86_64 x86_64
Nomad: 0.9.4

@Etiene Any chance you could look into this one?

@Etiene @brikis98 Any word on this? Thanks!

I'll have a look at that now! Sorry for the delay :)

Just to confirm, so it is easier for me to help you debug this: which example did you follow? The root example, where the Consul servers and the Nomad servers are co-located, or the one where you have 3 separate clusters?

2019-08-07T17:51:10.400Z [ERROR] client: error discovering nomad servers: error="client.consul: unable to query Consul datacenters: Get http://127.0.0.1:8500/v1/catalog/datacenters: dial tcp 127.0.0.1:8500: connect: connection refused"

This line is interesting... It looks like the Nomad client is failing to reach the local Consul agent at localhost port 8500, which it queries to find out where the Nomad servers are located.
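
A quick way to test that from the client instance is a plain TCP connection attempt against the same address. This is a minimal sketch, assuming the local Consul agent listens on `127.0.0.1:8500` as in the error message. `tcp_port_open` is a hypothetical helper name. Note that later lines in the same log report `consul agent is available` at 17:51:25 and a discovered server at 17:51:30, so the refusal may only reflect Nomad starting before the Consul agent was ready.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

Running `tcp_port_open("127.0.0.1", 8500)` a few times after boot would show whether the port stays closed or only takes a few seconds to come up.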