dcos/dcos-e2e

returned non-zero exit status 3

wiwengweng opened this issue · 4 comments

I am using miniDCOS to install and test dcos but without successful.

env:
ubuntu16.04
docker: Docker version 18.09.7, build 2d0083d
Python 3.6
minidcos, version 2019.06.19.0

(base) root@bogon:~/dcos-venv/bin# ./minidcos docker wait --transport docker-exec -vvv
and get the following output:

DEBUG:dcos_e2e.node:Running command `/bin/sh -c systemctl status sshd.socket || systemctl status sshd` on a node `Node(public_ip=172.17.0.5, private_ip=172.17.0.5)`
DEBUG:dcos_e2e._subprocess_tools:● sshd.socket - OpenSSH Server Socket
DEBUG:dcos_e2e._subprocess_tools:   Loaded: loaded (/usr/lib/systemd/system/sshd.socket; disabled; vendor preset: disabled)
DEBUG:dcos_e2e._subprocess_tools:   Active: inactive (dead)
DEBUG:dcos_e2e._subprocess_tools:     Docs: man:sshd(8)
DEBUG:dcos_e2e._subprocess_tools:           man:sshd_config(5)
DEBUG:dcos_e2e._subprocess_tools:   Listen: 0.0.0.0:22 (Stream)
DEBUG:dcos_e2e._subprocess_tools: Accepted: 0; Connected: 0
DEBUG:dcos_e2e._subprocess_tools:● sshd.service - OpenSSH server daemon
DEBUG:dcos_e2e._subprocess_tools:   Loaded: loaded (/usr/lib/systemd/system/sshd.service; disabled; vendor preset: enabled)
DEBUG:dcos_e2e._subprocess_tools:   Active: inactive (dead)
DEBUG:dcos_e2e._subprocess_tools:     Docs: man:sshd(8)
DEBUG:dcos_e2e._subprocess_tools:           man:sshd_config(5)
DEBUG:dcos_e2e._subprocess_tools:
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:48:48 dcos-e2e-default-07034-public-agent-0 systemd[1]: Starting OpenSSH server daemon...
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:48:48 dcos-e2e-default-07034-public-agent-0 sshd[243]: Server listening on 0.0.0.0 port 22.
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:48:48 dcos-e2e-default-07034-public-agent-0 sshd[243]: Server listening on :: port 22.
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:48:48 dcos-e2e-default-07034-public-agent-0 systemd[1]: Started OpenSSH server daemon.
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:56:32 dcos-e2e-default-07034-public-agent-0 systemd[1]: Stopping OpenSSH server daemon...
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:56:32 dcos-e2e-default-07034-public-agent-0 systemd[1]: Stopped OpenSSH server daemon.
WARNING:retry.api:Command '['docker', 'exec', '--user', 'root', '--interactive', '3d603f5136ab808bcb4953e997a4128f71aaf1092dc355fc3592c746e3906afc', '/bin/sh', '-c', 'systemctl status sshd.socket || systemctl status sshd']' returned non-zero exit status 3., retrying in 1 seconds...
DEBUG:dcos_e2e.node:Running command `/bin/sh -c systemctl status sshd.socket || systemctl status sshd` on a node `Node(public_ip=172.17.0.5, private_ip=172.17.0.5)`
DEBUG:dcos_e2e._subprocess_tools:● sshd.socket - OpenSSH Server Socket
DEBUG:dcos_e2e._subprocess_tools:   Loaded: loaded (/usr/lib/systemd/system/sshd.socket; disabled; vendor preset: disabled)
DEBUG:dcos_e2e._subprocess_tools:   Active: inactive (dead)
DEBUG:dcos_e2e._subprocess_tools:     Docs: man:sshd(8)
DEBUG:dcos_e2e._subprocess_tools:           man:sshd_config(5)
DEBUG:dcos_e2e._subprocess_tools:   Listen: 0.0.0.0:22 (Stream)
DEBUG:dcos_e2e._subprocess_tools: Accepted: 0; Connected: 0
DEBUG:dcos_e2e._subprocess_tools:● sshd.service - OpenSSH server daemon
DEBUG:dcos_e2e._subprocess_tools:   Loaded: loaded (/usr/lib/systemd/system/sshd.service; disabled; vendor preset: enabled)
DEBUG:dcos_e2e._subprocess_tools:   Active: inactive (dead)
DEBUG:dcos_e2e._subprocess_tools:     Docs: man:sshd(8)
DEBUG:dcos_e2e._subprocess_tools:           man:sshd_config(5)
DEBUG:dcos_e2e._subprocess_tools:
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:48:48 dcos-e2e-default-07034-public-agent-0 systemd[1]: Starting OpenSSH server daemon...
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:48:48 dcos-e2e-default-07034-public-agent-0 sshd[243]: Server listening on 0.0.0.0 port 22.
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:48:48 dcos-e2e-default-07034-public-agent-0 sshd[243]: Server listening on :: port 22.
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:48:48 dcos-e2e-default-07034-public-agent-0 systemd[1]: Started OpenSSH server daemon.
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:56:32 dcos-e2e-default-07034-public-agent-0 systemd[1]: Stopping OpenSSH server daemon...
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:56:32 dcos-e2e-default-07034-public-agent-0 systemd[1]: Stopped OpenSSH server daemon.
WARNING:retry.api:Command '['docker', 'exec', '--user', 'root', '--interactive', '3d603f5136ab808bcb4953e997a4128f71aaf1092dc355fc3592c746e3906afc', '/bin/sh', '-c', 'systemctl status sshd.socket || systemctl status sshd']' returned non-zero exit status 3., retrying in 1 seconds...
DEBUG:dcos_e2e.node:Running command `/bin/sh -c systemctl status sshd.socket || systemctl status sshd` on a node `Node(public_ip=172.17.0.5, private_ip=172.17.0.5)`
DEBUG:dcos_e2e._subprocess_tools:● sshd.socket - OpenSSH Server Socket
DEBUG:dcos_e2e._subprocess_tools:   Loaded: loaded (/usr/lib/systemd/system/sshd.socket; disabled; vendor preset: disabled)
DEBUG:dcos_e2e._subprocess_tools:   Active: inactive (dead)
DEBUG:dcos_e2e._subprocess_tools:     Docs: man:sshd(8)
DEBUG:dcos_e2e._subprocess_tools:           man:sshd_config(5)
DEBUG:dcos_e2e._subprocess_tools:   Listen: 0.0.0.0:22 (Stream)
DEBUG:dcos_e2e._subprocess_tools: Accepted: 0; Connected: 0
DEBUG:dcos_e2e._subprocess_tools:● sshd.service - OpenSSH server daemon
DEBUG:dcos_e2e._subprocess_tools:   Loaded: loaded (/usr/lib/systemd/system/sshd.service; disabled; vendor preset: enabled)
DEBUG:dcos_e2e._subprocess_tools:   Active: inactive (dead)
DEBUG:dcos_e2e._subprocess_tools:     Docs: man:sshd(8)
DEBUG:dcos_e2e._subprocess_tools:           man:sshd_config(5)
DEBUG:dcos_e2e._subprocess_tools:
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:48:48 dcos-e2e-default-07034-public-agent-0 systemd[1]: Starting OpenSSH server daemon...
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:48:48 dcos-e2e-default-07034-public-agent-0 sshd[243]: Server listening on 0.0.0.0 port 22.
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:48:48 dcos-e2e-default-07034-public-agent-0 sshd[243]: Server listening on :: port 22.
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:48:48 dcos-e2e-default-07034-public-agent-0 systemd[1]: Started OpenSSH server daemon.
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:56:32 dcos-e2e-default-07034-public-agent-0 systemd[1]: Stopping OpenSSH server daemon...
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:56:32 dcos-e2e-default-07034-public-agent-0 systemd[1]: Stopped OpenSSH server daemon.
WARNING:retry.api:Command '['docker', 'exec', '--user', 'root', '--interactive', '3d603f5136ab808bcb4953e997a4128f71aaf1092dc355fc3592c746e3906afc', '/bin/sh', '-c', 'systemctl status sshd.socket || systemctl status sshd']' returned non-zero exit status 3., retrying in 1 seconds...
DEBUG:dcos_e2e.node:Running command `/bin/sh -c systemctl status sshd.socket || systemctl status sshd` on a node `Node(public_ip=172.17.0.5, private_ip=172.17.0.5)`
DEBUG:dcos_e2e._subprocess_tools:● sshd.socket - OpenSSH Server Socket
DEBUG:dcos_e2e._subprocess_tools:   Loaded: loaded (/usr/lib/systemd/system/sshd.socket; disabled; vendor preset: disabled)
DEBUG:dcos_e2e._subprocess_tools:   Active: inactive (dead)
DEBUG:dcos_e2e._subprocess_tools:     Docs: man:sshd(8)
DEBUG:dcos_e2e._subprocess_tools:           man:sshd_config(5)
DEBUG:dcos_e2e._subprocess_tools:   Listen: 0.0.0.0:22 (Stream)
DEBUG:dcos_e2e._subprocess_tools: Accepted: 0; Connected: 0
DEBUG:dcos_e2e._subprocess_tools:● sshd.service - OpenSSH server daemon
DEBUG:dcos_e2e._subprocess_tools:   Loaded: loaded (/usr/lib/systemd/system/sshd.service; disabled; vendor preset: enabled)
DEBUG:dcos_e2e._subprocess_tools:   Active: inactive (dead)
DEBUG:dcos_e2e._subprocess_tools:     Docs: man:sshd(8)
DEBUG:dcos_e2e._subprocess_tools:           man:sshd_config(5)
DEBUG:dcos_e2e._subprocess_tools:
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:48:48 dcos-e2e-default-07034-public-agent-0 systemd[1]: Starting OpenSSH server daemon...
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:48:48 dcos-e2e-default-07034-public-agent-0 sshd[243]: Server listening on 0.0.0.0 port 22.
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:48:48 dcos-e2e-default-07034-public-agent-0 sshd[243]: Server listening on :: port 22.
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:48:48 dcos-e2e-default-07034-public-agent-0 systemd[1]: Started OpenSSH server daemon.
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:56:32 dcos-e2e-default-07034-public-agent-0 systemd[1]: Stopping OpenSSH server daemon...
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:56:32 dcos-e2e-default-07034-public-agent-0 systemd[1]: Stopped OpenSSH server daemon.
WARNING:retry.api:Command '['docker', 'exec', '--user', 'root', '--interactive', '3d603f5136ab808bcb4953e997a4128f71aaf1092dc355fc3592c746e3906afc', '/bin/sh', '-c', 'systemctl status sshd.socket || systemctl status sshd']' returned non-zero exit status 3., retrying in 1 seconds...
DEBUG:dcos_e2e.node:Running command `/bin/sh -c systemctl status sshd.socket || systemctl status sshd` on a node `Node(public_ip=172.17.0.5, private_ip=172.17.0.5)`
DEBUG:dcos_e2e._subprocess_tools:● sshd.socket - OpenSSH Server Socket
DEBUG:dcos_e2e._subprocess_tools:   Loaded: loaded (/usr/lib/systemd/system/sshd.socket; disabled; vendor preset: disabled)
DEBUG:dcos_e2e._subprocess_tools:   Active: inactive (dead)
DEBUG:dcos_e2e._subprocess_tools:     Docs: man:sshd(8)
DEBUG:dcos_e2e._subprocess_tools:           man:sshd_config(5)
DEBUG:dcos_e2e._subprocess_tools:   Listen: 0.0.0.0:22 (Stream)
DEBUG:dcos_e2e._subprocess_tools: Accepted: 0; Connected: 0
DEBUG:dcos_e2e._subprocess_tools:● sshd.service - OpenSSH server daemon
DEBUG:dcos_e2e._subprocess_tools:   Loaded: loaded (/usr/lib/systemd/system/sshd.service; disabled; vendor preset: enabled)
DEBUG:dcos_e2e._subprocess_tools:   Active: inactive (dead)
DEBUG:dcos_e2e._subprocess_tools:     Docs: man:sshd(8)
DEBUG:dcos_e2e._subprocess_tools:           man:sshd_config(5)
DEBUG:dcos_e2e._subprocess_tools:
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:48:48 dcos-e2e-default-07034-public-agent-0 systemd[1]: Starting OpenSSH server daemon...
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:48:48 dcos-e2e-default-07034-public-agent-0 sshd[243]: Server listening on 0.0.0.0 port 22.
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:48:48 dcos-e2e-default-07034-public-agent-0 sshd[243]: Server listening on :: port 22.
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:48:48 dcos-e2e-default-07034-public-agent-0 systemd[1]: Started OpenSSH server daemon.
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:56:32 dcos-e2e-default-07034-public-agent-0 systemd[1]: Stopping OpenSSH server daemon...
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:56:32 dcos-e2e-default-07034-public-agent-0 systemd[1]: Stopped OpenSSH server daemon.
Traceback (most recent call last):
  File "./minidcos", line 10, in <module>
    sys.exit(minidcos())
  File "/root/dcos-venv/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/root/dcos-venv/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/root/dcos-venv/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/root/dcos-venv/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/root/dcos-venv/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/root/dcos-venv/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/root/dcos-venv/lib/python3.6/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/root/dcos-venv/lib/python3.6/site-packages/dcos_e2e_cli/dcos_docker/commands/wait.py", line 70, in wait
    cluster=cluster_containers.cluster,
  File "/root/dcos-venv/lib/python3.6/site-packages/dcos_e2e_cli/dcos_docker/commands/_common.py", line 190, in cluster
    public_agents=set(map(self.to_node, self.public_agents)),
  File "/root/dcos-venv/lib/python3.6/site-packages/dcos_e2e/cluster.py", line 116, in from_nodes
    cluster_backend=backend,
  File "/root/dcos-venv/lib/python3.6/site-packages/dcos_e2e/cluster.py", line 86, in __init__
    _wait_for_ssh(node=node)
  File "</root/dcos-venv/lib/python3.6/site-packages/decorator.py:decorator-gen-6>", line 2, in _wait_for_ssh
  File "/root/dcos-venv/lib/python3.6/site-packages/retry/api.py", line 74, in retry_decorator
    logger)
  File "/root/dcos-venv/lib/python3.6/site-packages/retry/api.py", line 33, in __retry_internal
    return f()
  File "/root/dcos-venv/lib/python3.6/site-packages/dcos_e2e/cluster.py", line 46, in _wait_for_ssh
    shell=True,
  File "/root/dcos-venv/lib/python3.6/site-packages/dcos_e2e/node.py", line 520, in run
    capture_output=capture_output,
  File "/root/dcos-venv/lib/python3.6/site-packages/dcos_e2e/_node_transports/_docker_exec_transport.py", line 125, in run
    pipe_output=capture_output,
  File "/root/dcos-venv/lib/python3.6/site-packages/dcos_e2e/_subprocess_tools.py", line 145, in run_subprocess
    stderr=stderr,
subprocess.CalledProcessError: Command '['docker', 'exec', '--user', 'root', '--interactive', '3d603f5136ab808bcb4953e997a4128f71aaf1092dc355fc3592c746e3906afc', '/bin/sh', '-c', 'systemctl status sshd.socket || systemctl status sshd']' returned non-zero exit status 3.

Did it use the ssh service? I also try to enable sshd service and start the service, but again see the last error.

as I ran on, I recreate the cluster again. But still meet some errors.

  1. this time ssh is up, but others are not.
⠴ Waiting for DC/OS to startWARNING:retry.api:Command '['docker', 'exec', '--user', 'root', '--interactive', '2669ffd6c3f2633df6fb3be6fad73fd8f5080918f825465e376b48d3b914aa72', '/bin/sh', '-c', 'sudo /opt/mesosphere/bin/dcos-check-runner check node-poststart || sudo /opt/mesosphere/bin/dcos-diagnostics check node-poststart || /opt/mesosphere/bin/3dt --diag']' returned non-zero exit status 127., retrying in 10 seconds...

It means command returned status is not correct. So I log into master node, and check the commands and services.

  1. some of the commands does not exist.
[root@dcos-e2e-default-766cf-master-0 bin]# /opt/mesosphere/bin/dcos-diagnostics --verbose check node-poststart
Error: unknown command "check" for "dcos-diagnostics"
Run 'dcos-diagnostics --help' for usage.
unknown command "check" for "dcos-diagnostics"

in fact dcos-diagnostics dose not support check command anymore, and here is the help file.

Usage:
  dcos-diagnostics [flags]
  dcos-diagnostics [command]

Available Commands:
  daemon      Start an http server and polls the components health.
  help        Help about any command

Flags:
      --config string   config file (default is $HOME/.dcos-diagnostics.yaml)
      --diag            Check DC/OS components health.
  -h, --help            help for dcos-diagnostics
      --role string     Set node role
      --verbose         Use verbose debug output.
      --version         Print dcos-diagnostics version
  1. 3dt command dose not exist.
[root@dcos-e2e-default-766cf-master-0 bin]#  /opt/mesosphere/bin/3dt --diag
bash: /opt/mesosphere/bin/3dt: No such file or directory
  1. key services are not running good.
[root@dcos-e2e-default-766cf-master-0 bin]# systemctl | grep dcos-
  dcos-adminrouter.service             loaded activating auto-restart       Admin Router Master: Uses NGINX to expose a unified control plane proxy for components and services
  dcos-bouncer.service                 loaded activating auto-restart       DC/OS Identity and Access Manager (Bouncer): offers login methods and user management
  dcos-checks-api.service              loaded active     running            DC/OS Checks API: Provides node and cluster checks
● dcos-checks-poststart.service        loaded failed     failed             DC/OS Poststart Checks: Run node-poststart checks

and these services cannot be restarted using systemctl restart command. Hope someone can help me. TIA

Thank you @wiwengweng for this thorough bug report.
I appreciate that the logging could be better.

The rough order of operations on the wait command is this:

  • Start nodes
  • Wait for the SSH service to start up on all nodes (this is what did not work the first time for you, I am unsure why)
  • Wait for DC/OS checks to pass

The last of these has to support multiple versions of DC/OS.
It runs multiple commands on each node, to support each version of DC/OS.
Only one of those needs to pass.
It runs those commands on repeat until the services are running well.

The following is a related comment in the code:

    The execution will differ for different version of DC/OS.
    ``dcos-check-runner`` only exists on DC/OS 1.12+. ``dcos-diagnostics
    check node-poststart`` only works on DC/OS 1.10 and 1.11. ``3dt`` only
    exists on DC/OS 1.9. ``node-poststart`` requires ``sudo`` to allow
    reading the CA certificate used by certain checks.

This means that it is not a problem for example that 3dt does not exist on your nodes.

as I ran on, I recreate the cluster again. But still meet some errors.

How long did you wait to see if the services just started up?

thanks for your response, adamtheturtle. I think I have waited for about 30mins+ before I see this error. Then I also try to manually start sshd in all nodes, and be sure they are all alive. But still see some services are "down/auto restart" state in my master container.

  dcos-adminrouter.service             loaded activating start-pre    start Admin Router Master: Uses NGINX to expose a unified control plane proxy for components and services
  dcos-bouncer.service                 loaded activating auto-restart       DC/OS Identity and Access Manager (Bouncer): offers login methods and user management
  dcos-checks-api.service              loaded active     running            DC/OS Checks API: Provides node and cluster checks
● dcos-checks-poststart.service        loaded failed     failed             DC/OS Poststart Checks: Run node-poststart checks
  dcos-cockroach.service               loaded activating auto-restart       CockroachDB: Database for the DC/OS IAM

by this command line docker exec --user root --interactive 4449ed14b85f /bin/sh -c systemctl status sshd.socket || systemctl status sshd

and here 4449ed14b85f is just my master container ID