returned non-zero exit status 3
wiwengweng opened this issue · 4 comments
I am using miniDCOS to install and test dcos but without successful.
env:
ubuntu16.04
docker: Docker version 18.09.7, build 2d0083d
Python 3.6
minidcos, version 2019.06.19.0
(base) root@bogon:~/dcos-venv/bin# ./minidcos docker wait --transport docker-exec -vvv
and get the following output:
DEBUG:dcos_e2e.node:Running command `/bin/sh -c systemctl status sshd.socket || systemctl status sshd` on a node `Node(public_ip=172.17.0.5, private_ip=172.17.0.5)`
DEBUG:dcos_e2e._subprocess_tools:● sshd.socket - OpenSSH Server Socket
DEBUG:dcos_e2e._subprocess_tools: Loaded: loaded (/usr/lib/systemd/system/sshd.socket; disabled; vendor preset: disabled)
DEBUG:dcos_e2e._subprocess_tools: Active: inactive (dead)
DEBUG:dcos_e2e._subprocess_tools: Docs: man:sshd(8)
DEBUG:dcos_e2e._subprocess_tools: man:sshd_config(5)
DEBUG:dcos_e2e._subprocess_tools: Listen: 0.0.0.0:22 (Stream)
DEBUG:dcos_e2e._subprocess_tools: Accepted: 0; Connected: 0
DEBUG:dcos_e2e._subprocess_tools:● sshd.service - OpenSSH server daemon
DEBUG:dcos_e2e._subprocess_tools: Loaded: loaded (/usr/lib/systemd/system/sshd.service; disabled; vendor preset: enabled)
DEBUG:dcos_e2e._subprocess_tools: Active: inactive (dead)
DEBUG:dcos_e2e._subprocess_tools: Docs: man:sshd(8)
DEBUG:dcos_e2e._subprocess_tools: man:sshd_config(5)
DEBUG:dcos_e2e._subprocess_tools:
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:48:48 dcos-e2e-default-07034-public-agent-0 systemd[1]: Starting OpenSSH server daemon...
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:48:48 dcos-e2e-default-07034-public-agent-0 sshd[243]: Server listening on 0.0.0.0 port 22.
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:48:48 dcos-e2e-default-07034-public-agent-0 sshd[243]: Server listening on :: port 22.
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:48:48 dcos-e2e-default-07034-public-agent-0 systemd[1]: Started OpenSSH server daemon.
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:56:32 dcos-e2e-default-07034-public-agent-0 systemd[1]: Stopping OpenSSH server daemon...
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:56:32 dcos-e2e-default-07034-public-agent-0 systemd[1]: Stopped OpenSSH server daemon.
WARNING:retry.api:Command '['docker', 'exec', '--user', 'root', '--interactive', '3d603f5136ab808bcb4953e997a4128f71aaf1092dc355fc3592c746e3906afc', '/bin/sh', '-c', 'systemctl status sshd.socket || systemctl status sshd']' returned non-zero exit status 3., retrying in 1 seconds...
DEBUG:dcos_e2e.node:Running command `/bin/sh -c systemctl status sshd.socket || systemctl status sshd` on a node `Node(public_ip=172.17.0.5, private_ip=172.17.0.5)`
DEBUG:dcos_e2e._subprocess_tools:● sshd.socket - OpenSSH Server Socket
DEBUG:dcos_e2e._subprocess_tools: Loaded: loaded (/usr/lib/systemd/system/sshd.socket; disabled; vendor preset: disabled)
DEBUG:dcos_e2e._subprocess_tools: Active: inactive (dead)
DEBUG:dcos_e2e._subprocess_tools: Docs: man:sshd(8)
DEBUG:dcos_e2e._subprocess_tools: man:sshd_config(5)
DEBUG:dcos_e2e._subprocess_tools: Listen: 0.0.0.0:22 (Stream)
DEBUG:dcos_e2e._subprocess_tools: Accepted: 0; Connected: 0
DEBUG:dcos_e2e._subprocess_tools:● sshd.service - OpenSSH server daemon
DEBUG:dcos_e2e._subprocess_tools: Loaded: loaded (/usr/lib/systemd/system/sshd.service; disabled; vendor preset: enabled)
DEBUG:dcos_e2e._subprocess_tools: Active: inactive (dead)
DEBUG:dcos_e2e._subprocess_tools: Docs: man:sshd(8)
DEBUG:dcos_e2e._subprocess_tools: man:sshd_config(5)
DEBUG:dcos_e2e._subprocess_tools:
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:48:48 dcos-e2e-default-07034-public-agent-0 systemd[1]: Starting OpenSSH server daemon...
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:48:48 dcos-e2e-default-07034-public-agent-0 sshd[243]: Server listening on 0.0.0.0 port 22.
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:48:48 dcos-e2e-default-07034-public-agent-0 sshd[243]: Server listening on :: port 22.
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:48:48 dcos-e2e-default-07034-public-agent-0 systemd[1]: Started OpenSSH server daemon.
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:56:32 dcos-e2e-default-07034-public-agent-0 systemd[1]: Stopping OpenSSH server daemon...
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:56:32 dcos-e2e-default-07034-public-agent-0 systemd[1]: Stopped OpenSSH server daemon.
WARNING:retry.api:Command '['docker', 'exec', '--user', 'root', '--interactive', '3d603f5136ab808bcb4953e997a4128f71aaf1092dc355fc3592c746e3906afc', '/bin/sh', '-c', 'systemctl status sshd.socket || systemctl status sshd']' returned non-zero exit status 3., retrying in 1 seconds...
DEBUG:dcos_e2e.node:Running command `/bin/sh -c systemctl status sshd.socket || systemctl status sshd` on a node `Node(public_ip=172.17.0.5, private_ip=172.17.0.5)`
DEBUG:dcos_e2e._subprocess_tools:● sshd.socket - OpenSSH Server Socket
DEBUG:dcos_e2e._subprocess_tools: Loaded: loaded (/usr/lib/systemd/system/sshd.socket; disabled; vendor preset: disabled)
DEBUG:dcos_e2e._subprocess_tools: Active: inactive (dead)
DEBUG:dcos_e2e._subprocess_tools: Docs: man:sshd(8)
DEBUG:dcos_e2e._subprocess_tools: man:sshd_config(5)
DEBUG:dcos_e2e._subprocess_tools: Listen: 0.0.0.0:22 (Stream)
DEBUG:dcos_e2e._subprocess_tools: Accepted: 0; Connected: 0
DEBUG:dcos_e2e._subprocess_tools:● sshd.service - OpenSSH server daemon
DEBUG:dcos_e2e._subprocess_tools: Loaded: loaded (/usr/lib/systemd/system/sshd.service; disabled; vendor preset: enabled)
DEBUG:dcos_e2e._subprocess_tools: Active: inactive (dead)
DEBUG:dcos_e2e._subprocess_tools: Docs: man:sshd(8)
DEBUG:dcos_e2e._subprocess_tools: man:sshd_config(5)
DEBUG:dcos_e2e._subprocess_tools:
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:48:48 dcos-e2e-default-07034-public-agent-0 systemd[1]: Starting OpenSSH server daemon...
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:48:48 dcos-e2e-default-07034-public-agent-0 sshd[243]: Server listening on 0.0.0.0 port 22.
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:48:48 dcos-e2e-default-07034-public-agent-0 sshd[243]: Server listening on :: port 22.
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:48:48 dcos-e2e-default-07034-public-agent-0 systemd[1]: Started OpenSSH server daemon.
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:56:32 dcos-e2e-default-07034-public-agent-0 systemd[1]: Stopping OpenSSH server daemon...
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:56:32 dcos-e2e-default-07034-public-agent-0 systemd[1]: Stopped OpenSSH server daemon.
WARNING:retry.api:Command '['docker', 'exec', '--user', 'root', '--interactive', '3d603f5136ab808bcb4953e997a4128f71aaf1092dc355fc3592c746e3906afc', '/bin/sh', '-c', 'systemctl status sshd.socket || systemctl status sshd']' returned non-zero exit status 3., retrying in 1 seconds...
DEBUG:dcos_e2e.node:Running command `/bin/sh -c systemctl status sshd.socket || systemctl status sshd` on a node `Node(public_ip=172.17.0.5, private_ip=172.17.0.5)`
DEBUG:dcos_e2e._subprocess_tools:● sshd.socket - OpenSSH Server Socket
DEBUG:dcos_e2e._subprocess_tools: Loaded: loaded (/usr/lib/systemd/system/sshd.socket; disabled; vendor preset: disabled)
DEBUG:dcos_e2e._subprocess_tools: Active: inactive (dead)
DEBUG:dcos_e2e._subprocess_tools: Docs: man:sshd(8)
DEBUG:dcos_e2e._subprocess_tools: man:sshd_config(5)
DEBUG:dcos_e2e._subprocess_tools: Listen: 0.0.0.0:22 (Stream)
DEBUG:dcos_e2e._subprocess_tools: Accepted: 0; Connected: 0
DEBUG:dcos_e2e._subprocess_tools:● sshd.service - OpenSSH server daemon
DEBUG:dcos_e2e._subprocess_tools: Loaded: loaded (/usr/lib/systemd/system/sshd.service; disabled; vendor preset: enabled)
DEBUG:dcos_e2e._subprocess_tools: Active: inactive (dead)
DEBUG:dcos_e2e._subprocess_tools: Docs: man:sshd(8)
DEBUG:dcos_e2e._subprocess_tools: man:sshd_config(5)
DEBUG:dcos_e2e._subprocess_tools:
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:48:48 dcos-e2e-default-07034-public-agent-0 systemd[1]: Starting OpenSSH server daemon...
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:48:48 dcos-e2e-default-07034-public-agent-0 sshd[243]: Server listening on 0.0.0.0 port 22.
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:48:48 dcos-e2e-default-07034-public-agent-0 sshd[243]: Server listening on :: port 22.
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:48:48 dcos-e2e-default-07034-public-agent-0 systemd[1]: Started OpenSSH server daemon.
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:56:32 dcos-e2e-default-07034-public-agent-0 systemd[1]: Stopping OpenSSH server daemon...
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:56:32 dcos-e2e-default-07034-public-agent-0 systemd[1]: Stopped OpenSSH server daemon.
WARNING:retry.api:Command '['docker', 'exec', '--user', 'root', '--interactive', '3d603f5136ab808bcb4953e997a4128f71aaf1092dc355fc3592c746e3906afc', '/bin/sh', '-c', 'systemctl status sshd.socket || systemctl status sshd']' returned non-zero exit status 3., retrying in 1 seconds...
DEBUG:dcos_e2e.node:Running command `/bin/sh -c systemctl status sshd.socket || systemctl status sshd` on a node `Node(public_ip=172.17.0.5, private_ip=172.17.0.5)`
DEBUG:dcos_e2e._subprocess_tools:● sshd.socket - OpenSSH Server Socket
DEBUG:dcos_e2e._subprocess_tools: Loaded: loaded (/usr/lib/systemd/system/sshd.socket; disabled; vendor preset: disabled)
DEBUG:dcos_e2e._subprocess_tools: Active: inactive (dead)
DEBUG:dcos_e2e._subprocess_tools: Docs: man:sshd(8)
DEBUG:dcos_e2e._subprocess_tools: man:sshd_config(5)
DEBUG:dcos_e2e._subprocess_tools: Listen: 0.0.0.0:22 (Stream)
DEBUG:dcos_e2e._subprocess_tools: Accepted: 0; Connected: 0
DEBUG:dcos_e2e._subprocess_tools:● sshd.service - OpenSSH server daemon
DEBUG:dcos_e2e._subprocess_tools: Loaded: loaded (/usr/lib/systemd/system/sshd.service; disabled; vendor preset: enabled)
DEBUG:dcos_e2e._subprocess_tools: Active: inactive (dead)
DEBUG:dcos_e2e._subprocess_tools: Docs: man:sshd(8)
DEBUG:dcos_e2e._subprocess_tools: man:sshd_config(5)
DEBUG:dcos_e2e._subprocess_tools:
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:48:48 dcos-e2e-default-07034-public-agent-0 systemd[1]: Starting OpenSSH server daemon...
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:48:48 dcos-e2e-default-07034-public-agent-0 sshd[243]: Server listening on 0.0.0.0 port 22.
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:48:48 dcos-e2e-default-07034-public-agent-0 sshd[243]: Server listening on :: port 22.
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:48:48 dcos-e2e-default-07034-public-agent-0 systemd[1]: Started OpenSSH server daemon.
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:56:32 dcos-e2e-default-07034-public-agent-0 systemd[1]: Stopping OpenSSH server daemon...
DEBUG:dcos_e2e._subprocess_tools:Jul 04 03:56:32 dcos-e2e-default-07034-public-agent-0 systemd[1]: Stopped OpenSSH server daemon.
Traceback (most recent call last):
File "./minidcos", line 10, in <module>
sys.exit(minidcos())
File "/root/dcos-venv/lib/python3.6/site-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/root/dcos-venv/lib/python3.6/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/root/dcos-venv/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/root/dcos-venv/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/root/dcos-venv/lib/python3.6/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/root/dcos-venv/lib/python3.6/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/root/dcos-venv/lib/python3.6/site-packages/click/decorators.py", line 17, in new_func
return f(get_current_context(), *args, **kwargs)
File "/root/dcos-venv/lib/python3.6/site-packages/dcos_e2e_cli/dcos_docker/commands/wait.py", line 70, in wait
cluster=cluster_containers.cluster,
File "/root/dcos-venv/lib/python3.6/site-packages/dcos_e2e_cli/dcos_docker/commands/_common.py", line 190, in cluster
public_agents=set(map(self.to_node, self.public_agents)),
File "/root/dcos-venv/lib/python3.6/site-packages/dcos_e2e/cluster.py", line 116, in from_nodes
cluster_backend=backend,
File "/root/dcos-venv/lib/python3.6/site-packages/dcos_e2e/cluster.py", line 86, in __init__
_wait_for_ssh(node=node)
File "</root/dcos-venv/lib/python3.6/site-packages/decorator.py:decorator-gen-6>", line 2, in _wait_for_ssh
File "/root/dcos-venv/lib/python3.6/site-packages/retry/api.py", line 74, in retry_decorator
logger)
File "/root/dcos-venv/lib/python3.6/site-packages/retry/api.py", line 33, in __retry_internal
return f()
File "/root/dcos-venv/lib/python3.6/site-packages/dcos_e2e/cluster.py", line 46, in _wait_for_ssh
shell=True,
File "/root/dcos-venv/lib/python3.6/site-packages/dcos_e2e/node.py", line 520, in run
capture_output=capture_output,
File "/root/dcos-venv/lib/python3.6/site-packages/dcos_e2e/_node_transports/_docker_exec_transport.py", line 125, in run
pipe_output=capture_output,
File "/root/dcos-venv/lib/python3.6/site-packages/dcos_e2e/_subprocess_tools.py", line 145, in run_subprocess
stderr=stderr,
subprocess.CalledProcessError: Command '['docker', 'exec', '--user', 'root', '--interactive', '3d603f5136ab808bcb4953e997a4128f71aaf1092dc355fc3592c746e3906afc', '/bin/sh', '-c', 'systemctl status sshd.socket || systemctl status sshd']' returned non-zero exit status 3.
Did it use the ssh service? I also try to enable sshd service and start the service, but again see the last error.
as I ran on, I recreate the cluster again. But still meet some errors.
- this time ssh is up, but others are not.
⠴ Waiting for DC/OS to startWARNING:retry.api:Command '['docker', 'exec', '--user', 'root', '--interactive', '2669ffd6c3f2633df6fb3be6fad73fd8f5080918f825465e376b48d3b914aa72', '/bin/sh', '-c', 'sudo /opt/mesosphere/bin/dcos-check-runner check node-poststart || sudo /opt/mesosphere/bin/dcos-diagnostics check node-poststart || /opt/mesosphere/bin/3dt --diag']' returned non-zero exit status 127., retrying in 10 seconds...
It means command returned status is not correct. So I log into master node, and check the commands and services.
- some of the commands does not exist.
[root@dcos-e2e-default-766cf-master-0 bin]# /opt/mesosphere/bin/dcos-diagnostics --verbose check node-poststart
Error: unknown command "check" for "dcos-diagnostics"
Run 'dcos-diagnostics --help' for usage.
unknown command "check" for "dcos-diagnostics"
in fact dcos-diagnostics
dose not support check command anymore, and here is the help file.
Usage:
dcos-diagnostics [flags]
dcos-diagnostics [command]
Available Commands:
daemon Start an http server and polls the components health.
help Help about any command
Flags:
--config string config file (default is $HOME/.dcos-diagnostics.yaml)
--diag Check DC/OS components health.
-h, --help help for dcos-diagnostics
--role string Set node role
--verbose Use verbose debug output.
--version Print dcos-diagnostics version
3dt
command dose not exist.
[root@dcos-e2e-default-766cf-master-0 bin]# /opt/mesosphere/bin/3dt --diag
bash: /opt/mesosphere/bin/3dt: No such file or directory
- key services are not running good.
[root@dcos-e2e-default-766cf-master-0 bin]# systemctl | grep dcos-
dcos-adminrouter.service loaded activating auto-restart Admin Router Master: Uses NGINX to expose a unified control plane proxy for components and services
dcos-bouncer.service loaded activating auto-restart DC/OS Identity and Access Manager (Bouncer): offers login methods and user management
dcos-checks-api.service loaded active running DC/OS Checks API: Provides node and cluster checks
● dcos-checks-poststart.service loaded failed failed DC/OS Poststart Checks: Run node-poststart checks
and these services cannot be restarted using systemctl restart
command. Hope someone can help me. TIA
Thank you @wiwengweng for this thorough bug report.
I appreciate that the logging could be better.
The rough order of operations on the wait
command is this:
- Start nodes
- Wait for the SSH service to start up on all nodes (this is what did not work the first time for you, I am unsure why)
- Wait for DC/OS checks to pass
The last of these has to support multiple versions of DC/OS.
It runs multiple commands on each node, to support each version of DC/OS.
Only one of those needs to pass.
It runs those commands on repeat until the services are running well.
The following is a related comment in the code:
The execution will differ for different version of DC/OS.
``dcos-check-runner`` only exists on DC/OS 1.12+. ``dcos-diagnostics
check node-poststart`` only works on DC/OS 1.10 and 1.11. ``3dt`` only
exists on DC/OS 1.9. ``node-poststart`` requires ``sudo`` to allow
reading the CA certificate used by certain checks.
This means that it is not a problem for example that 3dt
does not exist on your nodes.
as I ran on, I recreate the cluster again. But still meet some errors.
How long did you wait to see if the services just started up?
thanks for your response, adamtheturtle. I think I have waited for about 30mins+ before I see this error. Then I also try to manually start sshd in all nodes, and be sure they are all alive. But still see some services are "down/auto restart" state in my master container.
dcos-adminrouter.service loaded activating start-pre start Admin Router Master: Uses NGINX to expose a unified control plane proxy for components and services
dcos-bouncer.service loaded activating auto-restart DC/OS Identity and Access Manager (Bouncer): offers login methods and user management
dcos-checks-api.service loaded active running DC/OS Checks API: Provides node and cluster checks
● dcos-checks-poststart.service loaded failed failed DC/OS Poststart Checks: Run node-poststart checks
dcos-cockroach.service loaded activating auto-restart CockroachDB: Database for the DC/OS IAM
by this command line docker exec --user root --interactive 4449ed14b85f /bin/sh -c systemctl status sshd.socket || systemctl status sshd
and here 4449ed14b85f
is just my master container ID