Problems rebooting host running px-dev
Closed this issue · 13 comments
As part of testing a px-dev POC, I have tried to reboot one of the hosts (RancherOS 0.9.0) that is part of my newly created px cluster.
Once the host has come up and all the containers have started, the node is not able to re-join the cluster. Something is preventing the startup of the px service.
After the startup attempts, I get a lot of the following in the log:
/03/2017 14:27:022017-03-28 12:27:02,793 INFO spawned: 'px_event_listener' with pid 4946
28/03/2017 14:27:032017-03-28 12:27:03,983 INFO success: px_event_listener entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
28/03/2017 14:27:042017-03-28 12:27:04,208 INFO exited: px_event_listener (exit status 1; not expected)
28/03/2017 14:27:042017-03-28 12:27:04,208 ERRO pool px_event_listener event buffer overflowed, discarding event 463
28/03/2017 14:27:052017-03-28 12:27:05,281 INFO spawned: 'px_event_listener' with pid 4949
28/03/2017 14:27:062017-03-28 12:27:06,417 INFO success: px_event_listener entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
28/03/2017 14:27:062017-03-28 12:27:06,625 INFO exited: px_event_listener (exit status 1; not expected)
28/03/2017 14:27:062017-03-28 12:27:06,625 ERRO pool px_event_listener event buffer overflowed, discarding event 464
28/03/2017 14:27:072017-03-28 12:27:07,710 INFO spawned: 'px_event_listener' with pid 4952
Restarting the container from Rancher (or from the Docker CLI), gives the following output:
9eea022bb016 9eea022bb016 0b3180a13460 cc3df29026cf 0720fcceaa4c d862d59d48b2 0bb26f652d6b 8f5add2719bb 134bcec2bdb9 9c736891bddd a11849ef8244 57567d8be17b 57567d8be17b b77a2a606112
2017-03-28 12:28:57,274 INFO stopped: pxdaemon (exit status 0)
28/03/2017 14:28:572017-03-28 12:28:57,277 INFO stopped: exec (exit status 2)
28/03/2017 14:28:57Waiting for data availability
28/03/2017 14:28:57Tracing stopped for session pxd
28/03/2017 14:28:57Session pxd destroyed
28/03/2017 14:28:57Exiting lttng
28/03/2017 14:28:572017-03-28 12:28:57,529 INFO stopped: lttng (exit status 0)
28/03/2017 14:28:572017-03-28 12:28:57,656 INFO stopped: relayd (exit status 0)
28/03/2017 14:29:04Tue Mar 28 12:29:04 UTC 2017 : Running on Linux nodeable-rancheros-pxd-003 4.9.15-rancher #1 SMP Wed Mar 15 05:07:13 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
28/03/2017 14:29:04size for /dev/shm is 67100672, less than required 293601280
28/03/2017 14:29:07PXD version: ab65c7fb556554eeb279c49c3d3d571576422669
28/03/2017 14:29:08Using cluster: 14c12ce6-13a7-11e7-93ae-92361f002671
28/03/2017 14:29:08Key Value Store: etcd://10.42.226.249:2379
28/03/2017 14:29:08Using storage device: /dev/xvdc
28/03/2017 14:29:08/docker-entry-point.sh: line 721: /sys/fs/cgroup/cpu/cpu.rt_runtime_us: Permission denied
28/03/2017 14:29:08Failed to enable rt scheduler
28/03/2017 14:29:102017-03-28 12:29:10,711 CRIT Supervisor running as root (no user in config file)
28/03/2017 14:29:102017-03-28 12:29:10,720 INFO supervisord started with pid 1
28/03/2017 14:29:112017-03-28 12:29:11,777 INFO spawned: 'relayd' with pid 67
28/03/2017 14:29:112017-03-28 12:29:11,802 INFO spawned: 'lttng' with pid 68
28/03/2017 14:29:112017-03-28 12:29:11,896 INFO spawned: 'exec' with pid 69
28/03/2017 14:29:122017-03-28 12:29:11,983 INFO spawned: 'pxdaemon' with pid 70
28/03/2017 14:29:122017-03-28 12:29:12,232 INFO spawned: 'px_event_listener' with pid 71
28/03/2017 14:29:122017-03-28 12:29:12,833 INFO success: relayd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
28/03/2017 14:29:122017-03-28 12:29:12,833 INFO success: lttng entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
28/03/2017 14:29:122017-03-28 12:29:12,838 INFO success: exec entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
28/03/2017 14:29:12Tue Mar 28 12:29:12 UTC 2017 size 4096 is within limits of maxsize 436207616
28/03/2017 14:29:122017-03-28 12:29:12,950 INFO success: pxdaemon entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
28/03/2017 14:29:13PXPROCS: lttng not started yet...sleeping
28/03/2017 14:29:13
28/03/2017 14:29:132017-03-28 12:29:13,029 INFO success: px_event_listener entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
28/03/2017 14:29:16PXPROCS: lttng not started yet...sleeping
8/03/2017 14:29:16PXPROCS: lttng not started yet...sleeping
28/03/2017 14:29:16
28/03/2017 14:29:19PXPROCS: lttng not started yet...sleeping
28/03/2017 14:29:19
28/03/2017 14:29:19Spawning a session daemon
28/03/2017 14:29:19Session pxd created.
28/03/2017 14:29:19Traces will be written in net://localhost
28/03/2017 14:29:19Live timer set to 2000000 usec
28/03/2017 14:29:19Session pxd set to shm_path: /var/lib/osd/lttng/pxd-20170328-122919.
28/03/2017 14:29:19UST channel pxd_channel enabled for session pxd
28/03/2017 14:29:20All UST events are enabled in channel pxd_channel
28/03/2017 14:29:20Tracing started for session pxd
28/03/2017 14:29:22PXPROCS: Started px-ns with pid 123
28/03/2017 14:29:22PXPROCS: Started px-storage with pid 124
28/03/2017 14:29:22bash: connect: Connection refused
28/03/2017 14:29:22bash: /dev/tcp/localhost/9009: Connection refused
28/03/2017 14:29:22PXPROCS: px-storage not started yet...sleeping
28/03/2017 14:29:22
28/03/2017 14:29:22time="2017-03-28T12:29:22Z" level=info msg="px-ns Starting.."
28/03/2017 14:29:22NS client starting fuse module
28/03/2017 14:29:22Starting NS server
28/03/2017 14:29:22C++ grpc server listening on 0.0.0.0:9009
28/03/2017 14:29:25PXPROCS: Started px with pid 145
28/03/2017 14:29:25PXPROCS: Started watchdog with pid 146
28/03/2017 14:29:252017-03-28_12:29:25: PX-Watchdog: Starting watcher
28/03/2017 14:29:262017-03-28_12:29:26: PX-Watchdog: Waiting for px process to start
28/03/2017 14:29:27root 145 70 58 12:29 ? 00:00:01 /usr/local/bin/px -daemon
28/03/2017 14:29:272017-03-28_12:29:27: PX-Watchdog: (pid 145): Begin monitoring
28/03/2017 14:29:27time="2017-03-28T12:29:27Z" level=info msg="Registering [kernel] as a volume driver"
28/03/2017 14:29:27time="2017-03-28T12:29:27Z" level=info msg="Starting PX Version: 1.1.6-693c3b5 - Build Version 693c3b5a7b1dbc9eda8f2bb7e07c44455c94aafe"
28/03/2017 14:29:27time="2017-03-28T12:29:27Z" level=info msg="Developer license detected"
28/03/2017 14:29:27time="2017-03-28T12:29:27Z" level=info msg="Initializing PX hooks with none"
28/03/2017 14:29:27time="2017-03-28T12:29:27Z" level=error msg="Could not initialize scheduler hooks for none: Unhandled scheduler "
28/03/2017 14:29:27time="2017-03-28T12:29:27Z" level=info msg="Loaded existing NodeID: 47765f32-d8ae-44b9-80e2-b98feea1037d"
28/03/2017 14:29:27time="2017-03-28T12:29:27Z" level=info msg="PX Configuration Loaded..."
28/03/2017 14:29:27time="2017-03-28T12:29:27Z" level=info msg="PX Cluster ID: 14c12ce6-13a7-11e7-93ae-92361f002671"
28/03/2017 14:29:27time="2017-03-28T12:29:27Z" level=info msg="PX Node ID: 47765f32-d8ae-44b9-80e2-b98feea1037d"
28/03/2017 14:29:27time="2017-03-28T12:29:27Z" level=info msg="PX Node Index: 3"
28/03/2017 14:29:27time="2017-03-28T12:29:27Z" level=info msg="PX Management Iface: "
28/03/2017 14:29:27time="2017-03-28T12:29:27Z" level=info msg="PX KVDB: [http://10.42.226.249:2379]"
28/03/2017 14:29:27time="2017-03-28T12:29:27Z" level=info msg="PX Logging API: "
28/03/2017 14:29:27time="2017-03-28T12:29:27Z" level=info msg="PX Storage Devices: [/dev/xvdc], Raid Level: data() md()"
28/03/2017 14:29:27time="2017-03-28T12:29:27Z" level=info msg="PX initialized with Docker host unix:///var/run/docker.sock"
28/03/2017 14:29:27time="2017-03-28T12:29:27Z" level=info msg="Alerts initialized successfully for this cluster"
28/03/2017 14:29:27time="2017-03-28T12:29:27Z" level=info msg="PX starting cluster manager..."
28/03/2017 14:29:27time="2017-03-28T12:29:27Z" level=info msg="PX cluster manager running."
28/03/2017 14:29:27time="2017-03-28T12:29:27Z" level=info msg="PX limiting cluster size to 3."
28/03/2017 14:29:27time="2017-03-28T12:29:27Z" level=info msg="Starting server on port: 9001"
28/03/2017 14:29:27time="2017-03-28T12:29:27Z" level=info msg="PX API server running on port 9001."
28/03/2017 14:29:27time="2017-03-28T12:29:27Z" level=info msg="Adding cluster event listener: PX Storage Service"
28/03/2017 14:29:27time="2017-03-28T12:29:27Z" level=info msg="Adding cluster event listener: Scheduler"
28/03/2017 14:29:27time="2017-03-28T12:29:27Z" level=info msg="Cluster manager starting..."
28/03/2017 14:29:27time="2017-03-28T12:29:27Z" level=info msg="Cluster state is OK... Joining the cluster."
28/03/2017 14:29:27time="2017-03-28T12:29:27Z" level=info msg="Node 47765f32-d8ae-44b9-80e2-b98feea1037d joining cluster..."
28/03/2017 14:29:27time="2017-03-28T12:29:27Z" level=info msg="Cluster ID: 14c12ce6-13a7-11e7-93ae-92361f002671"
28/03/2017 14:29:27time="2017-03-28T12:29:27Z" level=info msg="Node Mgmt IP: 172.20.20.49"
28/03/2017 14:29:27time="2017-03-28T12:29:27Z" level=info msg="Node Data IP: 172.20.20.49"
28/03/2017 14:29:27time="2017-03-28T12:29:27Z" level=error msg="This node is already initialized but could not be found in the cluster map."
28/03/2017 14:29:27time="2017-03-28T12:29:27Z" level=error msg="Failed to initialize node in cluster. This node is already initialized but could not be found in the cluster map."
28/03/2017 14:29:27time="2017-03-28T12:29:27Z" level=error msg="Init Failure: This node is already initialized but could not be found in the cluster map. Please remove /etc/pwx/.private.json to initialize a new node."
28/03/2017 14:29:27time="2017-03-28T12:29:27Z" level=warning msg="Not found: /v1/osd-volumes "
28/03/2017 14:29:27time="2017-03-28T12:29:27Z" level=error msg="Error while calling home: HTTP-404: 404 page not found\n"
Almost looks like it is trying to re-initialize the node instead of re-joining the cluster.
@asthomasdk how many nodes do you have in this cluster?
I have 3 nodes.
@asthomasdk do all nodes fail to come up or just this one node? And if all nodes fail to come up, is the etcd cluster data persistent?
I'll need to test that - I only tried rebooting node so far.
OK but the etcd server here: http://10.42.226.249:2379 has it's data on a persistent volume, correct? Was the etcd server itself restarted and respun on a different node?
This is actually an etcd-ha cluster started from rancher before I started portworx. I am not 100% sure if this is mapping data to the host or storing it in a volume inside the container. But in either case it is being persisted.
Can you give us the output of http://10.42.226.249:2379/v2/keys/pwx
? Essentially this PX node is complaining that it cannot find the cluster meta data in etcd.
@asthomasdk Do you still have this output or the issue? Please let us know
I get similar error, i'm using rancheros on my VM
8/9/2017 2:13:13 PMPXPROCS: Started px with pid 1691
8/9/2017 2:13:13 PMPXPROCS: Started watchdog with pid 1692
8/9/2017 2:13:13 PM2017-08-09_ 8:43:13: PX-Watchdog: Starting watcher
8/9/2017 2:13:13 PM2017-08-09_ 8:43:13: PX-Watchdog: Waiting for px process to start
8/9/2017 2:13:13 PMroot 1691 1648 21 08:43 ? 00:00:00 /usr/local/bin/px -daemon
8/9/2017 2:13:13 PM2017-08-09_ 8:43:13: PX-Watchdog: (pid 1691): Begin monitoring
8/9/2017 2:13:13 PMtime="2017-08-09T08:43:13Z" level=info msg="Registering [kernel] as a volume driver"
8/9/2017 2:13:13 PMtime="2017-08-09T08:43:13Z" level=info msg="Starting PX Version: 1.2.9-17d16e4 - Build Version 17d16e4910fe7a4ac5a3d6549bfae5457448fa9c"
8/9/2017 2:13:14 PMtime="2017-08-09T08:43:14Z" level=info msg="Node is not yet initialized"
8/9/2017 2:13:14 PMtime="2017-08-09T08:43:14Z" level=error msg="Could not load config file /etc/pwx/config.json due to: Configuration check failed: Unknown discovery endpoint : http in [http://10.42.118.4:2379]. Please visit http://docs.portworx.com for more information."
8/9/2017 2:13:14 PMPXPROCS: px daemon exited with code: 1
8/9/2017 2:13:14 PM1656
8/9/2017 2:13:14 PMPXPROCS: trying to kill px-storage...
8/9/2017 2:13:14 PM2017-08-09 08:43:14,916 INFO exited: pxdaemon (exit status 1; not expected)
8/9/2017 2:13:15 PM2017-08-09 08:43:15,806 INFO spawned: 'pxdaemon' with pid 1739
8/9/2017 2:13:15 PM2017-08-09 08:43:15,807 INFO reaped unknown pid 567
8/9/2017 2:13:15 PMPXPROCS: Started px-storage with pid 1748
8/9/2017 2:13:16 PMbash: connect: Connection refused
8/9/2017 2:13:16 PMbash: /dev/tcp/localhost/9009: Connection refused
8/9/2017 2:13:16 PMPXPROCS: px-storage not started yet...sleeping
8/9/2017 2:13:16 PMC++ grpc server listening on 0.0.0.0:9009
8/9/2017 2:13:17 PM2017-08-09 08:43:17,230 INFO success: pxdaemon entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
@karthik101 Can you paste the out of /etc/pwx/config.json?
@karthik101 are you still blocked by this?
@venkatpx Sorry, I left the portworx as it is after that, will let you know if i start working on it again. Is portworx supported out the box in rancher with kubernetes.
Yes. Portworx is supported out of the box with Kubernetes and should work rancher as Portworx has native volume driver in Kubernetes