cloudfoundry-community/logsearch-boshrelease

Post-start script randomly failing

Closed this issue · 12 comments

I'm testing out the new functionality using delayed shard allocation instead of disabling it during deployments. Most of the time it seems to work fine but randomly I'm getting failures where the Elasticsearch job hasn't fully connected to the cluster by the time the post-start script starts up, so when it queries the status from localhost it's getting a "green" status based on invalid info as it doesn't know the real status yet.

Theoretically you could just add in a sleep at the top of the script to give the job more time, but I think it's probably better to just have all the curl calls in the script go to the first master node to ensure it's always getting accurate info, as opposed to the mix of the first master and localhost in there now.

Hi @amhuber,
thank you for report.
Could you tell us more about your environment? How many indexes do you have, how much data there? Did you tried to increase elasticsearch.recovery.delay_allocation interval?

Currently I've got 3 masters and 6 data nodes with 3 shards and 1 replica per index to get one shard per node. We're expecting to keep 7 days of data which will be about 1TB, but at present we've got about 2.5 days of data which is 300GB. Here is current status for example:

{
  "cluster_name" : "logsearch",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 9,
  "number_of_data_nodes" : 6,
  "active_primary_shards" : 3034,
  "active_shards" : 6068,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

Here is the log from the post-start script that failed:

{"acknowledged":true,"persistent":{},"transient":{"cluster":{"routing":{"allocation":{"enable":"all"}}}}}
body: {"cluster_name":"logsearch","status":"green","timed_out":false,"number_of_nodes":5,"number_of_data_nodes":2,"active_primary_shards":1,"active_shards":2,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":100.0}
status: 200
{"cluster_name":"logsearch","status":"green","timed_out":false,"number_of_nodes":5,"number_of_data_nodes":2,"active_primary_shards":1,"active_shards":2,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":100.0}
{"acknowledged":true}{"acknowledged":true,"persistent":{},"transient":{"cluster":{"routing":{"allocation":{"enable":"all"}}}}}

How I'm reading that is that when it checked the status from localhost it reported as green but it clearly doesn't know anything about the actual cluster status yet as you can see it's only reporting 2 total shards and the wrong number of nodes. As soon as it failed I queried the health from localhost on that node and it reported as expected, so I assume this is just a timing issue.

For confirmation, here is the file info on the post-start log:

  File: ‘post-start.stdout.log’
  Size: 1021            Blocks: 8          IO Block: 4096   regular file
Device: 812h/2066d      Inode: 652864      Links: 1
Access: (0640/-rw-r-----)  Uid: ( 1000/    vcap)   Gid: ( 1000/    vcap)
Access: 2017-03-08 21:04:32.665161744 +0000
Modify: 2017-03-08 20:20:13.547649421 +0000
Change: 2017-03-08 20:20:13.547649421 +0000
 Birth: -

The corresponding entries from the Elasticsearch log:

[2017-03-08 20:20:11,477][INFO ][node                     ] [elk_data/3] starting ...
[2017-03-08 20:20:12,028][INFO ][transport                ] [elk_data/3] publish_address {x.x.x.x:9300}, bound_addresses {0.0.0.0:9300}
[2017-03-08 20:20:12,033][INFO ][discovery                ] [elk_data/3] logsearch/fzhIGbEeS2uednXorvFzqA
[2017-03-08 20:20:15,673][INFO ][cluster.service          ] [elk_data/3] detected_master {elk_master/0}{kv5ex88sSwirVDnkLZUZcQ}<snipped>
[2017-03-08 20:20:16,544][INFO ][http                     ] [elk_data/3] publish_address {x.x.x.x:9200}, bound_addresses {0.0.0.0:9200}
[2017-03-08 20:20:16,544][INFO ][node                     ] [elk_data/3] started

Comparing the timestamps, the entire post-start script ran and finished at 20:20:13 from the modify time, and Elasticsearch wasn't fully loaded until 20:20:16. I'm not sure what part of the script actually errored as there is no detail in stderr but something clearly set a non-zero exit code. Here is the error from BOSH:

Action Failed get_task: Task 1771190f-1db4-4496-409f-fe6d5a1d7891 result: 1 of 1 post-start scripts failed. Failed Jobs: elasticsearch. (00:02:14)
Error 450001: Action Failed get_task: Task 1771190f-1db4-4496-409f-fe6d5a1d7891 result: 1 of 1 post-start scripts failed. Failed Jobs: elasticsearch.

To be clear, the script is working most of the time, I'm just getting these errors randomly on some nodes apparently based on the timing. A simple delay would probably fix it, or getting the status from the master, or even looping until "node started" appears in the Elasticsearch log.

Thinking about this further, I really think it would be better to get the status from the master. If this node never joins the master successfully for some reason it will apparently report "green" even when the cluster isn't and BOSH will move on to the next node before the current one recovered which is what you're trying to prevent in the first place. :-)

Sorry for the repeated updates, but I think I just figured out what is happening. I was assuming the post-start log got truncated on each run so the output above was from only the current run, but it keeps appending so the first three lines are from the initial deployment:

{"acknowledged":true,"persistent":{},"transient":{"cluster":{"routing":{"allocation":{"enable":"all"}}}}}
body: {"cluster_name":"logsearch","status":"green","timed_out":false,"number_of_nodes":5,"number_of_data_nodes":2,"active_primary_shards":1,"active_shards":2,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":100.0}
status: 200
{"cluster_name":"logsearch","status":"green","timed_out":false,"number_of_nodes":5,"number_of_data_nodes":2,"active_primary_shards":1,"active_shards":2,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":100.0}
{"acknowledged":true}{"acknowledged":true,"persistent":{},"transient":{"cluster":{"routing":{"allocation":{"enable":"all"}}}}}

That makes more sense why it was green with no shards - those are the initial Kibana shards with no data at that point.

The last line is from the current deployment, so the allocation enable to the master worked and then the query status just failed as Elasticsearch hadn't started yet, which is why the script exited with a non-zero code.

That makes much more sense. It would also be nice to get time stamps on the entries in the post-start log to make figuring that out a bit easier.

Default elasticsearch.health.timeout value is 300 (5 min).
When you have a lot of data in your cluster, is possibly takes longer to restart a node. Please check time stamps in the logs when Elasticsearch starts. Maybe your solution is just increasing timeout value.

I don't think you're understanding the issue. The problem is with this command:

https://github.com/cloudfoundry-community/logsearch-boshrelease/blob/develop/jobs/elasticsearch/templates/bin/post-start.erb#L15

The very first time it's run if the Elasticsearch daemon hasn't fully started and joined the cluster it will fail, resulting in the script failing. The timeout value you mentioned is irrelevant, it just determines how long the script will wait for the health to change from yellow to green.

Just making sure @jmcarp sees this as he wrote the new script. I can submit a pull request if desired but basically I think the script should be changed as follows:

  • All curl requests should go to the first master node. In this case we don't really care what the local node thinks the cluster status is, we can just check the master to see what it really is. That eliminates any concern about whether the local node has fully started or successfully joined the cluster - until the cluster is green we don't want to take down the next node.
  • With that change you can simplify the logic as you'd no longer need to differentiate between the master or data nodes to check health or health?local=true

amhuber@59d6cda

I can submit a pull if you agree.

One more catch for @jmcarp
Since cluster_monitor job using the same elasticsearch template - it is also using post-start script to make sure it's back online after recreate. But since it contain just a single data node - it will never come to green state, because there are to place for replicas. Because of that, any cluster update should fail :-(
WDYT?

I added an additional commit to my proposed fix:

amhuber@2726412

It adds a 15 second delay on starting the script if the current node is the first master node to make sure that the service is responding before trying to send curl requests to it. I've fully tested this logic in my large deployment and everything is working at this point, but I do not have a cluster_monitor deployed so additional work will be needed to fix the issue reported above.

Any update on how you want to address this issue?

Just noticed that this commit as part of the ELK5 updates should resolve the issue:

2fe4a4c

I'll close this for now until we can test using the new code.