magneticio/vamp

Vamp shows deployment status FAILED after a couple of days. Process is fine.

tnolet opened this issue · 4 comments

An extremely simple deployment of 1 container is marked as failed after 5 days living on a cluster.
The error, taken from the API, is the following. However, the service runs fine as in the process is responding and everything works as expected.

services": [
        {
          "status": {
            "intention": "Deployment",
            "since": "2017-09-06T21:52:34.817Z",
            "phase": {
              "name": "Failed",
              "since": "2017-09-06T21:52:34.817Z",
              "notification": "Deployment service error was not restored. Deployment service error for deployment 'simpleservice:test' and service 'simpleservice:1.1.0'."
            }
          },

DCOS 1.9.0
VAMP 0.9.5-1237-g7000ce10

This is a known issue, and has to do with the synchronization handler that sometimes is not able to synchronize the Vamp state with the marathon/mesos state, especially after a Marathon or ZK restart.

It happens almost always after a Vamp (container) restart, and then restarting Zookeeper. Errors in the Vamp UI log then show: (nb this is wile Zookeeper is still recovering)

14:34:54.502	ERROR	io.vamp.common.notification.Notification	Deployment service error for deployment 'sava' and service 'sava:1.0.0'.
14:34:54.504	ERROR	io.vamp.common.notification.Notification	Deployment service error for deployment 'sava' and service 'sava:1.1.0'.
14:34:54.504	ERROR	io.vamp.common.notification.Notification	Deployment service error for deployment 'sava2' and service 'sava-backend1:1.2.0'.
14:34:54.505	ERROR	io.vamp.common.notification.Notification	Deployment service error for deployment 'sava2' and service 'sava-backend2:1.2.0'.
14:34:54.505	ERROR	io.vamp.common.notification.Notification	Deployment service error for deployment 'sava2' and service 'sava-frontend:1.2.0'.
14:34:54.505	ERROR	io.vamp.common.notification.Notification	Deployment service error for deployment 'sava2' and service 'sava-frontend:1.3.0'.
14:34:54.505	ERROR	io.vamp.common.notification.Notification	Deployment service error for deployment 'sava2' and service 'sava-backend:1.3.0'.

errors in Vamp container log:

�[36m12:09:18.461�[0;39m �[34m| INFO  | i.v.c.marathon.MarathonDriverActor       | /vamp/user/vamp/marathon-driver-actor-63 | marathon create workflow: allocation
�[0;39m�[36m12:09:18.465�[0;39m �[1;31m| ERROR | akka.actor.RepointableActorRef           | ervisor-8/flow-16283-0-unknown-operation | Error in stage [Recover(<function1>)]: Tcp command [Connect(marathon.mesos:8080,None,List(),Some(10 seconds),true)] failed
�[0;39mio.vamp.common.http.HttpClientException: Tcp command [Connect(marathon.mesos:8080,None,List(),Some(10 seconds),true)] failed
	at io.vamp.common.http.HttpClient$$anonfun$recoverWith$1$1.applyOrElse(HttpClient.scala:128)
	at io.vamp.common.http.HttpClient$$anonfun$recoverWith$1$1.applyOrElse(HttpClient.scala:117)
	at akka.stream.impl.fusing.Recover$$anon$11.onUpstreamFailure(Ops.scala:255)
	at akka.stream.impl.fusing.GraphInterpreter.processEvent(GraphInterpreter.scala:733)
	at akka.stream.impl.fusing.GraphInterpreter.execute(GraphInterpreter.scala:616)
	at akka.stream.impl.fusing.GraphInterpreterShell.runBatch(ActorGraphInterpreter.scala:471)
	at akka.stream.impl.fusing.GraphInterpreterShell.receive(ActorGraphInterpreter.scala:423)
	at akka.stream.impl.fusing.ActorGraphInterpreter.akka$stream$impl$fusing$ActorGraphInterpreter$$processEvent(ActorGraphInterpreter.scala:603)
	at akka.stream.impl.fusing.ActorGraphInterpreter$$anonfun$receive$1.applyOrElse(ActorGraphInterpreter.scala:618)
	at akka.actor.Actor.aroundReceive(Actor.scala:496)
	at akka.actor.Actor.aroundReceive$(Actor.scala:494)
	at akka.stream.impl.fusing.ActorGraphInterpreter.aroundReceive(ActorGraphInterpreter.scala:529)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
	at akka.actor.ActorCell.invoke(ActorCell.scala:495)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
	at akka.dispatch.Mailbox.run(Mailbox.scala:224)
	at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
	at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
	at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
	at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)

NB This state also can appear when running "in-memory" without SQL persistence.

On some setups restarting ZK doesn't seem to do a lot. To reproduce the failed state issue A) deploy the blueprints from the first 4 tutorials B) give the services in the deployments some higher scale (ie more instances), C) now restart Vamp container in DCOS, D) after Vamp has restarted correctly go into DCOS/Marathon and change scale of one vamp-managed container, look in Vamp and see the status of the deployment change to "deploying" and then quickly to "failed".

After going into the deployment, opening scale, and saving it again, the deployment is "green" again.

If changing scale of this specific container in DCOS no failed state will appear, but when changing scale of another container the failed state will appear. Only after forcefully setting scale in Vamp again is this specific container "safe" for DCOS-initiated scale changes.

Restarting Vamp before this is crucial. Also relevant seems to be that after a Vamp container restart SQL persistence actor timeout errors appear in the Vamp log. It could be that the "graph" that Vamp has in memory of a deployment is not read/set correctly due to these timeouts, and thus the sync event is not handled correctly?

�[36m21:42:32.202�[0;39m �[1;31m| ERROR | io.vamp.http_api.HttpApiRoute            |                                          | Request to /deployments could not be handled normally: Ask timed out on [Actor[akka://vamp/user/vamp/my-sql-persistence-actor-43#-649154374]] after [10000 ms]. Sender[null] sent message of type "io.vamp.persistence.CommonPersistenceMessages$All".
�[0;39m�[36m21:42:33.090�[0;39m �[1;31m| ERROR | io.vamp.common.notification.Notification |                                          | Internal server error.
�[0;39m�[36m21:42:33.101�[0;39m �[1;31m| ERROR | io.vamp.common.notification.Notification |                                          | Ask timed out on [Actor[akka://vamp/user/vamp/my-sql-persistence-actor-43#-649154374]] after [10000 ms]. Sender[null] sent message of type "io.vamp.persistence.CommonPersistenceMessages$All".
�[0;39makka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://vamp/user/vamp/my-sql-persistence-actor-43#-649154374]] after [10000 ms]. Sender[null] sent message of type "io.vamp.persistence.CommonPersistenceMessages$All".
	at akka.pattern.PromiseActorRef$.$anonfun$apply$1(AskSupport.scala:604)
	at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
	at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:864)
	at scala.concurrent.BatchingExecutor.execute(BatchingExecutor.scala:109)
	at scala.concurrent.BatchingExecutor.execute$(BatchingExecutor.scala:103)
	at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:862)
	at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
	at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
	at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
	at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
	at java.lang.Thread.run(Thread.java:745)