eBayClassifiedsGroup/PanteraS

Inconsistent state

Closed this issue · 2 comments

kopax commented

I apologize if I don't provide enough information about this bug.

So far, I haven't been able to find any logs for this.

I started 3 PanteraS M + S instances without problems.

I can run a few dockerized applications on Marathon.

After a few weeks of usage, the hard drive gets more and more full.
Are there any logs we need to flush?

Also, I tried some deployments and they got stuck in the deploying state: no containers were started and there was not a single log entry in Mesos.

I wonder if this happened because a server was disconnected from the network and then reconnected, and the PaaS went into an inconsistent state.

I restarted all 3 servers by stopping PanteraS, removing the PanteraS container, and running rm -rf /tmp/mesos/*. That did the trick, but it is not a good long-term solution, because it requires restarting all the services at once.

Is there another way to get around this bug?

Regarding cleanup and proper configuration (all points are very important):

  1. Make sure that your Docker log driver IS NOT buffering all logs,
    but sends them to syslog instead.
    To do that, edit /etc/default/docker and add a parameter like:
    DOCKER_OPTS="--log-driver=syslog ${DOCKER_OPTS}"
  2. Make sure that Mesos does basic cleanup: --gc_delay=1days
    (the mesos-slave command line, as shown by ps, should contain that option)
  3. Make sure you have a cron job on the native host that cleans up Docker images, something like:
    A=$(docker images -q -f dangling=true);[ "$A" ] && docker rmi $A
  4. Make sure that the apps inside your containers log to a volume bind-mounted from the native host,
    so the container filesystem (aufs) does not grow over time.
  5. If you have experienced orphaned volumes, you might consider this cleanup tool:
    https://github.com/cloudnautique/vol-cleanup
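
Points 2 and 3 could be wrapped into small shell functions along these lines. This is only a sketch: the function names are made up for illustration, and it assumes the docker and ps binaries are on the PATH of the native host.

```shell
#!/bin/sh
# Hypothetical helpers for points 2 and 3; the names are illustrative only.

# Point 3: remove dangling (untagged) Docker images.
cleanup_dangling_images() {
    # List the IDs of dangling images; empty output means nothing to do.
    ids=$(docker images -q -f dangling=true)
    if [ -z "$ids" ]; then
        echo "no dangling images"
        return 0
    fi
    # $ids is intentionally unquoted so that each ID becomes its own argument.
    docker rmi $ids
}

# Point 2: succeed only if a running mesos-slave was started with --gc_delay.
mesos_gc_delay_set() {
    # '[m]esos-slave' keeps the grep process itself out of the match.
    ps -eo args | grep '[m]esos-slave' | grep -q -- '--gc_delay'
}
```

For example, a script that defines these functions and then calls cleanup_dangling_images could be run daily from cron on each host, and mesos_gc_delay_set could be used as a quick sanity check after starting PanteraS.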

> Also, I tried some deployments and they got stuck in the deploying state: no containers were started and there was not a single log entry in Mesos.

This happens when Mesos has no more resources (CPU/mem/disk) to offer.
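
One way to check for this is to look at the metrics snapshot of the Mesos master. The sketch below assumes that the snapshot is the flat, compact JSON object served at /metrics/snapshot, with keys such as master/cpus_total and master/cpus_used. The function names are hypothetical, and the parsing is deliberately naive (a JSON tool such as jq would be more robust).

```shell
#!/bin/sh
# Hypothetical helpers; they assume a flat, compact JSON object of "name":number pairs.

# Print the value of one metric from a snapshot read on stdin.
metric() {
    # Put every "name":value pair on its own line, then pick the requested key.
    tr ',{}' '\n' | awk -F: -v k="\"$1\"" '$1 == k { print $2 }'
}

# Usage: free_resource <snapshot-json> <cpus|mem|disk>
# Prints total minus used for that resource.
free_resource() {
    total=$(printf '%s\n' "$1" | metric "master/${2}_total")
    used=$(printf '%s\n' "$1" | metric "master/${2}_used")
    awk -v t="$total" -v u="$used" 'BEGIN { print t - u }'
}
```

Assuming the master listens on its default port 5050 on the local host, the snapshot could be fetched with snap=$(curl -s http://localhost:5050/metrics/snapshot) and then inspected with free_resource "$snap" cpus (likewise for mem and disk). A result at or near 0 would explain deployments that never start.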

> I restarted all 3 servers by stopping PanteraS, removing the PanteraS container, and running rm -rf /tmp/mesos/*. That did the trick, but it is not a good long-term solution, because it requires restarting all the services at once.

This "hard reset" is only needed after total disasters or for upgrades :)
You should definitely survive normal operation without it.

kopax commented

Thanks for all your recommendations. I will try all of them ASAP.