Inconsistent state
I'd like to apologize in advance if I don't provide enough information on this bug.
So far, I haven't been able to find any logs for it.
I have started 3 PanteraS M + S (master + slave) instances with no problem.
I can run a few dockerized applications on Marathon.
After a few weeks of usage, the hard drive gets more and more full.
Is there any kind of log we need to flush?
Also, I have tried some deployments and they got stuck: no containers get started, and there is not a single log entry in Mesos.
I wonder if this happened because a server got disconnected/reconnected from the network and the PaaS went into an inconsistent state.
Restarting all 3 servers by stopping PanteraS, removing the PanteraS containers, and running rm -rf /tmp/mesos/* did the trick, but it is not a good long-term solution: it requires restarting all the services at once.
Is there another way to get around this bug?
Regarding cleanups and proper config (all points are very important):
- Make sure that your Docker log-driver is NOT buffering all logs,
but sending them to syslog instead (see the verification sketch after this list).
To do that, edit /etc/default/docker
and add a parameter like:
DOCKER_OPTS="--log-driver=syslog ${DOCKER_OPTS}"
- Make sure that Mesos does basic cleanup: --gc_delay=1days
(you can check the mesos-slave process with the ps command; it should contain that option).
- Make sure you have a cron job on the native host that cleans up Docker images, something like (a sample crontab entry follows this list):
A=$(docker images -q -f dangling=true); [ "$A" ] && docker rmi $A
- Make sure that your apps inside containers log to a volume bind-mounted from the native host,
so the container volume (aufs) does not grow over time (see the example after this list).
- If you have experienced orphaned volumes, you might think about this cleanup tool:
https://github.com/cloudnautique/vol-cleanup
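A quick way to verify the first two points (only a sketch; the exact docker info output depends on your Docker version, and <container> is a placeholder):
# Daemon-wide log driver (expecting "syslog" instead of the default json-file):
docker info | grep -i 'logging driver'
# Or per container:
docker inspect --format '{{.HostConfig.LogConfig.Type}}' <container>
# Confirm the mesos-slave process was started with the gc_delay option:
ps aux | grep '[m]esos-slave' | grep -o 'gc_delay=[^ ]*'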
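The image cleanup can be scheduled from a cron entry on the host; the file name and schedule below are arbitrary assumptions:
# /etc/cron.d/docker-image-cleanup -- remove dangling images every night at 03:00
0 3 * * * root A=$(docker images -q -f dangling=true); [ "$A" ] && docker rmi $A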
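And a minimal illustration of the bind-mounted log volume; image name and paths are hypothetical, the point is that logs land on the host filesystem instead of inside the container's aufs layer:
# Hypothetical app that writes its logs to /app/logs inside the container
docker run -d -v /var/log/myapp:/app/logs myapp-image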
Also, I have tried some deployments and they got stuck: no containers get started, and there is not a single log entry in Mesos.
This happens when Mesos has no more resources (CPU/mem/disk).
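One way to check whether that is the case (a sketch, assuming the Mesos master listens on the default port 5050 and jq is installed) is to compare used vs. total resources reported by the master:
curl -s http://localhost:5050/metrics/snapshot | jq '{
  cpus_used:  ."master/cpus_used",  cpus_total: ."master/cpus_total",
  mem_used:   ."master/mem_used",   mem_total:  ."master/mem_total",
  disk_used:  ."master/disk_used",  disk_total: ."master/disk_total"
}'
If "used" is at or near "total" for CPU, memory, or disk, new deployments will just sit and wait for offers.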
Restarting all 3 servers by stopping PanteraS, removing the PanteraS containers, and running rm -rf /tmp/mesos/* did the trick, but it is not a good long-term solution: it requires restarting all the services at once.
This "hard reset" is only need on total disasters, or upgrades :)
your definitely should survive normal work without that.
Thanks for all your recommendations. I will try all of them asap.