Cacti poller stalls if ssh process stalls
GoogleCodeExporter opened this issue · 1 comments
GoogleCodeExporter commented
I'm not entirely sure if this is a "Better Cacti Graphs" problem or a generic
Cacti problem. Maybe it's both.
We are occasionally seeing an ssh process stalling on the Cacti box. It looks
like this:
ssh -q -o ConnectTimeout 10 -o StrictHostKeyChecking no cactiuser@192.168.1.1
-p 22 -i /usr/local/etc/cacti/id_rsa wget -U Cacti/1.0 -q -O - -T 5
"http://localhost/server-status?auto"
The memcached check for this system is also stalled at the moment. I've seen
these processes stalled for up to an hour before I caught them; presumably they
would have stayed stalled longer.
The server it's trying to check does not currently have an appropriate ssh key,
so these queries haven't been working anyway. If I try that command myself, it
prompts me for a password, but that isn't the cause, or the stall would happen
every time. I have only ever seen it happen with this server.
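To rule the password prompt out, the ssh command could be built with BatchMode=yes, which makes ssh fail instead of asking for a password, plus keepalive options so that a dead connection is dropped after a bounded time. This is only a sketch: build_ssh_command() is a hypothetical helper, not something in ss_get_by_ssh.php, and the keepalive values are arbitrary assumptions. Whether it cures the stall seen here is untested.

```php
<?php
// Hypothetical helper (not part of ss_get_by_ssh.php): build an ssh command
// line that never prompts and gives up on an unresponsive connection.
function build_ssh_command($user, $host, $port, $key, $remote_cmd) {
    $opts = array(
        'ConnectTimeout=10',         // as in the original command
        'StrictHostKeyChecking=no',  // as in the original command
        'BatchMode=yes',             // assumption: fail rather than prompt for a password
        'ServerAliveInterval=5',     // assumption: probe the server every 5 seconds
        'ServerAliveCountMax=2',     // assumption: disconnect after 2 missed probes
    );
    $cmd = 'ssh -q';
    foreach ($opts as $opt) {
        $cmd .= ' -o ' . escapeshellarg($opt);
    }
    $cmd .= ' -p ' . escapeshellarg((string) $port)
          . ' -i ' . escapeshellarg($key)
          . ' ' . escapeshellarg($user . '@' . $host)
          . ' ' . escapeshellarg($remote_cmd);
    return $cmd;
}
```

ConnectTimeout only bounds the connection attempt; the ServerAlive options are what would bound a session that hangs after connecting.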
It doesn't happen immediately or reliably, but once the ssh process has stalled,
the poller stalls every time it reaches that host and doesn't process any hosts
after it. This means that all the graphs for hosts that were created after that
one (i.e., have a higher Host ID) stop working, and we get an extra couple of
processes running on the Cacti machine every 5 minutes.
The ss_get_by_ssh.php script that spawned the stalled ssh process has a write
lock on the cache file and the subsequent ones have it open for writing but
with no write lock.
My suspicion is that this is the reason for the poller stalling. Cacti has no
timeout for local scripts and ss_get_by_ssh.php has no timeout for getting a
write lock on the cache file.
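One way to bound the wait for the cache file lock, without relying on a native lock timeout, is to poll with a non-blocking flock() until a deadline passes. The following is a minimal sketch of that idea; lock_with_timeout() and its parameters are hypothetical names, not part of ss_get_by_ssh.php, and LOCK_NB behaviour should be checked on every platform the script has to support.

```php
<?php
// Hypothetical helper: try to take an exclusive lock, but give up after
// $timeout_seconds instead of blocking forever.
function lock_with_timeout($handle, $timeout_seconds, $poll_usec = 100000) {
    $deadline = microtime(true) + $timeout_seconds;
    do {
        // LOCK_NB makes flock() return false immediately if the lock is held.
        if (flock($handle, LOCK_EX | LOCK_NB)) {
            return true;
        }
        usleep($poll_usec);   // wait a little before trying again
    } while (microtime(true) < $deadline);
    return false;             // caller can skip the cache or report an error
}
```

A caller would then do something like "if (!lock_with_timeout($fp, 10)) { exit(1); }" so that a stuck lock holder costs one poll cycle for one host rather than blocking every later host.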
Killing the ssh process (or all of them if multiple have stalled) starts
everything working again.
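Killing the process by hand could in principle be automated by running the external command under a deadline. The sketch below is one way to do that with proc_open() and stream_select(); run_with_deadline() is a hypothetical helper, the array form of proc_open() assumes PHP 7.4 or later, and stream_select() on process pipes is assumed to work as it does on Linux.

```php
<?php
// Hypothetical helper: run a command and kill it if it is still running
// after $timeout seconds. Returns the captured stdout, or null on timeout
// or if the process could not be started.
function run_with_deadline(array $argv, $timeout) {
    $spec = array(1 => array('pipe', 'w'));      // capture stdout only
    $proc = proc_open($argv, $spec, $pipes);     // array form: no shell involved
    if (!is_resource($proc)) {
        return null;
    }
    stream_set_blocking($pipes[1], false);
    $deadline = microtime(true) + $timeout;
    $output = '';
    while (true) {
        if (microtime(true) >= $deadline) {
            proc_terminate($proc, 9);            // 9 = SIGKILL
            fclose($pipes[1]);
            proc_close($proc);
            return null;
        }
        $read = array($pipes[1]);
        $write = null;
        $except = null;
        // Wake up at least every 200 ms to re-check the deadline.
        $ready = stream_select($read, $write, $except, 0, 200000);
        if ($ready === false) {
            break;                               // select failed: stop reading
        }
        if ($ready > 0) {
            $chunk = fread($pipes[1], 8192);
            if ($chunk === false || ($chunk === '' && feof($pipes[1]))) {
                break;                           // child closed its stdout
            }
            $output .= $chunk;
        }
    }
    fclose($pipes[1]);
    proc_close($proc);
    return $output;
}
```

For example, run_with_deadline(array('ssh', ...), 30) would return null instead of hanging if ssh never finishes, and the script could then release its cache lock and exit.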
Reproducing the exact problem is difficult. I can't even reliably manage it on
the systems we have here. I just have to wait until it happens. Creating a
Cacti setup using ss_get_by_ssh.php with a host with no SSH key may work.
Reproducing something that looks like this issue is easy. I created a simple
PHP script that opened one of the cache files with a write lock.
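<?php
// Hold an exclusive lock on a poller cache file to simulate a stalled script.
$handle = fopen("/tmp/192.168.1.1_apache_localhost__cacti_stats.txt", "r+");
if ($handle === false) {
    die("Cannot open cache file\n");
}
flock($handle, LOCK_EX);   // anything else locking this file now blocks
sleep(1800);               // hold the lock for 30 minutes
flock($handle, LOCK_UN);
fclose($handle);
?>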
Run this and wait for the poller to run from cron.
Replacing wget (the command the ssh line above runs) on a target system with a
script that sleeps for 1800 seconds would also work.
Original issue reported on code.google.com by ladadad...@gmail.com
on 25 Oct 2010 at 4:27
GoogleCodeExporter commented
I'm not sure how to address this. Have you learned anything more about the
problem?
I don't think that a timeout on the lock call is supported everywhere.
Original comment by baron.schwartz
on 15 Jan 2011 at 6:04
- Changed state: Accepted
- Added labels: Type-Other
- Removed labels: Type-Defect