gmetad interactive port stops functioning occasionally
cburroughs opened this issue
This is with gmetad 3.5. I do not believe it is a new problem, but I had not previously tracked it down to the interactive port. I have not customized the number of server_threads, which I believe should leave me with the default of 4. Roughly once a week, the web UI becomes unresponsive (page loads block indefinitely). Data collection and answering non-interactive XML requests are unaffected.
echo "/?filter=summary" | nc localhost 8652
Hangs indefinitely.
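The nc check above only shows a problem if someone notices it has not returned. A minimal Python sketch of the same probe with a hard overall deadline, suitable for cron or a monitoring system, might look like this. The function name, the 10-second default and the result labels are illustrative choices, not anything gmetad defines; the sketch assumes gmetad closes the connection after answering, which is what lets nc exit normally when the port is healthy.

```python
import socket
import time


def probe_interactive_port(host: str = "127.0.0.1", port: int = 8652,
                           query: bytes = b"/?filter=summary\n",
                           timeout: float = 10.0) -> tuple[str, int]:
    """Send one request and classify the outcome within an overall deadline.

    Returns ("ok", n) if n > 0 bytes arrived before the server closed the
    connection, ("empty", 0) if it closed without sending anything,
    ("hung", n) if the deadline passed first, and ("error", n) on other
    socket errors such as a refused connection.
    """
    deadline = time.monotonic() + timeout
    received = 0
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            if query:
                sock.sendall(query)
            while True:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return ("hung", received)
                sock.settimeout(remaining)
                chunk = sock.recv(65536)
                if not chunk:  # server closed its end: response complete
                    break
                received += len(chunk)
    except socket.timeout:
        return ("hung", received)
    except OSError:
        return ("error", received)
    return ("ok", received) if received else ("empty", 0)
```

A result of ("hung", 0) on port 8652 while the non-interactive port (8651 by default) still answers, for example via probe_interactive_port(port=8651, query=b""), would match the symptom described here.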
I saw several ESTABLISHED connections to 8652. After restarting httpd (to see if it was at fault), the connections sat in CLOSE_WAIT. After the httpd restart, loading the web UI gets "There was an error collecting ganglia data (127.0.0.1:8652): XML error: Invalid document end at 1" instead of a hang. Restarting gmetad fixes the problem.
# lsof -p 2400 | grep -i 8652
gmetad 2400 nobody 1u IPv4 2388480 TCP *:8652 (LISTEN)
gmetad 2400 nobody 6u IPv4 6481200 TCP lsu02.clearspring.local:8652->lsu02.clearspring.local:51602 (CLOSE_WAIT)
gmetad 2400 nobody 7u IPv4 7138517 TCP lsu02.clearspring.local:8652->lsu02.clearspring.local:32786 (CLOSE_WAIT)
gmetad 2400 nobody 11u IPv4 7136011 TCP lsu02.clearspring.local:8652->lsu02.clearspring.local:60970 (CLOSE_WAIT)
(I am not sure why I only end up with 3 stuck sockets instead of 4.)
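CLOSE_WAIT means the remote end (here the web frontend under httpd) has already closed its side and the kernel is waiting for the local process to call close() on its descriptor, so sockets stuck in that state suggest the gmetad threads that own them never got back to closing them. To track that count over time, a small parser over lsof output is enough. This is a sketch with a hypothetical function name; it assumes the IPv4 "host:port->host:port (STATE)" format shown above and would need adjusting for IPv6 addresses.

```python
import re
from collections import Counter

# Matches the TCP part of an lsof line, for example
#   "TCP *:8652 (LISTEN)"
#   "TCP host:8652->host:51602 (CLOSE_WAIT)"
# Groups: local host, local port, remote host, remote port, state.
LSOF_TCP = re.compile(r"TCP (\S+?):(\d+)(?:->(\S+):(\d+))? \((\w+)\)")


def tcp_states_for_port(lsof_output: str, port: int) -> Counter:
    """Count TCP socket states in lsof output whose local port is `port`."""
    states: Counter = Counter()
    for line in lsof_output.splitlines():
        m = LSOF_TCP.search(line)
        if m and int(m.group(2)) == port:
            states[m.group(5)] += 1
    return states
```

Fed the `lsof -p 2400` output above, it reports 3 CLOSE_WAIT sockets and 1 LISTEN socket on port 8652.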
Thread 23 (Thread 0x418a9940 (LWP 2402)):
#0 0x0000003b31c0db3b in accept () from /lib64/libpthread.so.0
#1 0x0000000000405488 in pthread_attr_setdetachstate ()
#2 0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#3 0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 22 (Thread 0x422aa940 (LWP 2403)):
#0 0x0000003b31c0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003b31c08e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2 0x0000003b31c08cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x0000000000405474 in pthread_attr_setdetachstate ()
#4 0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#5 0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 21 (Thread 0x42cab940 (LWP 2404)):
#0 0x0000003b31c0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003b31c08e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2 0x0000003b31c08cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x0000000000404999 in pthread_attr_setdetachstate ()
#4 0x0000000000404b45 in pthread_attr_setdetachstate ()
#5 0x0000000000404a3d in pthread_attr_setdetachstate ()
#6 0x0000000000405588 in pthread_attr_setdetachstate ()
#7 0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#8 0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 20 (Thread 0x436ac940 (LWP 2405)):
#0 0x0000003b31c0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003b31c08e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2 0x0000003b31c08cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x0000000000404999 in pthread_attr_setdetachstate ()
#4 0x0000000000404b45 in pthread_attr_setdetachstate ()
#5 0x0000000000404a3d in pthread_attr_setdetachstate ()
#6 0x0000000000405588 in pthread_attr_setdetachstate ()
#7 0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#8 0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 19 (Thread 0x440ad940 (LWP 2406)):
#0 0x0000003b31c0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003b31c08e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2 0x0000003b31c08cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x0000000000404999 in pthread_attr_setdetachstate ()
#4 0x0000000000404b45 in pthread_attr_setdetachstate ()
#5 0x0000000000404a3d in pthread_attr_setdetachstate ()
#6 0x0000000000405588 in pthread_attr_setdetachstate ()
#7 0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#8 0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 18 (Thread 0x44aae940 (LWP 2407)):
#0 0x0000003b31c0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003b31c08e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2 0x0000003b31c08cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x0000000000404999 in pthread_attr_setdetachstate ()
#4 0x0000000000404b45 in pthread_attr_setdetachstate ()
#5 0x0000000000404a3d in pthread_attr_setdetachstate ()
#6 0x0000000000405588 in pthread_attr_setdetachstate ()
#7 0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#8 0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 17 (Thread 0x454af940 (LWP 2408)):
#0 0x0000003b314cd722 in select () from /lib64/libc.so.6
#1 0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2 0x000000000040440e in pthread_attr_setdetachstate ()
#3 0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4 0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 16 (Thread 0x45eb0940 (LWP 2409)):
#0 0x0000003b314cd722 in select () from /lib64/libc.so.6
#1 0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2 0x000000000040440e in pthread_attr_setdetachstate ()
#3 0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4 0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 15 (Thread 0x468b1940 (LWP 2410)):
#0 0x0000003b314cd722 in select () from /lib64/libc.so.6
#1 0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2 0x000000000040440e in pthread_attr_setdetachstate ()
#3 0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4 0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 14 (Thread 0x472b2940 (LWP 2411)):
#0 0x0000003b314cd722 in select () from /lib64/libc.so.6
#1 0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2 0x000000000040440e in pthread_attr_setdetachstate ()
#3 0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4 0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 13 (Thread 0x47cb3940 (LWP 2412)):
#0 0x0000003b314cd722 in select () from /lib64/libc.so.6
#1 0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2 0x000000000040440e in pthread_attr_setdetachstate ()
#3 0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4 0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 12 (Thread 0x486b4940 (LWP 2413)):
#0 0x0000003b314cd722 in select () from /lib64/libc.so.6
#1 0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2 0x000000000040440e in pthread_attr_setdetachstate ()
#3 0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4 0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 11 (Thread 0x490b5940 (LWP 2414)):
#0 0x0000003b314cd722 in select () from /lib64/libc.so.6
#1 0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2 0x000000000040440e in pthread_attr_setdetachstate ()
#3 0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4 0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 10 (Thread 0x49ab6940 (LWP 2415)):
#0 0x0000003b314cd722 in select () from /lib64/libc.so.6
#1 0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2 0x000000000040440e in pthread_attr_setdetachstate ()
#3 0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4 0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 9 (Thread 0x4a4b7940 (LWP 2416)):
#0 0x0000003b314cd722 in select () from /lib64/libc.so.6
#1 0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2 0x000000000040440e in pthread_attr_setdetachstate ()
#3 0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4 0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 8 (Thread 0x4aeb8940 (LWP 2417)):
#0 0x0000003b314cd722 in select () from /lib64/libc.so.6
#1 0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2 0x000000000040440e in pthread_attr_setdetachstate ()
#3 0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4 0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 7 (Thread 0x4b8b9940 (LWP 2418)):
#0 0x0000003b314cd722 in select () from /lib64/libc.so.6
#1 0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2 0x000000000040440e in pthread_attr_setdetachstate ()
#3 0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4 0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 6 (Thread 0x4c2ba940 (LWP 2419)):
#0 0x0000003b314cd722 in select () from /lib64/libc.so.6
#1 0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2 0x000000000040440e in pthread_attr_setdetachstate ()
#3 0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4 0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 5 (Thread 0x4ccbb940 (LWP 2420)):
#0 0x0000003b314cd722 in select () from /lib64/libc.so.6
#1 0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2 0x000000000040440e in pthread_attr_setdetachstate ()
#3 0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4 0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x4d6bc940 (LWP 2421)):
#0 0x0000003b31c0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003b31c08e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2 0x0000003b31c08cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x0000000000406d9e in pthread_attr_setdetachstate ()
#4 0x0000003b39809bc9 in ?? () from /lib64/libexpat.so.0
#5 0x0000003b3980ab44 in ?? () from /lib64/libexpat.so.0
#6 0x0000003b3980b66a in ?? () from /lib64/libexpat.so.0
#7 0x0000003b3980cc4b in ?? () from /lib64/libexpat.so.0
#8 0x0000003b39803ef1 in XML_ParseBuffer () from /lib64/libexpat.so.0
#9 0x0000000000405920 in pthread_attr_setdetachstate ()
#10 0x0000000000404522 in pthread_attr_setdetachstate ()
#11 0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#12 0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x4e0bd940 (LWP 2422)):
#0 0x0000003b314cd722 in select () from /lib64/libc.so.6
#1 0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2 0x000000000040440e in pthread_attr_setdetachstate ()
#3 0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4 0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x4eabe940 (LWP 2423)):
#0 0x0000003b314cd722 in select () from /lib64/libc.so.6
#1 0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2 0x00000000004091b7 in pthread_attr_setdetachstate ()
#3 0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4 0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x2ae116e772f0 (LWP 2400)):
#0 0x0000003b31c0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003b31c08e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2 0x0000003b31c08cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x0000000000403348 in pthread_attr_setdetachstate ()
#4 0x0000003b6b00b558 in hash_foreach () from /usr/lib64/libganglia-3.5.0.so.0
#5 0x00000000004030ca in pthread_attr_setdetachstate ()
#6 0x0000003b3141d994 in __libc_start_main () from /lib64/libc.so.6
#7 0x0000000000402b29 in pthread_attr_setdetachstate ()
#8 0x00007fffed188098 in ?? ()
#9 0x0000000000000000 in ?? ()
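Every gmetad frame in this dump is labeled pthread_attr_setdetachstate, most likely because the binary is stripped and gdb falls back to the nearest symbol it can find, so the names are meaningless while the return addresses are still comparable. Grouping threads by identical address stacks shows which ones are blocked on the same code path; in the dump above, for example, threads 21, 20, 19 and 18 share one stack ending in pthread_mutex_lock. A rough Python sketch, assuming the "#N 0xADDR in name" frame format shown here (the function and regex names are illustrative):

```python
import re
from collections import defaultdict

THREAD_RE = re.compile(r"^Thread (\d+) \(")
# "#3 0x0000000000404999 in pthread_attr_setdetachstate ()"; the address
# is optional because gdb omits it for some frames when symbols are present.
FRAME_RE = re.compile(r"^#\d+\s+(?:(0x[0-9a-f]+) in )?(\S+)")


def group_threads_by_stack(bt_text: str) -> dict[tuple[str, ...], list[int]]:
    """Map each distinct stack (innermost frame first) to the threads on it."""
    groups: dict[tuple[str, ...], list[int]] = defaultdict(list)
    current = None
    frames: list[str] = []
    for raw in bt_text.splitlines():
        line = raw.strip()
        header = THREAD_RE.match(line)
        if header:
            if current is not None:
                groups[tuple(frames)].append(current)
            current, frames = int(header.group(1)), []
        elif current is not None:
            frame = FRAME_RE.match(line)
            if frame:
                addr, name = frame.groups()
                # Key on name and address: names alone are useless here
                # because every gmetad frame shows the same wrong name.
                frames.append(f"{name}@{addr}" if addr else name)
    if current is not None:
        groups[tuple(frames)].append(current)
    return dict(groups)


def summarize(groups: dict[tuple[str, ...], list[int]]) -> list[str]:
    """One line per distinct stack: thread count, thread ids, innermost frame."""
    lines = []
    for stack, threads in sorted(groups.items(), key=lambda kv: -len(kv[1])):
        top = stack[0] if stack else "?"
        lines.append(f"{len(threads):3d} threads {threads}: {top}")
    return lines
```

Once debug symbols are loaded, the same grouping still works and the innermost gmetad frame names become meaningful.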
Is there any other debugging information I can provide, or should capture when this next occurs?
I would suggest starting to keep track of connections to gmetad. Something like
netstat -an | grep 8652 | wc -l
Maybe even get the breakdown by state (e.g. TIME_WAIT, ESTABLISHED) and see if that points to anything interesting.
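A hedged sketch of that suggestion in Python: it runs `netstat -an` periodically and logs a per-state count of TCP sockets whose local address is on the given port. Counting only the local side (gmetad's end) avoids double-counting loopback connections, which a plain grep for 8652 sees twice, and avoids matching unrelated ports such as 18652. It assumes the Linux net-tools column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State); the function names and the 60-second interval are arbitrary.

```python
import subprocess
import time
from collections import Counter


def count_states(netstat_output: str, port: int) -> Counter:
    """Count TCP states for sockets whose *local* address uses `port`."""
    states: Counter = Counter()
    suffix = f":{port}"
    for line in netstat_output.splitlines():
        fields = line.split()
        # Linux net-tools rows: Proto Recv-Q Send-Q Local Foreign State
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            local, state = fields[3], fields[5]
            if local.endswith(suffix):
                states[state] += 1
    return states


def monitor(port: int = 8652, interval: float = 60.0) -> None:
    """Print a timestamped state breakdown every `interval` seconds."""
    while True:
        out = subprocess.run(["netstat", "-an"], capture_output=True,
                             text=True).stdout
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(stamp, dict(count_states(out, port)), flush=True)
        time.sleep(interval)
```

A steadily growing CLOSE_WAIT count in the log would show roughly when the interactive threads stopped servicing connections.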
If this is still happening, make sure you load the debug symbols for ganglia when you run gdb. That would help a lot in understanding where exactly in the gmetad code each thread has hung.
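Following the two suggestions above, one way to make sure the right data is captured the next time the port hangs is to snapshot lsof, netstat and a full gdb backtrace together. This is a sketch, not a ganglia tool: the function names and output file naming are made up, it assumes gdb, lsof and netstat are installed and that it runs with permission to ptrace gmetad (typically as root or the gmetad user), and the gdb section is only readable once ganglia's debug symbols are installed. Attaching gdb pauses gmetad while the backtrace is taken.

```python
import datetime
import subprocess


def snapshot_commands(pid: int) -> dict[str, list[str]]:
    """Commands whose output is worth keeping when the interactive port hangs."""
    return {
        "lsof": ["lsof", "-p", str(pid)],
        "netstat": ["netstat", "-an"],
        # -batch makes gdb exit after running the -ex commands; attaching
        # with -p briefly stops the process.
        "gdb": ["gdb", "-p", str(pid), "-batch",
                "-ex", "set pagination off",
                "-ex", "thread apply all bt full"],
    }


def capture_snapshot(pid: int) -> str:
    """Run each command and write all output to one timestamped file."""
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    path = f"gmetad-{pid}-{stamp}.txt"
    with open(path, "w") as out:
        for name, cmd in snapshot_commands(pid).items():
            out.write(f"===== {name}: {' '.join(cmd)} =====\n")
            try:
                result = subprocess.run(cmd, capture_output=True, text=True)
            except FileNotFoundError:
                out.write(f"{cmd[0]} not installed\n")
                continue
            out.write(result.stdout)
            out.write(result.stderr)
    return path
```

For the process in this report, that would be capture_snapshot(2400), using the PID from the lsof output.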
I do not believe I have seen this in a while, at least since I installed debug symbols.