ganglia/monitor-core

gmetad interactive port stops functioning occasionally

cburroughs opened this issue · 4 comments

This is with gmetad 3.5. I do not believe it is a new problem, but I had not previously tracked it down to the interactive port. I have not customized the number of server_threads, which I believe leaves me with the default of 4. About once a week the web UI becomes unresponsive (page loads block indefinitely). Data collection and answering non-interactive XML requests are unaffected.

echo "/?filter=summary" |   nc localhost 8652

Hangs indefinitely.
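To catch the hang without tying up a terminal, the same query can be wrapped in a time limit. This is only a sketch: the function name `probe_gmetad`, the 10 second default, and the ok/hung/error labels are my own choices, and it assumes coreutils `timeout` and an `nc` that exits once gmetad closes the connection.

```shell
# Send the summary query to the interactive port and give up after a time limit.
# Prints "ok", "hung", or "error". Names and defaults here are illustrative.
probe_gmetad() {
    host="${1:-localhost}"
    port="${2:-8652}"
    limit="${3:-10}"          # seconds before the query is considered hung
    rc=0
    echo "/?filter=summary" | timeout "$limit" nc "$host" "$port" > /dev/null 2>&1 || rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "ok"
    elif [ "$rc" -eq 124 ]; then
        # coreutils timeout exits with 124 when the time limit was reached
        echo "hung"
    else
        echo "error"
    fi
}

# Example: probe_gmetad localhost 8652 10
```

Run from cron, a "hung" result would mark roughly when the port stopped answering.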

I saw several ESTABLISHED connections to 8652. After restarting httpd (to see if it was at fault), the connections sat in CLOSE_WAIT. After the httpd restart, trying to load the web UI gets "There was an error collecting ganglia data (127.0.0.1:8652): XML error: Invalid document end at 1" instead of a hang. Restarting gmetad fixes the problem.

# lsof -p 2400 | grep -i 8652
gmetad  2400 nobody    1u  IPv4            2388480             TCP *:8652 (LISTEN)
gmetad  2400 nobody    6u  IPv4            6481200             TCP lsu02.clearspring.local:8652->lsu02.clearspring.local:51602 (CLOSE_WAIT)
gmetad  2400 nobody    7u  IPv4            7138517             TCP lsu02.clearspring.local:8652->lsu02.clearspring.local:32786 (CLOSE_WAIT)
gmetad  2400 nobody   11u  IPv4            7136011             TCP lsu02.clearspring.local:8652->lsu02.clearspring.local:60970 (CLOSE_WAIT)

(I am not sure why I only end up with 3 stuck sockets instead of 4.)

Thread 23 (Thread 0x418a9940 (LWP 2402)):
#0  0x0000003b31c0db3b in accept () from /lib64/libpthread.so.0
#1  0x0000000000405488 in pthread_attr_setdetachstate ()
#2  0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#3  0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 22 (Thread 0x422aa940 (LWP 2403)):
#0  0x0000003b31c0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003b31c08e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2  0x0000003b31c08cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000000405474 in pthread_attr_setdetachstate ()
#4  0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#5  0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 21 (Thread 0x42cab940 (LWP 2404)):
#0  0x0000003b31c0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003b31c08e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2  0x0000003b31c08cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000000404999 in pthread_attr_setdetachstate ()
#4  0x0000000000404b45 in pthread_attr_setdetachstate ()
#5  0x0000000000404a3d in pthread_attr_setdetachstate ()
#6  0x0000000000405588 in pthread_attr_setdetachstate ()
#7  0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#8  0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 20 (Thread 0x436ac940 (LWP 2405)):
#0  0x0000003b31c0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003b31c08e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2  0x0000003b31c08cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000000404999 in pthread_attr_setdetachstate ()
#4  0x0000000000404b45 in pthread_attr_setdetachstate ()
#5  0x0000000000404a3d in pthread_attr_setdetachstate ()
#6  0x0000000000405588 in pthread_attr_setdetachstate ()
#7  0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#8  0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 19 (Thread 0x440ad940 (LWP 2406)):
#0  0x0000003b31c0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003b31c08e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2  0x0000003b31c08cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000000404999 in pthread_attr_setdetachstate ()
#4  0x0000000000404b45 in pthread_attr_setdetachstate ()
#5  0x0000000000404a3d in pthread_attr_setdetachstate ()
#6  0x0000000000405588 in pthread_attr_setdetachstate ()
#7  0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#8  0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 18 (Thread 0x44aae940 (LWP 2407)):
#0  0x0000003b31c0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003b31c08e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2  0x0000003b31c08cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000000404999 in pthread_attr_setdetachstate ()
#4  0x0000000000404b45 in pthread_attr_setdetachstate ()
#5  0x0000000000404a3d in pthread_attr_setdetachstate ()
#6  0x0000000000405588 in pthread_attr_setdetachstate ()
#7  0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#8  0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 17 (Thread 0x454af940 (LWP 2408)):
#0  0x0000003b314cd722 in select () from /lib64/libc.so.6
#1  0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2  0x000000000040440e in pthread_attr_setdetachstate ()
#3  0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4  0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 16 (Thread 0x45eb0940 (LWP 2409)):
#0  0x0000003b314cd722 in select () from /lib64/libc.so.6
#1  0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2  0x000000000040440e in pthread_attr_setdetachstate ()
#3  0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4  0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 15 (Thread 0x468b1940 (LWP 2410)):
#0  0x0000003b314cd722 in select () from /lib64/libc.so.6
#1  0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2  0x000000000040440e in pthread_attr_setdetachstate ()
#3  0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4  0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 14 (Thread 0x472b2940 (LWP 2411)):
#0  0x0000003b314cd722 in select () from /lib64/libc.so.6
#1  0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2  0x000000000040440e in pthread_attr_setdetachstate ()
#3  0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4  0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 13 (Thread 0x47cb3940 (LWP 2412)):
#0  0x0000003b314cd722 in select () from /lib64/libc.so.6
#1  0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2  0x000000000040440e in pthread_attr_setdetachstate ()
#3  0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4  0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 12 (Thread 0x486b4940 (LWP 2413)):
#0  0x0000003b314cd722 in select () from /lib64/libc.so.6
#1  0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2  0x000000000040440e in pthread_attr_setdetachstate ()
#3  0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4  0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 11 (Thread 0x490b5940 (LWP 2414)):
#0  0x0000003b314cd722 in select () from /lib64/libc.so.6
#1  0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2  0x000000000040440e in pthread_attr_setdetachstate ()
#3  0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4  0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 10 (Thread 0x49ab6940 (LWP 2415)):
#0  0x0000003b314cd722 in select () from /lib64/libc.so.6
#1  0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2  0x000000000040440e in pthread_attr_setdetachstate ()
#3  0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4  0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 9 (Thread 0x4a4b7940 (LWP 2416)):
#0  0x0000003b314cd722 in select () from /lib64/libc.so.6
#1  0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2  0x000000000040440e in pthread_attr_setdetachstate ()
#3  0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4  0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 8 (Thread 0x4aeb8940 (LWP 2417)):
#0  0x0000003b314cd722 in select () from /lib64/libc.so.6
#1  0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2  0x000000000040440e in pthread_attr_setdetachstate ()
#3  0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4  0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 7 (Thread 0x4b8b9940 (LWP 2418)):
#0  0x0000003b314cd722 in select () from /lib64/libc.so.6
#1  0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2  0x000000000040440e in pthread_attr_setdetachstate ()
#3  0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4  0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 6 (Thread 0x4c2ba940 (LWP 2419)):
#0  0x0000003b314cd722 in select () from /lib64/libc.so.6
#1  0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2  0x000000000040440e in pthread_attr_setdetachstate ()
#3  0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4  0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 5 (Thread 0x4ccbb940 (LWP 2420)):
#0  0x0000003b314cd722 in select () from /lib64/libc.so.6
#1  0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2  0x000000000040440e in pthread_attr_setdetachstate ()
#3  0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4  0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x4d6bc940 (LWP 2421)):
#0  0x0000003b31c0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003b31c08e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2  0x0000003b31c08cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000000406d9e in pthread_attr_setdetachstate ()
#4  0x0000003b39809bc9 in ?? () from /lib64/libexpat.so.0
#5  0x0000003b3980ab44 in ?? () from /lib64/libexpat.so.0
#6  0x0000003b3980b66a in ?? () from /lib64/libexpat.so.0
#7  0x0000003b3980cc4b in ?? () from /lib64/libexpat.so.0
#8  0x0000003b39803ef1 in XML_ParseBuffer () from /lib64/libexpat.so.0
#9  0x0000000000405920 in pthread_attr_setdetachstate ()
#10 0x0000000000404522 in pthread_attr_setdetachstate ()
#11 0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#12 0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x4e0bd940 (LWP 2422)):
#0  0x0000003b314cd722 in select () from /lib64/libc.so.6
#1  0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2  0x000000000040440e in pthread_attr_setdetachstate ()
#3  0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4  0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x4eabe940 (LWP 2423)):
#0  0x0000003b314cd722 in select () from /lib64/libc.so.6
#1  0x0000003b3341f915 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2  0x00000000004091b7 in pthread_attr_setdetachstate ()
#3  0x0000003b31c0673d in start_thread () from /lib64/libpthread.so.0
#4  0x0000003b314d44bd in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x2ae116e772f0 (LWP 2400)):
#0  0x0000003b31c0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003b31c08e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2  0x0000003b31c08cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000000403348 in pthread_attr_setdetachstate ()
#4  0x0000003b6b00b558 in hash_foreach () from /usr/lib64/libganglia-3.5.0.so.0
#5  0x00000000004030ca in pthread_attr_setdetachstate ()
#6  0x0000003b3141d994 in __libc_start_main () from /lib64/libc.so.6
#7  0x0000000000402b29 in pthread_attr_setdetachstate ()
#8  0x00007fffed188098 in ?? ()
#9  0x0000000000000000 in ?? ()

Is there any other debugging information I can provide, or should capture when this next occurs?

I would suggest starting to keep track of connections to gmetad. Something like

netstat -an | grep 8652 | wc -l

Maybe even get the breakdown by state (i.e. TIME_WAIT, ESTABLISHED) and see if that points to anything interesting.
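A per-state breakdown could look roughly like the following. It is a sketch that assumes the Linux `netstat -ant` column layout (local address in column 4, state in column 6). The helper name `count_states` is made up for illustration, and it only counts sockets whose local port matches, i.e. the gmetad side of each connection.

```shell
# Count TCP sockets per state for a given local port.
# Reads netstat-style lines on stdin so it also works on saved output.
count_states() {
    port="$1"
    awk -v port="$port" '
        # keep tcp lines whose local address ends in :PORT (or .PORT)
        $1 ~ /^tcp/ && $4 ~ ("[:.]" port "$") { counts[$6]++ }
        END { for (s in counts) print s, counts[s] }
    ' | sort
}

# Example: netstat -ant | count_states 8652
```

Logging this every minute or so would show whether CLOSE_WAIT sockets pile up before the port stops answering.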

If this is still happening, make sure you load the debug symbols for ganglia when you run gdb. That would help a lot in understanding where exactly in the gmetad code each thread has hung.
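Once the symbols are installed (the debug package name varies by distribution, so I am not naming one here), something along these lines should capture every thread's backtrace in one go. The function name `dump_threads` and the output file naming are just illustrative. The gdb options used are `-p`, `-batch`, and `-ex`.

```shell
# Dump backtraces for all threads of a running process into a file.
# Attaching usually requires root or the same user as the target process.
dump_threads() {
    pid="$1"
    out="${2:-gmetad-threads-$(date +%Y%m%d-%H%M%S).txt}"
    if [ -z "$pid" ]; then
        echo "usage: dump_threads PID [OUTFILE]" >&2
        return 2
    fi
    # -batch makes gdb run the command, detach, and exit
    gdb -p "$pid" -batch -ex "thread apply all bt" > "$out" 2>&1
}

# Example: dump_threads 2400
```

With symbols loaded, the gmetad frames should show real function names instead of all resolving to the same nearby exported symbol.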

I do not believe I have seen this in a while, at least since I installed debug symbols.