ggrandes/apache24-modules

httpd process is using 100% CPU when myfixip module is enabled

jackchen858 opened this issue · 25 comments

I am doing some testing with Amazon ELB's proxy protocol and Apache 2.4.9 + myfixip.

The module is working and I get the correct client IP in Apache's access log. But I noticed that after a while an httpd process starts consuming as much CPU as it can.

top - 14:09:48 up 1 day, 19:56, 2 users, load average: 1.05, 1.30, 1.46
Tasks: 80 total, 2 running, 78 sleeping, 0 stopped, 0 zombie
Cpu(s): 5.7%us, 14.4%sy, 0.0%ni, 79.9%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 3857672k total, 3747412k used, 110260k free, 113828k buffers
Swap: 0k total, 0k used, 0k free, 3517740k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
30539 wwwrun 20 0 119m 3492 1584 R 99 0.1 5:43.91 httpd2-prefork

strace -p 30539 shows it calling epoll_wait non-stop:

epoll_wait(27, {{EPOLLIN, {u32=2213386584, u64=139975197563224}}}, 2, 4294967295) = 1
epoll_wait(27, {{EPOLLIN, {u32=2213386584, u64=139975197563224}}}, 2, 4294967295) = 1
epoll_wait(27, {{EPOLLIN, {u32=2213386584, u64=139975197563224}}}, 2, 4294967295) = 1
epoll_wait(27, {{EPOLLIN, {u32=2213386584, u64=139975197563224}}}, 2, 4294967295) = 1

netstat shows
netstat -nap | grep tcp | grep 30539
tcp 0 0 172.31.4.10:54357 174.129.224.73:80 ESTABLISHED 30539/httpd2-prefor
tcp 0 0 172.31.4.10:81 172.31.6.247:60080 CLOSE_WAIT 30539/httpd2-prefor

172.31.6.247 is the ELB server with proxy protocol enabled;
172.31.4.10:81 is the Apache server configured to proxy WebSocket traffic to ws://echo.websocket.org (I am just testing a WebSocket proxy).

It looks like the Apache server didn't close the socket to the ELB.

I did a few tests:

  1. if myfixip is disabled and ELB proxy protocol is disabled, the problem doesn't happen;
  2. if myfixip is enabled but the server is accessed directly by the client (so no request with proxy protocol), the problem doesn't happen; I only see TIME_WAIT between the Apache server and my client;
  3. if myfixip is enabled, ELB proxy protocol is disabled, and I access the web server through the ELB (still no request with proxy protocol), the problem doesn't happen;
  4. the problem only happens when myfixip is enabled, ELB proxy protocol is enabled, and I access the web server through the ELB.

So it looks like it's related to the myfixip module?

The environment is Suse Linux with apache 2.4.9:

apache2-utils-2.4.9-1.2
apache2-prefork-2.4.9-1.2
apache2-event-2.4.9-1.2
apache2-worker-2.4.9-1.2
apache2-devel-2.4.9-1.2
apache2-2.4.9-1.2

Thank you very much for all the details. This will help me find the problem.

I roughly understand what caused the high CPU load now. The ELB is configured to use TCP instead of HTTP (because I need it to forward WebSocket traffic to my Apache server).

So I configured the ELB health check as a TCP ping on port 80 as well. Normally this doesn't seem to be an issue, but when myfixip is enabled the health-check TCP ping causes a 408 error like:

172.31.9.216 - - [24/Jun/2015:17:55:07 +0000] "-" 408 - "-" "-"

and a CPU spike on the web server.

The workaround is to change the health check method to HTTP.
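For context, a 408 logged with an empty request line ("-") is Apache itself timing out while waiting for a request to arrive on an open connection; in 2.4 that wait is governed by mod_reqtimeout. A hedged configuration sketch (the values shown are the ones documented as defaults, not a tuned recommendation):

```apache
# mod_reqtimeout (bundled with Apache 2.4) controls how long httpd waits
# for the request line/headers before logging a 408 and closing the
# connection; connections that never send any data (TCP health checks,
# ELB "pre-open" connections) hit this timeout.
<IfModule reqtimeout_module>
    RequestReadTimeout header=20-40,MinRate=500 body=20,MinRate=500
</IfModule>
```

The 408 itself is harmless; the bug report here is that the connection teardown after it spins the CPU when myfixip is loaded.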

This is very bizarre. After changing the health monitoring to HTTP, I still get 408 errors from the ELB.

With the myfixip module disabled:

172.31.22.22 - - [24/Jun/2015:18:30:17 +0000] "GET /maintenance.html HTTP/1.1" 200 506 "-" "ELB-HealthChecker/1.0"
172.31.22.22 - - [24/Jun/2015:18:31:12 +0000] "GET /maintenance.html HTTP/1.1" 200 506 "-" "ELB-HealthChecker/1.0"
172.31.22.22 - - [24/Jun/2015:18:32:07 +0000] "GET /maintenance.html HTTP/1.1" 200 506 "-" "ELB-HealthChecker/1.0"
172.31.22.22 - - [24/Jun/2015:18:33:02 +0000] "GET /maintenance.html HTTP/1.1" 200 506 "-" "ELB-HealthChecker/1.0"
172.31.22.22 - - [24/Jun/2015:18:33:57 +0000] "GET /maintenance.html HTTP/1.1" 200 506 "-" "ELB-HealthChecker/1.0"
172.31.22.22 - - [24/Jun/2015:18:34:52 +0000] "GET /maintenance.html HTTP/1.1" 200 506 "-" "ELB-HealthChecker/1.0"

The ELB sends a heartbeat every 55 seconds, as configured.

With the myfixip module enabled:
172.31.22.22 - - [24/Jun/2015:18:36:12 +0000] "-" 408 - "-" "-"
172.31.22.22 - - [24/Jun/2015:18:36:42 +0000] "GET /maintenance.html HTTP/1.1" 200 506 "-" "ELB-HealthChecker/1.0"
172.31.22.22 - - [24/Jun/2015:18:37:12 +0000] "-" 408 - "-" "-"
172.31.22.22 - - [24/Jun/2015:18:37:37 +0000] "GET /maintenance.html HTTP/1.1" 200 506 "-" "ELB-HealthChecker/1.0"
172.31.22.22 - - [24/Jun/2015:18:38:12 +0000] "-" 408 - "-" "-"
172.31.22.22 - - [24/Jun/2015:18:38:32 +0000] "GET /maintenance.html HTTP/1.1" 200 506 "-" "ELB-HealthChecker/1.0"
172.31.22.22 - - [24/Jun/2015:18:39:13 +0000] "-" 408 - "-" "-"
172.31.22.22 - - [24/Jun/2015:18:39:27 +0000] "GET /maintenance.html HTTP/1.1" 200 506 "-" "ELB-HealthChecker/1.0"

The mystery 408s show up every minute; they are not governed by the 55-second interval of the ELB health monitoring.

Running tcpdump, I can now see the ELB connecting to the web server every minute and then closing the connection without sending any data.

When myfixip is not enabled, httpd probably just ignores this connection; when myfixip is enabled, httpd generates a 408 error and the process uses 100% CPU. I'm working with AWS to find out why this odd traffic exists.

AWS support told me this is a "pre-open" connection from the ELB to the backend instance; it seems the ELB only does that for HTTP listeners.

I also found this to be a problem. I tested on an EC2 t2.small behind an ELB using an Amazon AMI and Apache/2.4.12, but when I thought everything was good to go and put it into production, the CPU jumped to 100% and stayed there on two c3.large instances that normally run at about 3-15% CPU usage. I've seen the 408 issue before and have read that it is related to the ELB keeping extra connections to the servers and not closing them, but I'm inclined to believe there is something else going on. Any thoughts?

Here is from 'top':

top - 15:35:26 up 27 min, 2 users, load average: 6.88, 4.30, 2.34
Tasks: 100 total, 8 running, 92 sleeping, 0 stopped, 0 zombie
Cpu(s): 99.8%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.2%si, 0.0%st
Mem: 3858728k total, 522608k used, 3336120k free, 19868k buffers
Swap: 0k total, 0k used, 0k free, 111088k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2741 apache 20 0 452m 26m 5768 R 30.6 0.7 1:19.92 httpd
2745 apache 20 0 441m 15m 5484 R 29.0 0.4 0:45.76 httpd
2740 apache 20 0 441m 15m 5728 R 28.6 0.4 1:17.16 httpd
2727 apache 20 0 450m 24m 5784 R 28.0 0.7 2:15.40 httpd
2734 apache 20 0 455m 29m 5680 R 28.0 0.8 1:30.39 httpd
2742 apache 20 0 451m 24m 5248 R 27.6 0.7 1:30.30 httpd
2737 apache 20 0 441m 16m 5356 R 27.3 0.4 2:31.34 httpd

I released a new version, 0.8. Can you check whether the bug is resolved?

I will try it first thing tomorrow morning! Thanks for the quick response 😄👍


The problem still exists. If I leave the ELB health check pointing at a port handled by Apache, the Apache processes reach 100% CPU usage more quickly. If I point it at another port not handled by Apache it takes more time, but eventually the same thing happens. Let me know if there is anything I can do to help resolve this issue.

@awelters About "if I point it at another port not handled by Apache"... are you saying that if you point the health check at port 1234 (or anything else) that another application (not Apache) is listening on, the CPU still reaches 100%?

Yes, if I point the ELB health check at another port that another application (not Apache) is listening on, the CPU reaches 100%. I open multiple tabs on multiple computers and visit the website, periodically refreshing some while others are polling. On the ELB, the idle timeout is set to 300 seconds and connection draining is set to 25 seconds. Any thoughts?

If you need any help testing let me know, you could create a branch, I could check it out, and then send you back the results.

^--- I would be happy to do the same. I have an environment setup and ready to test so if you are able to create a separate branch let me know so we can both provide results.

Thanks @ggrandes!

@awelters If the ELB points at another application (not Apache)... and mod_myfixip.c runs inside Apache... this seems like a poltergeist 👻
A) Are you sure that when the ELB health check points at the other port, no other ELB check connection is being made from the ELB to Apache?
B) If you disable the myfixip module in Apache and keep the ELB check pointing at the other application (not Apache), does the CPU still rise to 100%?

@niczak can you describe your environment?

Sorry for the confusion. The ELB load balances against ports 80/443, which Apache handles; the health checks can be done against other ports, which I did for testing. More importantly, when mod_myfixip is turned off the 100% CPU goes away. It is most definitely this module.


@awelters OK, no problem... I suspect the ELB is sending fragmented data in some cases, and the module does not expect partial sends of the proxy header line. I'm working on this.

Hi, new version 0.9 is released. Please check it and send me your feedback!

Hi guys! How is it going? Does version 0.9 solve the problem?
@awelters @niczak

Unfortunately I had to go a different route in order to meet deadlines, but this is still something I would like to revisit. I blew away my test environment but will definitely give this a go when time allows. @awelters, is your test env still up?

The test env is no longer up. I will try to retest in the coming weeks. Thanks @ggrandes for the effort on this; very important project!

OK, I have tested the new release in an ELB environment and everything works right. Thanks for all!

closed by [43e0ab4]