Memory not released
Zabrane opened this issue · 12 comments
Hi guys,
I'm facing the same issue using Hitch 1.7.0 on Ubuntu 20.04 LTS.
While stress testing (with vegeta) our backend app which sits behind Hitch, we noticed that Hitch's memory never gets released back to the system.
This is Hitch's memory usage before starting the benchmark (using ps_mem.py to track memory usage):
Private + Shared = RAM used Program
5.2 MiB + 1.7 MiB = 6.9 MiB hitch (10)
And this is Hitch's memory usage when the benchmark was done:
Private + Shared = RAM used Program
2.51 GiB + 192.1 MiB = 2.7 GiB hitch (10)
The memory has still not been released (24 hours later).
My config:
- Ubuntu 20.04 LTS
- Hitch 1.7.0
- OpenSSL 1.1.1f
- GCC 9.3.0
- Only one SSL certificate
Hi @Zabrane
Thanks for the report, I will take a look.
Could you share some details of the benchmark you ran? Is this a handshake-oriented or a throughput-oriented test? HTTP keep-alive? Number of clients/request rate?
Also, is there anything else special about your config? Could you perhaps share your hitch command line and hitch.conf?
Hi @daghf
Thanks for taking the time to look at this.
Here are the steps to reproduce the issue:
- Install Express to run the NodeJS backend sample server (file srv.js.zip):
$ unzip -a srv.js.zip
$ npm install express
$ node srv.js
::: listening on http://localhost:7200/
- Use the latest Hitch 1.7.0 with the following hitch.conf (point pem-file to yours).
We were able to reproduce this memory issue from version 1.5.0 to 1.7.0.
## Listening
frontend = "[0.0.0.0]:8443"
## https://ssl-config.mozilla.org/
ciphers = "ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384"
tls-protos = TLSv1.2
## TLS for HTTP/2 traffic
alpn-protos = "http/1.1"
## Send traffic to the backend without the PROXY protocol
backend = "[127.0.0.1]:7200"
write-proxy-v1 = off
write-proxy-v2 = off
write-ip = off
## List of PEM files, each with key, certificates and dhparams
pem-file = "hitch.pem"
## set it to number of cores
workers = 10
backlog = 1024
keepalive = 30
## Logging / Verbosity
quiet = on
log-filename = "/dev/null"
## Automatic OCSP staple retrieval
ocsp-verify-staple = off
ocsp-dir = ""
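Side note, not in the original steps: assuming your hitch build has the --test flag, you can sanity-check the configuration file before starting it:
$ hitch --test --config=./hitch.conf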
Then, run it:
$ hitch -V
hitch 1.7.0
$ hitch --config=./hitch.conf
- Check if the pieces are successfully connected:
$ curl -k -D- -q -sS "https://localhost:8443/" --output /dev/null
HTTP/1.1 200 OK
X-Powered-By: Express
Content-Type: application/json; charset=utf-8
Content-Length: 6604
Date: Tue, 22 Dec 2020 12:01:33 GMT
Connection: keep-alive
Finally, run it like this:
$ echo "GET https://localhost:8443/" | vegeta attack -insecure -header 'Connection: keep-alive' -timeout=2s -rate=1000 -duration=1m | vegeta encode | vegeta report
Requests [total, rate, throughput] 60000, 1000.02, 1000.02
Duration [total, attack, wait] 59.999s, 59.999s, 219.979µs
Latencies [min, mean, 50, 90, 95, 99, max] 165.935µs, 262.688µs, 230.6µs, 333.352µs, 375.975µs, 502.351µs, 16.373ms
Bytes In [total, mean] 396240000, 6604.00
Bytes Out [total, mean] 0, 0.00
Success [ratio] 100.00%
Status Codes [code:count] 200:60000
Error Set:
During the stress test with vegeta, check hitch's memory usage (top, htop or ps_mem):
$ sudo su
root$ ps_mem.py -p `pgrep -d, hitch | sed -e 's|,$||'`
root$ watch -n 3 "ps_mem.py -p `pgrep -d, hitch | sed -e 's|,$||'`"
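Note that in the watch command above, the backtick substitution is expanded once by the invoking shell, so the PID list is fixed when watch starts; that is fine as long as the hitch processes are not restarted. If ps_mem.py is not available, plain ps gives a rough per-process RSS/VSZ view (not the original tooling, just a cross-check):
root$ watch -n 3 'ps -C hitch -o pid,rss,vsz,args'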
You can set vegeta's -duration option to a larger value (e.g. 15m) to see the memory effect on Hitch.
Please let me know if you need anything else.
NOTE: on macOS, top shows that hitch 1.7.0 uses only 2 workers despite workers being set to 10.
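Not part of the original report, just a cross-check: a portable way to double-check the worker count independently of top is to count the hitch processes directly. This counts every running hitch process, not just the workers.
$ pgrep hitch | wc -l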
Hi @Zabrane
I haven't had any luck in reproducing this.
Even after setting up something identical to yours (Ubuntu 20.04, gcc 9.3, openssl 1.1.1f) and running vegeta with your Express server as the backend, I still did not see memory usage creep much above 50M.
I did find a few small, inconsequential memory leaks related to a config file update, which I fixed in a commit just pushed. However, these are not the kind of leaks that would cause memory usage to grow with traffic or running time.
@daghf thanks for your time looking at this issue.
We are still seeing this behaviour in 2 different products behind hitch. It's a bit sad you weren't able to reproduce it.
One last question before I close this issue, if you don't mind: if the backend server decides to close the connection after servicing some requests, will hitch reopen it immediately?
Or will it wait until a new client connection is established?
Thanks
Have the same problem here; hitch was taking up to 24GB of RAM until it was killed (Out of memory: Kill process # (hitch) score 111 or sacrifice child).
This only seems to have started happening after our latest update to 1.7.0. Not sure what version we were running before.
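(Not from the original report: the complete OOM-killer record for a case like this can usually be pulled from the kernel log.)
$ dmesg -T | grep -i -A 10 'out of memory'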
@robinbohnen thanks for confirming the issue. We still suffer from the memory problem, and the current workaround is to manually kill/restart hitch (yes, a hack with the bad consequence of losing connections).
We are considering switching to stunnel 5.58, haproxy 2.3 or envoy 1.17.
Caveat: the stunnel link is an old blog post against stud (hitch's ancestor), but we were able to reproduce those numbers (even better ones) as of today.
@Zabrane, since we are having trouble reproducing the issue, could you try sharing a docker-compose or vagrant file so we can look at it locally? Is there anything special about your certificates (large numbers, lots of intermediate CAs, complete options, etc.)?
@gquintard we use 1 certificate and 1 CA as explained above. Unfortunately, we don't rely on Docker for our services. It took us 6 weeks to be able to report the issue here (getting approval from the business - we work for a private bank).
@robinbohnen could you please shed more light on your config?
We have about 3500 Let's Encrypt certificates served by Hitch; we don't use Docker either.
I think what @gquintard was asking is rather, can you reproduce this behavior in a docker or vagrant (or maybe other) setup that we could duplicate on our end to try to observe it as well?
FWIW, we observed something similar. In our case we had 300-500K concurrent connections; when the connection count dropped, RSS continued to increase until stabilizing around 90GB.
After trying a variety of adjustments, we ended up loading jemalloc via LD_PRELOAD. With that change, RSS became much more correlated with the number of connections (26-44GB).
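For reference, a preload along those lines could look roughly like this on Ubuntu; the package name and library path are assumptions and may differ per distribution/version:
$ sudo apt install libjemalloc2
$ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 hitch --config=./hitch.conf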
I don't have a firm explanation, but it does remind me a bit of this post where it's theorized that libc malloc's excess memory usage comes from fragmentation caused by multithreading. I'm not sure whether that would apply to hitch.
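A cheap way to probe that fragmentation theory (not something tried in this thread) is to cap the number of glibc malloc arenas and rerun the benchmark; if RSS growth drops noticeably, arena fragmentation is a likely contributor:
$ MALLOC_ARENA_MAX=2 hitch --config=./hitch.conf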