Memory not released
Zabrane opened this issue · 12 comments
Hi guys,
I'm facing the same issue using Hitch 1.7.0 on Ubuntu 20.04 LTS.
While stress testing (with vegeta) our backend app which sits behind Hitch, we noticed that Hitch's memory never gets released back to the system.
This is Hitch's memory usage before starting the benchmark (using ps_mem.py to track memory usage):
Private + Shared = RAM used Program
5.2 MiB + 1.7 MiB = 6.9 MiB hitch (10)
And this is Hitch's memory usage when the benchmark was done:
Private + Shared = RAM used Program
2.51 GiB + 192.1 MiB = 2.7 GiB hitch (10)
The memory has still not been released (24 hours later).
My config:
- Ubuntu 20.04 LTS
- Hitch 1.7.0
- OpenSSL 1.1.1f
- GCC 9.3.0
- Only one SSL certificate
Hi @Zabrane
Thanks for the report, I will take a look.
Could you share some details of the benchmark you ran? Is this a handshake-oriented or a throughput-oriented test? HTTP keep-alive? Number of clients/request rate?
Also, is there anything else special about your config? Could you perhaps share your hitch command line and hitch.conf?
Hi @daghf
Thanks for taking the time to look at this.
Here are the steps to reproduce the issue:
- Install Express to run the NodeJS backend sample server (file srv.js.zip):
$ unzip -a srv.js.zip
$ npm install express
$ node srv.js
::: listening on http://localhost:7200/
- Use the latest Hitch 1.7.0 with the following hitch.conf (point pem-file to yours).
We were able to reproduce this memory issue from version 1.5.0 to 1.7.0.
## Listening
frontend = "[0.0.0.0]:8443"
## https://ssl-config.mozilla.org/
ciphers = "ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384"
tls-protos = TLSv1.2
## TLS for HTTP/2 traffic
alpn-protos = "http/1.1"
## Send traffic to the backend without the PROXY protocol
backend = "[127.0.0.1]:7200"
write-proxy-v1 = off
write-proxy-v2 = off
write-ip = off
## List of PEM files, each with key, certificates and dhparams
pem-file = "hitch.pem"
## set it to number of cores
workers = 10
backlog = 1024
keepalive = 30
## Logging / Verbosity
quiet = on
log-filename = "/dev/null"
## Automatic OCSP staple retrieval
ocsp-verify-staple = off
ocsp-dir = ""
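Side note, not in the original steps: assuming your hitch build has the --test flag, you can sanity-check the configuration file before starting it:
$ hitch --test --config=./hitch.conf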
Then, run it:
$ hitch -V
hitch 1.7.0
$ hitch --config=./hitch.conf
- Check if the pieces are successfully connected:
$ curl -k -D- -q -sS "https://localhost:8443/" --output /dev/null
HTTP/1.1 200 OK
X-Powered-By: Express
Content-Type: application/json; charset=utf-8
Content-Length: 6604
Date: Tue, 22 Dec 2020 12:01:33 GMT
Connection: keep-alive
Finally, run it like this:
$ echo "GET https://localhost:8443/" | vegeta attack -insecure -header 'Connection: keep-alive' -timeout=2s -rate=1000 -duration=1m | vegeta encode | vegeta report
Requests [total, rate, throughput] 60000, 1000.02, 1000.02
Duration [total, attack, wait] 59.999s, 59.999s, 219.979µs
Latencies [min, mean, 50, 90, 95, 99, max] 165.935µs, 262.688µs, 230.6µs, 333.352µs, 375.975µs, 502.351µs, 16.373ms
Bytes In [total, mean] 396240000, 6604.00
Bytes Out [total, mean] 0, 0.00
Success [ratio] 100.00%
Status Codes [code:count] 200:60000
Error Set:
During the stress test with vegeta, check hitch's memory usage (top, htop or ps_mem):
$ sudo su
root$ ps_mem.py -p `pgrep -d, hitch | sed -e 's|,$||'`
root$ watch -n 3 "ps_mem.py -p `pgrep -d, hitch | sed -e 's|,$||'`"
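Note that in the watch command above, the backtick substitution is expanded once by the invoking shell, so the PID list is fixed when watch starts; that is fine as long as the hitch processes are not restarted. If ps_mem.py is not available, plain ps gives a rough per-process RSS/VSZ view (not the original tooling, just a cross-check):
root$ watch -n 3 'ps -C hitch -o pid,rss,vsz,args'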
You can set vegeta's -duration option to a larger value (e.g. 15m) to see the memory effect on Hitch.
Please let me know if you need anything else.
NOTE: on macOS, top shows that hitch 1.7.0 uses only 2 workers despite workers being set to 10.
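Not part of the original report, just a cross-check: a portable way to double-check the worker count independently of top is to count the hitch processes directly. This counts every running hitch process, not just the workers.
$ pgrep hitch | wc -l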
Hi @Zabrane
I haven't had any luck in reproducing this.
Even after setting up something identical to yours (Ubuntu 20.04, gcc 9.3, openssl 1.1.1f) and running vegeta with your Express server as the backend, I still did not see memory usage creep much above 50M.
I did find a few small, inconsequential memory leaks related to a config file update, which I fixed in a commit just pushed. However, these are not the kind of leaks that would cause memory usage to grow with traffic or running time.
@daghf thanks for your time looking at this issue.
We are still seeing this behaviour in 2 different products behind hitch. It's a bit sad you weren't able to reproduce it.
One last question before I close this issue, if you don't mind: if the backend server decides to close the connection after servicing some requests, will hitch reopen it immediately?
Or will it wait until a new client connection is established?
Thanks
Have the same problem here; hitch was taking up to 24GB of RAM until it was killed (Out of memory: Kill process # (hitch) score 111 or sacrifice child).
This only seems to have started happening after our latest update to 1.7.0. Not sure what version we were running before.
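(Not from the original report: the complete OOM-killer record for a case like this can usually be pulled from the kernel log.)
$ dmesg -T | grep -i -A 10 'out of memory'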
@robinbohnen thanks for confirming the issue. We still suffer from the memory problem, and the current workaround is to manually kill/restart hitch (yes, a hack with the bad consequence of losing connections).
We are considering switching to stunnel 5.58, haproxy 2.3 or envoy 1.17.
Caveat: the stunnel link is an old blog post against stud (hitch's ancestor), but we were able to reproduce those numbers (even better ones) as of today.
@Zabrane, since we are having trouble reproducing the issue, could you try sharing a docker-compose or vagrant file so we can look at it locally? Is there anything special about your certificates (large numbers, lots of intermediate CAs, complete options, etc.)?
@gquintard we use 1 certificate and 1 CA as explained above. Unfortunately, we don't rely on Docker for our services. It took us 6 weeks to be able to report the issue here (getting approval from the business - we work for a private bank).
@robinbohnen could you please shed more light on your config?
We have about 3500 Let's Encrypt certificates served by Hitch; we don't use Docker either.
I think what @gquintard was asking is rather, can you reproduce this behavior in a docker or vagrant (or maybe other) setup that we could duplicate on our end to try to observe it as well?
FWIW, we observed something similar. In our case we had 300-500K concurrent connections; when the connection count dropped, RSS continued to increase until stabilizing around 90GB.
After trying a variety of adjustments, we ended up loading jemalloc via LD_PRELOAD. With that change, RSS became much more correlated with the number of connections (26-44GB).
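For reference, a preload along those lines could look roughly like this on Ubuntu; the package name and library path are assumptions and may differ per distribution/version:
$ sudo apt install libjemalloc2
$ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 hitch --config=./hitch.conf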
I don't have a firm explanation, but it does remind me a bit of this post where it's theorized that libc malloc's excess memory usage comes from fragmentation caused by multithreading. I'm not sure whether that would apply to hitch.
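A cheap way to probe that fragmentation theory (not something tried in this thread) is to cap the number of glibc malloc arenas and rerun the benchmark; if RSS growth drops noticeably, arena fragmentation is a likely contributor:
$ MALLOC_ARENA_MAX=2 hitch --config=./hitch.conf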