TCP handle leakage for "stale" connections
shsmith opened this issue · 31 comments
I have noticed that over time the number of IP handles reported by lsof continues to grow and far exceeds the number of connections reported by the sessions RPC.
One IP in particular appears over 1000 times, even though there is no active session for the IP.
Here are the logs from 2 days ago when this IP briefly connected:
INFO:ElectrumX:[21536] SSL 41.144.95.105:52782, 666 total
INFO:ServerManager:closing stale connections [19859, 21536, 20904, 21120, 6123, 21278, 21097, 21124, 21481]
This IP does not appear at all after that.
Possibly related: the "subs" value in the getinfo RPC is much higher than the sum of subs for the sessions RPC. I suspect these are related to the stuck TCP handles.
Live Munin charts: http://vps.hsmiths.com:49001/munin/electrumx-day.html
A snapshot from near block 446220:
full lsof output: http://vps.hsmiths.com:49001/lsof.txt
getinfo output: http://vps.hsmiths.com:49001/getinfo
sessions output: http://vps.hsmiths.com:49001/sessions
I just noticed this having gotten back from a 10-day break. My subs are over-counted, and my groups RPC lists way more sessions than exist.
As this was a bug, I've done a backport to 0.9.23. I don't anticipate backporting anything else to 0.9.x.
I think the leak is still happening.
http://h.hsmiths.com:49001/munin/localdomain/localhost.localdomain/electrumx_lsof-day.png
Using this script I can see that there are currently over 500 IPv4 connections that do not show up in the sessions RPC:
#!/bin/sh
# report lsof output for electrumx, omitting known sessions (looking for leakage)
# get the electrumx pid from the getinfo RPC
pid=`electrumx_rpc.py getinfo | grep pid | grep -o '[0-9]*'`
# get the list of session IP:PORT pairs and join them with " -e " so each one
# becomes a grep exclusion pattern (the trailing " -e " pairs with LISTEN below)
iplist=`electrumx_rpc.py sessions | grep -v Peer | sed -E "s/\s+/ /g" | cut -d" " -f 12 | tr '\n' ' ' | sed 's/ / -e /g'`
# list open IPv* handles, omitting known session addresses and listening sockets
lsof -n -p $pid | grep IPv | grep -v -e $iplist LISTEN
This is from the raspberry pi, which I restarted this morning.
http://h.hsmiths.com:49004/munin/localdomain/localhost.localdomain/electrumx_lsof-day.png
You can see that for the first hour or so the open IPv count tracked sessions, but after that IPv got ahead of sessions. Looks like some sessions are ending but not actually closing the socket.
I searched my logs for the leaked addresses reported by the script and found very few hits.
I suspect these are not electrum connections, but something else scanning the port and getting stuck.
As a test, I did a simple telnet to port 50002 on one server and telnet just sat there with a blank screen. ElectrumX did not log this as a session.
I waited over 10 minutes and the telnet session was still open and waiting for input.
Maybe we need some sort of initial handshake timeout.
If a connection is accepted but no valid SSL handshake and/or no RPC commands are received in x seconds, the socket should be closed.
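A minimal sketch of the "no commands received" half of that idea, assuming a plain asyncio TCP server and a made-up HANDSHAKE_TIMEOUT constant; it would not catch a stalled SSL handshake, since for SSL ports connection_made only fires after the handshake completes:

import asyncio

HANDSHAKE_TIMEOUT = 30  # hypothetical: seconds to wait for the first client data

class TimeoutProtocol(asyncio.Protocol):
    def connection_made(self, transport):
        self.transport = transport
        # Abort the socket if the client never sends anything in time.
        self.timeout_handle = asyncio.get_event_loop().call_later(
            HANDSHAKE_TIMEOUT, transport.abort)

    def data_received(self, data):
        # First data arrived in time; cancel the pending abort.
        self.timeout_handle.cancel()
        # ... normal request handling continues here ...

    def connection_lost(self, exc):
        self.timeout_handle.cancel()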
I cannot see any way asyncio lets one control this. A plain TCP connection goes straight to connection_made on creation (and is therefore logged), but with SSL wrapped around it, connection_made isn't called until the handshake completes. The Protocol object is constructed when the initial socket connection is made, but no details of that socket are available to my application until the handshake completes. If it never does, the connection seems to just sit there forever.
I wonder if something like this would work:
socket.setdefaulttimeout(env.session_timeout)
I thought it was a module-level setting in the socket module that applies to all newly created sockets.
shsmith@c84f68c didn't help because asyncio is using nonblocking sockets.
I suspect the stuck sockets are looping here:
https://github.com/python/asyncio/blob/374930d766c347f126c88d61a75c17a7dd3a2450/asyncio/sslproto.py#L201
This code block does not have any kind of timeout, but one could be added to make use of self._handshake_start_time.
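To illustrate the idea only (not the fix that eventually landed upstream), here is a rough monkey-patch sketch against the 3.5/3.6 sslproto internals; the _in_handshake, _transport and _loop attributes are asyncio internals, and the 10-second value is arbitrary:

import asyncio
from asyncio import sslproto

_orig_start_handshake = sslproto.SSLProtocol._start_handshake

def _start_handshake(self):
    _orig_start_handshake(self)

    def _check_handshake_timeout():
        # If the SSL handshake still hasn't completed, drop the socket
        # instead of letting it sit there forever.
        if self._in_handshake:
            self._transport.abort()

    self._loop.call_later(10.0, _check_handshake_timeout)

sslproto.SSLProtocol._start_handshake = _start_handshake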
I noticed there was a commit late last year to asyncio that might help this - if you have 3.6 installed can you see if the issue still exists there?
I am still seeing the leakage on both my systems running Python 3.6.
For example: http://imgur.com/a/4e75b
After about 10:00 you can see that the number of ipv4 handles advances beyond the number of sessions.
This is similar to the leakage pattern I saw with Python 3.5.2 on the same machine.
Do you have the one line from python/asyncio@d84a8cb in sslproto.py? I can't tell whether that made it into 3.6 or not.
Yes, I found that line in /usr/lib/python3.6/asyncio/sslproto.py.
While browsing for that bit of code I found that various debugging messages can be enabled by setting the PYTHONASYNCIODEBUG environment variable. I'm running with that set and will check back later to see if it provided any clues.
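For anyone else trying this, the same debug mode can also be turned on from code instead of the environment (standard asyncio API, nothing ElectrumX-specific):

import asyncio

# Roughly equivalent to running the server with PYTHONASYNCIODEBUG=1 set.
loop = asyncio.get_event_loop()
loop.set_debug(True)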
Shame, I was hoping that was it.
Be prepared for very verbose output!
I'm hitting the same problem running an ElectrumX 1.0 server. Unfortunately one client seems to use this to intentionally exhaust server resources.
I see the cause seems to be bug python/asyncio#483.
But I wonder if there shouldn't be a general limit on accepted connections per peer IP? That way, when three or four connections are already open, any additional connection attempt would be rejected before yet another socket is opened.
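A rough sketch of what such a per-IP cap could look like at the asyncio protocol level; MAX_PER_IP and the counter dict are made up for illustration, and for SSL ports this still only triggers after the handshake completes, so it would not catch the stuck-handshake sockets:

import asyncio
from collections import defaultdict

MAX_PER_IP = 4                  # hypothetical per-peer connection cap
open_conns = defaultdict(int)   # peer IP -> number of open connections

class LimitingProtocol(asyncio.Protocol):
    def connection_made(self, transport):
        self.ip = transport.get_extra_info('peername')[0]
        open_conns[self.ip] += 1
        if open_conns[self.ip] > MAX_PER_IP:
            # Too many connections from this peer; refuse this one.
            transport.abort()

    def connection_lost(self, exc):
        open_conns[self.ip] -= 1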
I applied this patch and leakage is much reduced. The patch works on both Python 3.5.2 and 3.6.
python/cpython#480
After a longer run time the leak still exists with python/cpython#480 applied.
PR 480 may help, but it does not resolve the majority of the leakage.
http://electrum.hsmiths.com:49001/munin/hsmiths.com/electrum.hsmiths.com/electrumx_lsof-week.png
http://vps.hsmiths.com:49001/munin/hsmiths.com/vps.hsmiths.com/electrumx_lsof-week.png
When I applied the patch I added a warning so I could be sure the patch was taking effect.
The warning is frequently logged. Here is an example:
Mar 31 07:37:08 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9844] SSL 38.81.65.42:59465, 569 total
Mar 31 07:37:08 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9844] cannot decode JSON: Expecting value: line 1 column 1 (char 0)
Mar 31 07:37:08 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9845] SSL 69.63.165.252:50987, 570 total
Mar 31 07:37:23 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9846] SSL 92.30.192.55:53631, 564 total
Mar 31 07:37:28 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9847] SSL 209.222.77.220:36122, 563 total
Mar 31 07:37:29 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9848] SSL 157.192.81.193:49231, 562 total
Mar 31 07:37:31 electrumd electrumx_server.py[1186]: WARNING:asyncio:pr480: abort socket via start_shutdown if in_handshake
Mar 31 07:37:33 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9849] SSL 209.222.77.220:44942, 563 total
Mar 31 07:37:33 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9850] SSL 62.250.99.225:51011, 564 total
Mar 31 07:37:37 electrumd electrumx_server.py[1186]: WARNING:asyncio:pr480: abort socket via start_shutdown if in_handshake
Mar 31 07:37:44 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9851] SSL 37.187.100.2:38500, 563 total
Mar 31 07:37:46 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9852] SSL 62.210.246.163:43342, 564 total
Mar 31 07:37:47 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9853] SSL 87.182.127.196:56363, 564 total
Mar 31 07:37:48 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9854] SSL 62.210.246.163:43914, 565 total
Has there been any further progress? At the moment lsof is showing hundreds of connections from one IP that never appears in the log at all. The previous instance exhausted its 10,000 file handles presumably for the same reason. Are regular restarts the only solution?
Until it is fixed upstream in Python, yes, the only solution is to restart.
To mitigate the issue, you can use iptables to restrict each IP to N connections (I use 8):
iptables -A INPUT -p tcp --syn --dport 50002 -m connlimit --connlimit-above 8 -j REJECT
Also, as a super ugly hack, you can periodically use lsof to compare open and established connections with electrumx_rpc sessions, then use gdb to force those spare connections closed. It's not guaranteed to be side-effect-free, but I've been using it a while and it seems helpful. Here's the PHP script I wrote to do that (I told you it was super ugly..): https://gist.github.com/erorus/57f63ff486413746915127cdfa30dc9e
Thanks @erorus, I have that iptables rule in place now and it's helping keep things under control.
I know it needs to be fixed in Python, but is there a Python bug report active at the moment? I know that python/asyncio#483 was opened by @kyuupichan but that hasn't moved since the proposed fix (python/cpython#480), and from @shsmith's comments above the problem persists after that is applied. In fact the asyncio repo containing that issue has been closed since March, so it seems unlikely to ever be revisited - new asyncio bugs are meant to be raised at https://bugs.python.org.
https://bugs.python.org/issue29970
It's not getting much attention.
Thanks, I've suggested a fix on that issue.
A monkey patch of that idea in case anyone wants to try it: https://gist.github.com/mocmocamoc/8544c288a486e95514fc982a7a0e8620
@mocmocamoc where does this patch go? In one of the electrumx files? Could you post a diff of the patched file? Thanks!
@mocmocamoc I found your notes here: https://bugs.python.org/issue29970 and was able to apply the monkeypatch to /usr/lib/python3.6/asyncio/sslproto.py.
I will report back test results in a few days.
Good news! The leakage is completely stopped with this insertion: https://github.com/python/cpython/pull/4825/files#diff-0827a8b032e7f279fa8f66eee271f6ceR559
Thanks @mocmocamoc! This should be merged upstream.
I can also concur that merging PR #4825 fixed the handle leakage. It is an issue independent of ElectrumX, but I feel there should be a mention of the issue, and of the solution, in the docs.
Once it is fixed upstream I will make that version of Python the recommended minimum.
python/cpython#4825 has been merged so this should be fixed as of Python 3.7.0a4.
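If I'm reading the merged change correctly, 3.7 also exposes this as an ssl_handshake_timeout argument on the event-loop APIs, so a server can tighten the limit itself, roughly like this (certificate paths are placeholders):

import asyncio
import ssl

async def serve():
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    ctx.load_cert_chain('server.crt', 'server.key')  # placeholder paths
    loop = asyncio.get_event_loop()
    # Python 3.7+: sockets that don't finish the TLS handshake within
    # ssl_handshake_timeout seconds are aborted instead of leaking.
    return await loop.create_server(asyncio.Protocol, port=50002, ssl=ctx,
                                    ssl_handshake_timeout=10)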
As this will be fixed in Python 3.7 I will close this issue.