TCP handle leakage for "stale" connections
shsmith opened this issue · 31 comments
I have noticed that over time the number of IP handles reported by lsof continues to grow and far exceeds the number of connections reported by the sessions RPC.
One IP in particular appears over 1000 times, even though there is no active session for the IP.
Here are the logs from 2 days ago when this IP briefly connected:
INFO:ElectrumX:[21536] SSL 41.144.95.105:52782, 666 total
INFO:ServerManager:closing stale connections [19859, 21536, 20904, 21120, 6123, 21278, 21097, 21124, 21481]
This IP does not appear at all after that.
Possibly related: the "subs" value in the getinfo RPC is much higher than the sum of subs for the sessions RPC. I suspect these are related to the stuck TCP handles.
Live Munin charts: http://vps.hsmiths.com:49001/munin/electrumx-day.html
A snapshot from near block 446220:
full lsof output: http://vps.hsmiths.com:49001/lsof.txt
getinfo output: http://vps.hsmiths.com:49001/getinfo
sessions output: http://vps.hsmiths.com:49001/sessions
I just noticed this having gotten back from a 10-day break. My subs are over-counted, and my groups RPC lists way more sessions than exist.
As this was a bug, I've done a backport to 0.9.23. I don't anticipate backporting anything else to 0.9.x.
I think the leak is still happening.
http://h.hsmiths.com:49001/munin/localdomain/localhost.localdomain/electrumx_lsof-day.png
Using this script I can see that there are currently over 500 IPv4 connections that do not show up in the sessions RPC:
#!/bin/sh
# report lsof output for electrumx, omitting known sessions (looking for leakage)
# get the electrumx pid from the getinfo RPC
pid=`electrumx_rpc.py getinfo | grep pid | grep -o '[0-9]*'`
# get the list of session IP:PORT pairs and join them with " -e " so each one
# becomes a grep exclusion pattern (the trailing " -e " pairs with LISTEN below)
iplist=`electrumx_rpc.py sessions | grep -v Peer | sed -E "s/\s+/ /g" | cut -d" " -f 12 | tr '\n' ' ' | sed 's/ / -e /g'`
# list open IPv* handles, omitting known session addresses and listening sockets
lsof -n -p $pid | grep IPv | grep -v -e $iplist LISTEN
This is from the raspberry pi, which I restarted this morning.
http://h.hsmiths.com:49004/munin/localdomain/localhost.localdomain/electrumx_lsof-day.png
You can see that for the first hour or so the open IPv count tracked sessions, but after that IPv got ahead of sessions. Looks like some sessions are ending but not actually closing the socket.
I searched my logs for the leaked addresses reported by the script and found very few hits.
I suspect these are not electrum connections, but something else scanning the port and getting stuck.
As a test, I did a simple telnet to port 50002 on one server and telnet just sat there with a blank screen. ElectrumX did not log this as a session.
I waited over 10 minutes and the telnet session was still open and waiting for input.
Maybe we need some sort of initial handshake timeout.
If a connection is accepted but no valid SSL handshake and/or no RPC commands are received in x seconds, the socket should be closed.
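A minimal sketch of the "no commands received" half of that idea, assuming a plain asyncio TCP server and a made-up HANDSHAKE_TIMEOUT constant; it would not catch a stalled SSL handshake, since for SSL ports connection_made only fires after the handshake completes:

import asyncio

HANDSHAKE_TIMEOUT = 30  # hypothetical: seconds to wait for the first client data

class TimeoutProtocol(asyncio.Protocol):
    def connection_made(self, transport):
        self.transport = transport
        # Abort the socket if the client never sends anything in time.
        self.timeout_handle = asyncio.get_event_loop().call_later(
            HANDSHAKE_TIMEOUT, transport.abort)

    def data_received(self, data):
        # First data arrived in time; cancel the pending abort.
        self.timeout_handle.cancel()
        # ... normal request handling continues here ...

    def connection_lost(self, exc):
        self.timeout_handle.cancel()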
I cannot see any way asyncio lets one control this. A plain TCP connection goes straight to connection_made on creation (and is therefore logged), but with SSL wrapped around it, connection_made isn't called until the handshake completes. The Protocol object is constructed when the initial socket connection is made, but no details of that socket are available to my application until the handshake completes. If it never does, the connection seems to just sit there forever.
I wonder if something like this would work:
socket.setdefaulttimeout(env.session_timeout)
I thought it was a module-level setting in the socket module that applies to all newly created sockets.
shsmith@c84f68c didn't help because asyncio is using nonblocking sockets.
I suspect the stuck sockets are looping here:
https://github.com/python/asyncio/blob/374930d766c347f126c88d61a75c17a7dd3a2450/asyncio/sslproto.py#L201
This code block does not have any kind of timeout, but one could be added to make use of self._handshake_start_time.
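To illustrate the idea only (not the fix that eventually landed upstream), here is a rough monkey-patch sketch against the 3.5/3.6 sslproto internals; the _in_handshake, _transport and _loop attributes are asyncio internals, and the 10-second value is arbitrary:

import asyncio
from asyncio import sslproto

_orig_start_handshake = sslproto.SSLProtocol._start_handshake

def _start_handshake(self):
    _orig_start_handshake(self)

    def _check_handshake_timeout():
        # If the SSL handshake still hasn't completed, drop the socket
        # instead of letting it sit there forever.
        if self._in_handshake:
            self._transport.abort()

    self._loop.call_later(10.0, _check_handshake_timeout)

sslproto.SSLProtocol._start_handshake = _start_handshake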
I noticed there was a commit late last year to asyncio that might help this - if you have 3.6 installed can you see if the issue still exists there?
I am still seeing the leakage on both my systems running Python 3.6.
For example: http://imgur.com/a/4e75b
After about 10:00 you can see that the number of ipv4 handles advances beyond the number of sessions.
This is similar to the leakage pattern I saw with Python 3.5.2 on the same machine.
Do you have the one line from python/asyncio@d84a8cb in sslproto.py? I can't tell whether that made it into 3.6 or not.
Yes, I found that line in /usr/lib/python3.6/asyncio/sslproto.py.
While browsing for that bit of code I found that various debugging messages can be enabled by setting the PYTHONASYNCIODEBUG environment variable. I'm running with that set and will check back later to see if it provided any clues.
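For anyone else trying this, the same debug mode can also be turned on from code instead of the environment (standard asyncio API, nothing ElectrumX-specific):

import asyncio

# Roughly equivalent to running the server with PYTHONASYNCIODEBUG=1 set.
loop = asyncio.get_event_loop()
loop.set_debug(True)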
Shame, I was hoping that was it.
Be prepared for very verbose output!
I'm hitting the same problem running an ElectrumX 1.0 server. Unfortunately one client seems to use this to intentionally exhaust server resources.
I see the cause seems to be bug python/asyncio#483.
But I wonder if there shouldn't be a general limit on accepted connections per peer IP? That way, when three or four connections are already open, any additional connection attempt would be rejected before yet another socket is opened.
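A rough sketch of what such a per-IP cap could look like at the asyncio protocol level; MAX_PER_IP and the counter dict are made up for illustration, and for SSL ports this still only triggers after the handshake completes, so it would not catch the stuck-handshake sockets:

import asyncio
from collections import defaultdict

MAX_PER_IP = 4                  # hypothetical per-peer connection cap
open_conns = defaultdict(int)   # peer IP -> number of open connections

class LimitingProtocol(asyncio.Protocol):
    def connection_made(self, transport):
        self.ip = transport.get_extra_info('peername')[0]
        open_conns[self.ip] += 1
        if open_conns[self.ip] > MAX_PER_IP:
            # Too many connections from this peer; refuse this one.
            transport.abort()

    def connection_lost(self, exc):
        open_conns[self.ip] -= 1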
I applied this patch and leakage is much reduced. The patch works on both Python 3.5.2 and 3.6.
python/cpython#480
After a longer run time the leak still exists with python/cpython#480 applied.
PR 480 may help, but it does not resolve the majority of the leakage.
http://electrum.hsmiths.com:49001/munin/hsmiths.com/electrum.hsmiths.com/electrumx_lsof-week.png
http://vps.hsmiths.com:49001/munin/hsmiths.com/vps.hsmiths.com/electrumx_lsof-week.png
When I applied the patch I added a warning so I could be sure the patch was taking effect.
The warning is frequently logged. Here is an example:
Mar 31 07:37:08 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9844] SSL 38.81.65.42:59465, 569 total
Mar 31 07:37:08 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9844] cannot decode JSON: Expecting value: line 1 column 1 (char 0)
Mar 31 07:37:08 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9845] SSL 69.63.165.252:50987, 570 total
Mar 31 07:37:23 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9846] SSL 92.30.192.55:53631, 564 total
Mar 31 07:37:28 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9847] SSL 209.222.77.220:36122, 563 total
Mar 31 07:37:29 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9848] SSL 157.192.81.193:49231, 562 total
Mar 31 07:37:31 electrumd electrumx_server.py[1186]: WARNING:asyncio:pr480: abort socket via start_shutdown if in_handshake
Mar 31 07:37:33 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9849] SSL 209.222.77.220:44942, 563 total
Mar 31 07:37:33 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9850] SSL 62.250.99.225:51011, 564 total
Mar 31 07:37:37 electrumd electrumx_server.py[1186]: WARNING:asyncio:pr480: abort socket via start_shutdown if in_handshake
Mar 31 07:37:44 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9851] SSL 37.187.100.2:38500, 563 total
Mar 31 07:37:46 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9852] SSL 62.210.246.163:43342, 564 total
Mar 31 07:37:47 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9853] SSL 87.182.127.196:56363, 564 total
Mar 31 07:37:48 electrumd electrumx_server.py[1186]: INFO:ElectrumX:[9854] SSL 62.210.246.163:43914, 565 total
Has there been any further progress? At the moment lsof is showing hundreds of connections from one IP that never appears in the log at all. The previous instance exhausted its 10,000 file handles presumably for the same reason. Are regular restarts the only solution?
Until it is fixed upstream in Python, yes, the only solution is to restart.
To mitigate the issue, you can use iptables to restrict each IP to N connections (I use 8):
iptables -A INPUT -p tcp --syn --dport 50002 -m connlimit --connlimit-above 8 -j REJECT
Also, as a super ugly hack, you can periodically use lsof to compare open and established connections with electrumx_rpc sessions, then use gdb to force those spare connections closed. It's not guaranteed to be side-effect-free, but I've been using it a while and it seems helpful. Here's the PHP script I wrote to do that (I told you it was super ugly..): https://gist.github.com/erorus/57f63ff486413746915127cdfa30dc9e
Thanks @erorus, I have that iptables rule in place now and it's helping keep things under control.
I know it needs to be fixed in Python, but is there a Python bug report active at the moment? I know that python/asyncio#483 was opened by @kyuupichan but that hasn't moved since the proposed fix (python/cpython#480), and from @shsmith's comments above the problem persists after that is applied. In fact the asyncio repo containing that issue has been closed since March, so it seems unlikely to ever be revisited - new asyncio bugs are meant to be raised at https://bugs.python.org.
https://bugs.python.org/issue29970
It's not getting much attention.
Thanks, I've suggested a fix on that issue.
A monkey patch of that idea in case anyone wants to try it: https://gist.github.com/mocmocamoc/8544c288a486e95514fc982a7a0e8620
@mocmocamoc where does this patch go? In one of the electrumx files? Could you post a diff of the patched file? Thanks!
@mocmocamoc I found your notes here: https://bugs.python.org/issue29970 and was able to apply the monkeypatch to /usr/lib/python3.6/asyncio/sslproto.py.
I will report back test results in a few days.
Good news! The leakage is completely stopped with this insertion: https://github.com/python/cpython/pull/4825/files#diff-0827a8b032e7f279fa8f66eee271f6ceR559
Thanks @mocmocamoc! This should be merged upstream.
I can also concur that merging PR #4825 fixed the handle leakage. It is an issue independent of ElectrumX, but I feel there should be a mention of the issue, and of the solution, in the docs.
Once it is fixed upstream I will make that version of Python the recommended minimum.
python/cpython#4825 has been merged so this should be fixed as of Python 3.7.0a4.
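If I'm reading the merged change correctly, 3.7 also exposes this as an ssl_handshake_timeout argument on the event-loop APIs, so a server can tighten the limit itself, roughly like this (certificate paths are placeholders):

import asyncio
import ssl

async def serve():
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    ctx.load_cert_chain('server.crt', 'server.key')  # placeholder paths
    loop = asyncio.get_event_loop()
    # Python 3.7+: sockets that don't finish the TLS handshake within
    # ssl_handshake_timeout seconds are aborted instead of leaking.
    return await loop.create_server(asyncio.Protocol, port=50002, ssl=ctx,
                                    ssl_handshake_timeout=10)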
As this will be fixed in Python 3.7 I will close this issue.