SSL handshake interruption causes hard fault
0xDBFB7 opened this issue · 19 comments
Somewhat related to #193, if a socket times out about 20 times the board hard faults (unrecoverable).
Socket Open
Socket has timed out: [Errno 104] ECONNRESET
Socket Close
Socket Open
assertion "0" failed: file "../../py/../extmod/modussl_mbedtls.c", line 151, function: socket_new
abort() was called at PC 0x40131ad3 on core 0
Backtrace: 0x400873fb:0x3ffce020 0x40087427:0x3ffce040 0x40131ad3:0x3ffce060 0x400ecf2b:0x3ffce090 0x400df716:0x3ffce0e0 0x400dbbc1:0x3ffce110 0x400dbc29:0x3ffce130 0x400e8c09:0x3ffce150 0x400df7a4:0x3ffce1f0 0x400dbbc1:0x3ffce220 0x400e8b7f:0x3ffce240 0x400df7a4:0x3ffce2e0 0x400dbbc1:0x3ffce360 0x400e8b7f:0x3ffce380 0x400df7a4:0x3ffce420 0x400dbbc1:0x3ffce450 0x400dbbee:0x3ffce470 0x400f7352:0x3ffce490 0x400f75b5:0x3ffce530 0x400eeea7:0x3ffce550
Rebooting...
���x����xx�������
The relevant code:
def verify_server(socket):
certHash = binascii.hexlify(
hashlib.sha256(socket.getpeercert(True)).digest())
print("Certificate Hash: " + str(certHash))
for server in server_dict:
if certHash in server['fingerprints']:
return True
return False
def send_to_server(data, url, server_index=0):
wdt.feed()
print("Sending " + str(data) + "to server " + url)
s = socket.socket()
print("Socket Open")
try:
s.settimeout(30)
# server redundancy should be implemented. #FIXME: This will probably
# do a dns lookup every time - probably shouldn't.
ai = socket.getaddrinfo(server_dict[server_index]['address'], 443)
addr = ai[0][-1] # cache this
s.connect(addr)
s = ssl.wrap_socket(s, cert=sump_cert_2.CERT, key=sump_cert_2.KEY)
if not verify_server(s):
print("Certificate doesn't match.")
s.close()
return None
server = server_dict[server_index]
http_headers = '\r\n'.join([
"POST {} HTTP/1.1".format(url),
"Connection: Keep-Alive",
"Accept: */*",
"Host: {}".format(server['address']),
"User-Agent: SafeSumpSeldon",
"Content-Type: application/x-msgpack",
"Content-Length: {}".format(len(data)),
"\r\n"])
s.write(http_headers)
s.write(data)
response = s.read(2048)
except OSError as e:
print("Socket has timed out: {}".format(e))
except Exception as e:
print("Server connection failed: {}".format(e))
store_in_backlog(data)
else:
return response
# process_backlog()
finally:
s.close()
print("Socket Close")
return None
We've been having numerous issues with sockets hardfaulting recently - it's a pretty serious issue for us.
I'm running the latest esp-idf (I got constant #193 errors otherwise).
Thanks!
@danicampora That's usually a memory allocation error, if the mbedtls documentation is to be believed.
@MrSurly The plot thickens.
Tried running
for i in range(0,500):
try:
send_to_server()
except:
pass
If the server's off, I get:
Sending test to server /update
Socket Open
Socket has timed out: [Errno 104] ECONNRESET
OSError:
Interesting that it never gets to s.close()? Weird. Anyhow, gc.mem_free() counts down by 0.5k every loop, and after about 60 cycles I get a string of OSErrors. Adding gc.collect() fixes that.
Even weirder, though, is what happens if the ssl handshake gets interrupted (server gets killed, wifi drops, etc):
Socket Open
mbedtls_ssl_handshake error: -68
Socket has timed out: [Errno 5] EIO
gc.mem_free(): 45552
...
Socket Open
Socket has timed out: [Errno 104] ECONNRESET
42000
Sending test to server
Socket Open
assertion "0" failed: file "../../py/../extmod/modussl_mbedtls.c", line 151, function: socket_new
abort() was called at PC 0x40131a87 on core 0
Backtrace: 0x400873fb:0x3ffce0f0 0x40087427:0x3ffce110 0x40131a87:0x3ffce130 0x400ecf2b:0x3ffce160 0x400df716:0x3ffce1b0 0x400dbbc1:0x3ffce1e0 0x400dbc29:0x3ffce200 0x400e8c09:0x3ffce220 0x400df7a4:0x3ffce2c0 0x400dbbc1:0x3ffce2f0 0x400e8b7f:0x3ffce310 0x400df7a4:0x3ffce3b0 0x400dbbc1:0x3ffce430 0x400dbbee:0x3ffce450 0x400f7352:0x3ffce470 0x400f759c:0x3ffce510 0x400eeed8:0x3ffce550
Rebooting...
Interesting that it never gets to s.close()? Weird. Anyhow, gc.mem_free() counts down by 0.5k every loop, and after about 60 cycles I get a string of OSErrors. Adding gc.collect() fixes that.
These are all related; close de-allocates the memory associated with the mbedtls objects internal to the μPy SSL object, but if it's never called, then there's a memory leak, but one that's outside of the MicroPython heap.
The μPy garbage collector will destroy the μPy SSL object, but the internal mbedtls SSL objects don't get free-ed, because there's no finalizer for μPy SSL objects.
There are two problems to be fixed:
- Figure out why the
finallyblock (and alsoclose())is never called in your example. - Add a finalizer to
modussl_mbedtls.cfor SSL objects that eventually hit the GC that haven't beenclose-ed properly.modsocket.chas a finalizer for this purpose.
I modified your code a bit so I could run it here, and it's working for me. It errors out, but s.close() is being called.
#!/usr/bin/env python3
import socket
import ssl
import binascii
import sys
def send_to_server(data, url, server_index=0):
print("Sending " + str(data) + "to server " + url)
s = socket.socket()
print("Socket Open")
address = '127.0.0.1'
try:
s.settimeout(30)
# server redundancy should be implemented. #FIXME: This will probably
# do a dns lookup every time - probably shouldn't.
ai = socket.getaddrinfo(address, 443)
addr = ai[0][-1] # cache this
s.connect(addr)
s = ssl.wrap_socket(s, cert=sump_cert_2.CERT, key=sump_cert_2.KEY)
http_headers = '\r\n'.join([
"POST {} HTTP/1.1".format(url),
"Connection: Keep-Alive",
"Accept: */*",
"Host: {}".format(address),
"User-Agent: SafeSumpSeldon",
"Content-Type: application/x-msgpack",
"Content-Length: {}".format(len(data)),
"\r\n"])
s.write(http_headers)
s.write(data)
response = s.read(2048)
except OSError as e:
print("Socket has timed out: {}".format(e))
sys.print_exception(e)
except Exception as e:
print("Server connection failed: {}".format(e))
sys.print_exception(e)
else:
return response
# process_backlog()
finally:
s.close()
print("Socket Close")
return None
def test():
for i in range(0, 500):
try:
send_to_server('hello', 'strawberry.local')
except Exception as e:
print ('outer exception', e)
sys.print_exception(e)I've updated the ESP IDF to the latest version (note that it requires updating the xtensa toolchain), so the behaviour of this bug may have changed.
@dpgeorge Yep:
I'm not getting that weird socket close behavior, all timeout mem-leaks are gone, however, if the SSL connection is interrupted for some reason, the thing still crashes:
Socket Open
Socket has timed out: [Errno 104] ECONNRESET
Socket Close
129
Sending test to server /update
Socket Open
assertion "0" failed: file "../../py/../extmod/modussl_mbedtls.c", line 151, function: socket_new
abort() was called at PC 0x401323ff on core 0
Rebooting...
Interestingly, it's not unrecoverable anymore - the esp completes rebooting and functions fine again. I think I'm okay with that behavior for now.
Thanks!
@BasedOnTechnology the assertion failing on line 151 looks like it can only happen due to an out-of-memory condition in mbedtls and is not related to the actual socket connection. How exactly do you trigger this error? Is it reproducible?
@dpgeorge What do you think about adding finalizer to SSL sockets? Regular sockets already have one. I think that that could definitely be one cause of this, since close for SSL is the only thing that releases mbedtls RAM, and it's not called during GC.
What do you think about adding finalizer to SSL sockets?
Yes it makes sense to do that.
I'll add it to mainline, though I don't have HW to test.
@dpgeorge Yep, the error's completely reproducible. I have control over the server the board is connecting to - if I kill it (simulating a drop in wifi, say), the board always resets.
Sorry guys, but that didn't seem to do it.
[I kill the server]
Socket Open
mbedtls_ssl_handshake error: -7280
Socket has timed out: [Errno 5] EIO
Socket Close
[that's normal.]
[I kill the server in the middle of a handshake]
Socket Open
mbedtls_ssl_handshake error: -68
Socket Close
Socket Open
mbedtls_ssl_handshake error: -10
Socket has timed out: [Errno 5] EIO
Socket Close
Traceback (most recent call last):
File "main.py", line 550, in <module>
File "main.py", line 527, in process_data
AttributeError: 'NoneType' object has no attribute 'decode'
I (303267) wifi: bcn_timout,ap_probe_send_start
I (305767) wifi: ap_probe_send over, resett wifi status to disassoc
I (305767) wifi: state: run -> init (1)
I (305767) wifi: n:5 0, o:5 0, ap:255 255, sta:5 0, prof:1
I (305767) wifi: pm stop, total sleep time: 0/297598933
[this isn't unrelated, I never get these messages otherwise]
I (305767) wifi: STA_DISCONNECTED, reason:200
beacon timeout
I (308187) wifi: STA_DISCONNECTED, reason:201
no AP found
I (310597) wifi: STA_DISCONNECTED, reason:201
no AP found
I (313007) wifi: STA_DISCONNECTED, reason:201
no AP found
I (315417) wifi: STA_DISCONNECTED, reason:201
no AP found
[server's back up]
Socket Open
mbedtls_ssl_handshake error: -4290
Socket has timed out: [Errno 5] EIO
Socket Close
[and now I can never get a handshake again]
Socket Open
Socket has timed out: 118
Socket Close
Here's the bare minimum code that still reproduces this issue.
Make an ssl server with
openssl req -x509 -newkey rsa:2048 -keyout key.pem -out cert.pem -days 365 -nodes
openssl s_server -key key.pem -cert cert.pem -accept 0.0.0.0:8000 -www
With the latest daily build from micropython.org, run
import network
import socket
import ussl
import utime
wlan = network.WLAN(network.STA_IF)
wlan.active(True)
wlan.connect(wifinetwork, password)
utime.sleep(10)
while(True):
try:
s = socket.socket()
s.settimeout(30)
ai = socket.getaddrinfo('192.168.0.42', 8000)
addr = ai[0][-1]
s.connect(addr)
s = ussl.wrap_socket(s)
s.close()
except:
s.close()
print('Timeout')
Depending on if you kill the server with Ctrl-C after the 10 second timeout, restart the server, disconnect wifi, or just leave it running, you'll see a bunch of different behaviors.
If I just leave it running with the server off, for instance, eventually I get a string of
I (89158) wifi: bcn_timout,ap_probe_send_start
I (91658) wifi: ap_probe_send over, resett wifi status to disassoc
I (91658) wifi: state: run -> init (1)
I (91668) wifi: n:5 0, o:5 0, ap:255 255, sta:5 0, prof:1
I (91668) wifi: pm stop, total sleep time: 0/87161565
I (91668) wifi: STA_DISCONNECTED, reason:200
beacon timeout
I (94088) wifi: STA_DISCONNECTED, reason:201
no AP found
Thanks for looking into this.
Adding gc.collect() seems to completely solve the issue now.
Adding gc.collect() seems to completely solve the issue now.
That's odd. I've seen that before, but only with the Pycom fork.
Edit: Maybe because there aren't enough µPy objects to trigger the GC, but they are consuming considerable FreeRTOS heap memory. Which is odd, because calling close() should perform the same cleanup that gc.collect() would trigger via the finaliser.
I think I found the issue: there's only enough RAM (on the FreeRTOS heap) to create 2 SSL connections. If you try to create a third while the others are active then it will hit an assertion because mbedtls can't allocate any more memory.
And this case is being hit in the above test by the following sequence of events:
- server dies during a handshake, leading to a memory leak on esp32
- server restarts, connections can be made
- server dies again during a handshake, leading to another memory leak
- server restarts
- next connection fails due to out-of-memory
The fix (applied temporarily in this repo in 54c6ebc, @BasedOnTechnology please can you test this fix?) is to clean up the mbedtls state when there is an error during the creation of the ssl object.
@dpgeorge Thanks for the fix, I haven't encountered any issues with that commit. Tentatively, it looks like it's solved the problem.
The fix mentioned just above was merged upstream some time ago.