Crash during termination when connecting to a server requiring mTLS over HTTP/2
voluntas opened this issue · 17 comments
I have been encountering a crash during termination when I try to connect to a server that requires mTLS via HTTP/2.
When using HTTP/2 to connect, if the server requires mTLS, it appears to fail at Protocol:init in this section, leaving protocol as undefined and proceeding to the disconnect step:
https://github.com/ninenines/gun/blob/master/src/gun.erl#L1259-L1263
Following this, in the disconnect step, it attempts to call undefined:close, which ultimately leads to a crash:
https://github.com/ninenines/gun/blob/master/src/gun.erl#L1742
As I'm not sure how to properly address this, I decided to start by creating an issue here.
I'm encountering an error when setting up a local HTTP/2 server written in Go, setting a root CA only on the server side for mTLS, and then attempting to run it.
1> Uri = #{scheme := Scheme, host := Host, path := Path} = uri_string:parse(<<"https://localhost:48080/test">>).
#{port => 48080,scheme => <<"https">>,path => <<"/test">>,
host => <<"localhost">>}
2> Transport = tls.
tls
3> Port = 48080.
48080
4> Hostname = binary_to_list(Host).
"localhost"
5> application:ensure_all_started(gun).
{ok,[cowlib,gun]}
6> TlsOpts = [{versions, ['tlsv1.3', 'tlsv1.2']}, {verify, verify_none}, {server_name_indication, Hostname}].
[{versions,['tlsv1.3','tlsv1.2']},
{verify,verify_none},
{server_name_indication,"localhost"}]
7> {ok, ConnPid} = gun:open(Hostname, Port, #{transport => Transport, protocols => [http2], tls_opts => TlsOpts}).
{ok,<0.203.0>}
8> =NOTICE REPORT==== 22-May-2023::23:03:53.704607 ===
TLS client: In state connection received SERVER ALERT: Fatal - Bad Certificate
=ERROR REPORT==== 22-May-2023::23:03:53.713149 ===
** State machine <0.203.0> terminating
** Last event = {internal,
{connected,
{sslsocket,
{gen_tcp,#Port<0.9>,tls_connection,undefined},
[<0.211.0>,<0.210.0>]},
http2}}
** When server state = {connected,
{state,<0.188.0>,
{up,#Ref<0.4004690427.1749286913.204625>},
"localhost",48080,<<"https">>,"localhost",
48080,[],
#{protocols => [http2],
tls_opts =>
[{versions,['tlsv1.3','tlsv1.2']},
{verify,verify_none},
{server_name_indication,"localhost"}],
transport => tls},
undefined,undefined,gun_tls,true,
{ssl,ssl_closed,ssl_error},
undefined,undefined,undefined,
gun_default_event_h,undefined}}
** Reason for termination = error:{'module could not be loaded',undefined}
** Callback modules = [gun]
** Callback mode = state_functions
** Stacktrace =
** [{gun,disconnect,2,
[{file,"/Users/voluntas/src/erlang/gun/src/gun.erl"},{line,1742}]},
{gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,1377}]},
{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,241}]}]
=CRASH REPORT==== 22-May-2023::23:03:53.713306 ===
crasher:
initial call: gun:init/1
pid: <0.203.0>
registered_name: []
exception error: undefined function undefined:close/4
in function gun:disconnect/2 (/Users/voluntas/src/erlang/gun/src/gun.erl, line 1742)
in call from gen_statem:loop_state_callback/11 (gen_statem.erl, line 1377)
ancestors: [gun_conns_sup,gun_sup,<0.197.0>]
message_queue_len: 0
messages: []
links: [<0.199.0>]
dictionary: []
trap_exit: false
status: running
heap_size: 10958
stack_size: 28
reductions: 26180
neighbours:
=SUPERVISOR REPORT==== 22-May-2023::23:03:53.714281 ===
supervisor: {local,gun_conns_sup}
errorContext: child_terminated
reason: {undef,[{undefined,close,
[{error,closed},
undefined,gun_default_event_h,undefined],
[]},
{gun,disconnect,2,
[{file,"/Users/voluntas/src/erlang/gun/src/gun.erl"},
{line,1742}]},
{gen_statem,loop_state_callback,11,
[{file,"gen_statem.erl"},{line,1377}]},
{proc_lib,init_p_do_apply,3,
[{file,"proc_lib.erl"},{line,241}]}]}
offender: [{pid,<0.203.0>},
{id,gun},
{mfargs,{gun,start_link,undefined}},
{restart_type,temporary},
{significant,false},
{shutdown,5000},
{child_type,worker}]
With mTLS you have to send the client's certificate I believe and I don't see it configured in your snippet. You will likely need to set {verify, verify_peer}, {cacerts, ...}, {cert, ...}
and maybe a few more options.
Yes, for mTLS, a client certificate must be sent. This case is concerned that if the client certificate required for mTLS is not set, a Bad Certificate will be returned and disconnect will run with Protocol unconfigured, causing a crash.
Not sending the client certificate here is intentional.
If the server is requesting mTLS I want to receive an error and exit without crashing. Is there a better way to do this?
The intent is that we will not know if the server requires mTLS until gun actually tries to connect with TLS.
If the server requires mTLS (client certificate), we would like gun to terminate the connection without crashing and log that a client certificate needs to be set.
This is because client certificates are not always required by our product and we treat them as optional.
Ah fair enough.
Hmm I suppose a trace on the module gun
would help (set it up before gun:open) because I do not see how this could happen from looking at the code. Also please ensure that this is the most recent Gun version just in case.
dbg:tracer().
dbg:tpl(gun, []).
dbg:p(all, c).
gun:open(...)
Thanks for letting me know how to trace! I was having trouble figuring out how to follow it.
Actually, there is another problem with the ssl library returning unexpected errors (I was going to file a separate Issue), so I will investigate things first.
I'll post back here if I find anything.
I was using the latest 2.0.1 version of gun.
Eshell V14.0 (press Ctrl+G to abort, type help(). for help)
1> application:ensure_all_started(gun).
{ok,[cowlib,gun]}
2> application:get_key(gun, vsn).
{ok,"2.0.1"}
Using the trace you gave me, I was able to somewhat understand the problem.
When calling gun_tls:connect/4 i.e. ssl:connect/4,
{ok, TLSSocket} is returned, but a Fatal - Bad Certificate is printed. Then with gun_http2:send ssl calls a connection that has already failed and returns {error, closed}.
If ssl:connect/4 only outputs an error log on Bad Certificate and does nothing else, then this problem seems hard to work around ... I'll check the ssl side.
(<0.257.0>) call gun:initial_tls_handshake(internal,{retries,5,#Port<0.9>},{state,<0.234.0>,
{up,#Ref<0.4232028639.3722706945.249520>},
"localhost",48080,<<"https">>,"localhost",48080,[],
#{protocols => [http2],
tls_opts =>
[{versions,['tlsv1.3','tlsv1.2']},
{verify,verify_none},
{server_name_indication,"localhost"}],
transport => tls},
undefined,undefined,gun_tls,true,
{ssl,ssl_closed,ssl_error},
undefined,undefined,undefined,gun_default_event_h,undefined})
(<0.257.0>) call gun:ensure_alpn_sni([http2],[{versions,['tlsv1.3','tlsv1.2']},
{verify,verify_none},
{server_name_indication,"localhost"}],"localhost")
(<0.257.0>) call gun:'-ensure_alpn_sni/3-fun-0-'(http2,[])
(<0.257.0>) call gun:normal_tls_handshake(#Port<0.9>,{state,<0.234.0>,
{up,#Ref<0.4232028639.3722706945.249520>},
"localhost",48080,<<"https">>,"localhost",48080,[],
#{protocols => [http2],
tls_opts =>
[{versions,['tlsv1.3','tlsv1.2']},
{verify,verify_none},
{server_name_indication,"localhost"}],
transport => tls},
undefined,undefined,gun_tls,true,
{ssl,ssl_closed,ssl_error},
undefined,undefined,undefined,gun_default_event_h,undefined},#{timeout => infinity,
tls_opts =>
[{alpn_advertised_protocols,[<<"h2">>]},
{client_preferred_next_protocols,{client,[<<"h2">>],<<"http/1.1">>}},
{versions,['tlsv1.3','tlsv1.2']},
{verify,verify_none},
{server_name_indication,"localhost"}]},[http2])
(<0.257.0>) call gun:ensure_alpn_sni([http2],[{alpn_advertised_protocols,[<<"h2">>]},
{client_preferred_next_protocols,{client,[<<"h2">>],<<"http/1.1">>}},
{versions,['tlsv1.3','tlsv1.2']},
{verify,verify_none},
{server_name_indication,"localhost"}],"localhost")
(<0.257.0>) call gun:'-ensure_alpn_sni/3-fun-0-'(http2,[])
(<0.257.0>) call gun_tls:connect(#Port<0.9>,[{alpn_advertised_protocols,[<<"h2">>]},
{client_preferred_next_protocols,{client,[<<"h2">>],<<"http/1.1">>}},
{alpn_advertised_protocols,[<<"h2">>]},
{client_preferred_next_protocols,{client,[<<"h2">>],<<"http/1.1">>}},
{versions,['tlsv1.3','tlsv1.2']},
{verify,verify_none},
{server_name_indication,"localhost"}],infinity)
=NOTICE REPORT==== 23-May-2023::09:10:09.751519 ===
TLS client: In state connection received SERVER ALERT: Fatal - Bad Certificate
I put sleep after calling gun_tls and found that after returning the TLSSocket, the Bad Certificate log is output in the first place.
case gun_tls:connect(Socket, TLSOpts, TLSTimeout) of
{ok, TLSSocket} ->
timer:sleep(10000),
(<0.257.0>) call gun_tls:connect(#Port<0.9>,[{alpn_advertised_protocols,[<<"h2">>]},
{client_preferred_next_protocols,{client,[<<"h2">>],<<"http/1.1">>}},
{alpn_advertised_protocols,[<<"h2">>]},
{client_preferred_next_protocols,{client,[<<"h2">>],<<"http/1.1">>}},
{versions,['tlsv1.3','tlsv1.2']},
{verify,verify_none},
{server_name_indication,"localhost"}],infinity)
=NOTICE REPORT==== 23-May-2023::09:20:40.824080 ===
TLS client: In state connection received SERVER ALERT: Fatal - Bad Certificate
Notes during the investigation are summarized in the following URL
https://gist.github.com/voluntas/7c95e9dea519cecb5ab8b9e5a57f5a72
Summary of this issue
After connecting with gun_tls:connect/4 (ssl:connect/4), when I receive a Bad Certificate alert from the server side, I get {error, closed} when I check the status with ssl:connection_info/1. This is why gun_http2:init/4
This is why a call to gun_tls:send/2 (ssl:send/2) in gun_http2:init/4 returns {error, closed}.
Furthermore, when the {error, closed} is returned in gun_http2:init/4, the protocol in state0 is still undefined and not configured, which causes a crash when gun:disconnect is called.
Since ssl is already closed at this point, {error, closed} will be returned.
https://github.com/ninenines/gun/blob/master/src/gun_http2.erl#L197
Here, #state protocol is undefined, so it crashes on disconnect
https://github.com/ninenines/gun/blob/master/src/gun.erl#L1263
Related issues
Bad Certificate Alert is returned when calling ssl:negotiated_protocol/1 after calling gun_tls:connect/4, ssl:negotiated_protocol/1 may return {error, closed}.
In that case gun_protocol:negotiated/2 will crash because it only expects {error, protocol_not_negotiated} errors.
If the timing is bad, ssl is closed at the timing when TLSSocket is returned and ssl:negotiated_protocol/1 is called, so {error, closed} is returned and a crash occurs.
https://github.com/ninenines/gun/blob/master/src/gun.erl#L1210
This can be reproduced reliably by putting sleep after TLSSocket is returned.
--- .
Therefore, we believe the following modifications are necessary
- If {error, closed} is returned in gun_http2:init/4, set the protocol to a value so that it does not crash when disconnect is called.
- Check for {error, closed} at the time ssl:negotiated_protocol is called
If ssl:negotiated_protocol/1
was returning {error, closed}
the crash would be earlier. The documentation for the function shows that it cannot return {error, closed}
: https://www.erlang.org/doc/man/ssl.html#negotiated_protocol-1 - have you seen a crash there or just the bad certificate error?
The fix would be to set the protocol value before calling disconnect as you mention. Did you try and see if that fixes it?
It seems that calling ssl:negotiated_protocol/1 with a Bad Certificate Alert being sent from the server returns {error, closed}. Perhaps this is an omission in the OTP documentation.
I tried setting a value for Protocol, but gun_http2:close didn't work due to other states, so I'll follow up a bit more.
Thanks again for your quick replies, they are always helpful.
The following crash log is caused by putting TLSSocket to sleep for 1 second after it returns.
This is caused by ssl:negotiated_protocol/1 returning {error, close}.
https://github.com/ninenines/gun/blob/master/src/gun.erl#L1210
I have added ?debugVal/1 to output the log as follows
case gun_tls:connect(Socket, TLSOpts, TLSTimeout) of
{ok, TLSSocket} ->
timer:sleep(1000),
?debugVal(ssl:negotiated_protocol(TLSSocket)),
NewProtocol = gun_protocols:negotiated(ssl:negotiated_protocol(TLSSocket), Protocols),
Protocol = gun_protocols:handler(NewProtocol),
/Users/voluntas/src/erlang/gun/src/gun.erl:1390:<0.236.0>: ssl : negotiated_protocol ( TLSSocket ) = {error,closed}
1> Uri = #{scheme := Scheme, host := Host, path := Path} = uri_string:parse(<<"https://localhost:48080/test">>).
#{port => 48080,scheme => <<"https">>,path => <<"/test">>,
host => <<"localhost">>}
2> Transport = tls.
tls
3> Port = 48080.
48080
4> Hostname = binary_to_list(Host).
"localhost"
5> application:ensure_all_started(gun).
{ok,[cowlib,gun]}
6> application:get_key(gun, vsn).
{ok,"2.0.1"}
7> TlsOpts = [{versions, ['tlsv1.3', 'tlsv1.2']}, {verify, verify_none}, {server_name_indication, Hostname}].
[{versions,['tlsv1.3','tlsv1.2']},
{verify,verify_none},
{server_name_indication,"localhost"}]
8> {ok, ConnPid} = gun:open(Hostname, Port, #{transport => Transport, protocols => [http2], tls_opts => TlsOpts}).
{ok,<0.1005.0>}
9> =NOTICE REPORT==== 23-May-2023::22:06:53.044778 ===
TLS client: In state connection received SERVER ALERT: Fatal - Bad Certificate
/Users/voluntas/src/erlang/gun/src/gun.erl:1390:<0.236.0>: ssl : negotiated_protocol ( TLSSocket ) = {error,closed}
=ERROR REPORT==== 23-May-2023::22:10:44.597556 ===
** State machine <0.236.0> terminating
** Last event = {internal,{retries,5,#Port<0.9>}}
** When server state = {initial_tls_handshake,
{state,<0.220.0>,
{up,#Ref<0.682986761.3843555332.74398>},
"localhost",48080,<<"https">>,"localhost",
48080,[],
#{protocols => [http2],
tls_opts =>
[{versions,['tlsv1.3','tlsv1.2']},
{verify,verify_none},
{server_name_indication,"localhost"}],
transport => tls},
undefined,undefined,gun_tls,true,
{ssl,ssl_closed,ssl_error},
undefined,undefined,undefined,
gun_default_event_h,undefined}}
** Reason for termination = error:function_clause
** Callback modules = [gun]
** Callback mode = state_functions
** Stacktrace =
** [{gun_protocols,negotiated,
[{error,closed},[http2]],
[{file,"src/gun_protocols.erl"},{line,53}]},
{gun,normal_tls_handshake,4,
[{file,"/Users/voluntas/src/erlang/gun/src/gun.erl"},{line,1391}]},
{gun,initial_tls_handshake,3,
[{file,"/Users/voluntas/src/erlang/gun/src/gun.erl"},{line,1252}]},
{gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,1377}]},
{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,241}]}]
=CRASH REPORT==== 23-May-2023::22:10:44.597986 ===
crasher:
initial call: gun:init/1
pid: <0.236.0>
registered_name: []
exception error: no function clause matching
gun_protocols:negotiated({error,closed},[http2]) (src/gun_protocols.erl, line 53)
in function gun:normal_tls_handshake/4 (/Users/voluntas/src/erlang/gun/src/gun.erl, line 1391)
in call from gun:initial_tls_handshake/3 (/Users/voluntas/src/erlang/gun/src/gun.erl, line 1252)
in call from gen_statem:loop_state_callback/11 (gen_statem.erl, line 1377)
ancestors: [gun_conns_sup,gun_sup,<0.229.0>]
message_queue_len: 0
messages: []
links: [<0.231.0>]
dictionary: []
trap_exit: false
status: running
heap_size: 10958
stack_size: 28
reductions: 26682
neighbours:
=SUPERVISOR REPORT==== 23-May-2023::22:10:44.600972 ===
supervisor: {local,gun_conns_sup}
errorContext: child_terminated
reason: {function_clause,
[{gun_protocols,negotiated,
[{error,closed},[http2]],
[{file,"src/gun_protocols.erl"},{line,53}]},
{gun,normal_tls_handshake,4,
[{file,"/Users/voluntas/src/erlang/gun/src/gun.erl"},
{line,1391}]},
{gun,initial_tls_handshake,3,
[{file,"/Users/voluntas/src/erlang/gun/src/gun.erl"},
{line,1252}]},
{gen_statem,loop_state_callback,11,
[{file,"gen_statem.erl"},{line,1377}]},
{proc_lib,init_p_do_apply,3,
[{file,"proc_lib.erl"},{line,241}]}]}
offender: [{pid,<0.236.0>},
{id,gun},
{mfargs,{gun,start_link,undefined}},
{restart_type,temporary},
{significant,false},
{shutdown,5000},
{child_type,worker}]
If you specify gun_http2 as Protocol with #state{protocol = Protocol} at the time the error occurs, it will crash when calling gun_http2:close/4 because protocol_state is undefined.
State0#state{protocol = Protocol} will also cause a crash because protocol_state will be empty.
https://github.com/ninenines/gun/blob/master/src/gun.erl#L1263
https://github.com/ninenines/gun/blob/master/src/gun.erl#L1742
https://github.com/ninenines/gun/blob/master/src/gun_http2.erl#L906
Unfortunately, I couldn't think of a good fix for myself.
OK that's where we can get a closed
error: https://github.com/erlang/otp/blob/master/lib/ssl/src/ssl_gen_statem.erl#L1318-L1329 - probably a documentation issue in OTP, I will send a PR later.
As for what to do when gun_http2:init fails, I think calling disconnect
is not the right thing to do. We failed to establish a connection so we should go back to not_connected
like if there was a TCP error. I'll try and see if I can get something I can reproduce so I can write a test for it.
Thanks for the Pull-Request to the OTP side!
I think you are right that it is not a call to disconnect, but rather a state that the connection could not be made.
I would like to prepare a test mTLS and HTTP/2 environment here.
We have prepared a server to reproduce the problem.
Uri = #{scheme := Scheme, host := Host, path := Path} = uri_string:parse(<<"https://suzu.shiguredo.co.jp/test">>).
Hostname = binary_to_list(Host).
application:ensure_all_started(gun).
application:get_key(gun, vsn).
TlsOpts = [{versions, ['tlsv1.3', 'tlsv1.2']}, {verify, verify_none}, {server_name_indication, Hostname}].
{ok, ConnPid} = gun:open(Hostname, 443, #{transport => tls, protocols => [http2], tls_opts => TlsOpts}).
activity log
1> Uri = #{scheme := Scheme, host := Host, path := Path} = uri_string:parse(<<"https://suzu.shiguredo.co.jp/test">>).
#{scheme => <<"https">>,path => <<"/test">>,
host => <<"suzu.shiguredo.co.jp">>}
2> Hostname = binary_to_list(Host).
"suzu.shiguredo.co.jp"
3> application:ensure_all_started(gun).
{ok,[cowlib,gun]}
4> application:get_key(gun, vsn).
{ok,"2.0.1"}
5> TlsOpts = [{versions, ['tlsv1.3', 'tlsv1.2']}, {verify, verify_none}, {server_name_indication, Hostname}].
[{versions,['tlsv1.3','tlsv1.2']},
{verify,verify_none},
{server_name_indication,"suzu.shiguredo.co.jp"}]
6> {ok, ConnPid} = gun:open(Hostname, 443, #{transport => tls, protocols => [http2], tls_opts => TlsOpts}).
{ok,<0.192.0>}
7> =ERROR REPORT==== 24-May-2023::15:27:36.374424 ===
** State machine <0.192.0> terminating
** Last event = {internal,
{connected,
{sslsocket,
{gen_tcp,#Port<0.9>,tls_connection,undefined},
[<0.200.0>,<0.199.0>]},
http2}}
** When server state = {connected,
{state,<0.178.0>,
{up,#Ref<0.519776990.1320157185.171562>},
"suzu.shiguredo.co.jp",443,<<"https">>,
"suzu.shiguredo.co.jp",443,[],
#{protocols => [http2],
transport => tls,
tls_opts =>
[{versions,['tlsv1.3','tlsv1.2']},
{verify,verify_none},
{server_name_indication,
"suzu.shiguredo.co.jp"}]},
undefined,undefined,gun_tls,true,
{ssl,ssl_closed,ssl_error},
undefined,undefined,undefined,gun_default_event_h,
undefined}}
** Reason for termination = error:{'module could not be loaded',undefined}
** Callback modules = [gun]
** Callback mode = state_functions
** Stacktrace =
** [{gun,disconnect,2,
[{file,"/Users/voluntas/src/erlang/gun/src/gun.erl"},{line,1742}]},
{gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,1377}]},
{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,241}]}]
=CRASH REPORT==== 24-May-2023::15:27:36.374594 ===
crasher:
initial call: gun:init/1
pid: <0.192.0>
registered_name: []
exception error: undefined function undefined:close/4
in function gun:disconnect/2 (/Users/voluntas/src/erlang/gun/src/gun.erl, line 1742)
in call from gen_statem:loop_state_callback/11 (gen_statem.erl, line 1377)
ancestors: [gun_conns_sup,gun_sup,<0.185.0>]
message_queue_len: 0
messages: []
links: [<0.187.0>]
dictionary: []
trap_exit: false
status: running
heap_size: 6772
stack_size: 28
reductions: 26346
neighbours:
=SUPERVISOR REPORT==== 24-May-2023::15:27:36.375788 ===
supervisor: {local,gun_conns_sup}
errorContext: child_terminated
reason: {undef,[{undefined,close,
[{error,closed},
undefined,gun_default_event_h,undefined],
[]},
{gun,disconnect,2,
[{file,"/Users/voluntas/src/erlang/gun/src/gun.erl"},
{line,1742}]},
{gen_statem,loop_state_callback,11,
[{file,"gen_statem.erl"},{line,1377}]},
{proc_lib,init_p_do_apply,3,
[{file,"proc_lib.erl"},{line,241}]}]}
offender: [{pid,<0.192.0>},
{id,gun},
{mfargs,{gun,start_link,undefined}},
{restart_type,temporary},
{significant,false},
{shutdown,5000},
{child_type,worker}]
=NOTICE REPORT==== 24-May-2023::15:27:36.381311 ===
TLS client: In state connection received SERVER ALERT: Fatal - Bad Certificate
I can reproduce, thanks. But only if I use a timer:sleep after the connect. So I will have to do that in my tests (the tls_handshake_end event can be used to do the timer:sleep and/or call negotiated_protocols again). So now it's just a matter of finding the time to do the patch. Thanks!
Thanks for the response, too! I'm glad you were able to reproduce it. I am aware that it is a very small lace window.
We are slowly waiting for the patch to be applied :-)
Keep up the good work!
I'm close to having finished, I just need to replace the shiguredo server with something local in the tests I wrote.
I cannot test ssl:negotiated_protocol behavior in the test suite but I fixed it manually and it's probably not as much of an issue without timer:sleeps.
Done, thanks!