Don't delay errors when GOAWAY has been received

Question

Don't delay errors when GOAWAY has been received

zuiderkwast opened this issue 4 years ago · 7 comments

Scenario: On a connection with constant high load, the server sends a GOAWAY
frame. Then, before all the streams have completed, more requests are sent to
the Gun process by the application.

Expected behaviour: The gun_error needs to be sent instantly, so that the
application can retry the request on a different connection (typically to a
different machine in a cloud setup of some sort).

Actual behaviour: The errors are not sent until all streams are completed.
Additionally, two unexpected extra messages gun_down and gun_up are sent
when the streams have completed.

The test case below illustrates the current actual behavior:

new_requests_fail_after_receiving_goaway(_) ->
	doc("Check that Gun sends gun_error immediately for requests queued "
		"after receiving a GOAWAY frame. (RFC7540 x.x.x)"),
	MainPid = self(),
	{ok, OriginPid, Port} = init_origin(tcp, http2, fun(_, Socket, Transport) ->
		%% Step 2. Receive a HEADERS frame.
		{ok, <<Len1:24, 1:8, _:8, 1:32>>} = Transport:recv(Socket, 9, 1000),
		{ok, _SkippedHeader} = gen_tcp:recv(Socket, Len1, 1000),

		%% Step 3. Send GOAWAY with last stream ID = 1.
		ok = Transport:send(Socket, cow_http2:goaway(1, no_error, <<>>)),

		%% Step 5. Receive GOAWAY response from Gun.
		{ok, <<Len2:24, 7:8, _:8, 0:32>>} = Transport:recv(Socket, 9, 1000),
		{ok, _SkippedPayload} = gen_tcp:recv(Socket, Len2, 1000),

		%% Wait for step 6...
		timer:sleep(600),

		%% Step 7. Send a response for stream 1.
		{HeadersBlock, _} = cow_hpack:encode([
			{<<":status">>, <<"200">>}
		]),
		ok = Transport:send(Socket, [
			cow_http2:headers(1, nofin, HeadersBlock)
		]),
		ok = Transport:send(Socket, [
			cow_http2:data(1, fin, <<"hello">>)
		]),

		%% Check that Gun didn't send the second request.
		case Transport:recv(Socket, 9, 100) of
			{error, timeout} -> ok;
			{error, closed} -> ok
		end,
		timer:sleep(100),
		Transport:close(Socket)
	end),
	{ok, ConnPid} = gun:open("localhost", Port, #{
		protocols => [http2]
	}),
	{ok, http2} = gun:await_up(ConnPid),
	handshake_completed = receive_from(OriginPid),

	%% Step 1.
	StreamRef1 = gun:get(ConnPid, "/slow-req"),

	%% Step 4. Gun receives GOAWAY and responds with GOAWAY.
	timer:sleep(300),

	%% Step 6. The 2nd request is not sent to the server.
	StreamRef2 = gun:get(ConnPid, "/after-goaway"),

	%% Step 8.

        %% ---------------------------------------------------------------------
	%% Expected behaviour: Error for the 2nd request immediately, i.e.
	%% before receiving the response for the 1st request.
        %% ---------------------------------------------------------------------

        %% ---------------------------------------------------------------------
	%% Actual behaviour: We get response for the 1st request first.
	{gun_response, ConnPid, StreamRef1, nofin, 200, _} = receive M1 -> M1 end,
	{gun_data, ConnPid, StreamRef1, fin, <<"hello">>} = receive M2 -> M2 end,

	%% Then, the following two come in any order.
	receive {gun_error, ConnPid, StreamRef2, _closed} -> ok end,
	receive {gun_down, ConnPid, http2, closed, [StreamRef2]} -> ok end,

	%% The following two unexpected messages are also received.
	receive {gun_down, ConnPid, http2, normal, []} -> ok end,
	receive {gun_up, ConnPid, http2} -> ok end,
	%% ---------------------------------------------------------------------

	no_more_message = receive M3 -> M3 after 0 -> no_more_message end,
	gun:close(ConnPid).

This scenario is expected to be common whenever a remote server is being taken
out of service.

Apart from getting the gun_error as soon as possible, before the other streams
have completed, it would also be useful to be informed that the connection is
going down, so that the application can avoid sending more requests to it, or
perhaps it would be enough to indicate this in the Reason element of the
gun_error tuple.

Answer 1 · 2020-10-20T20:11:46.000Z

The gun_up isn't unexpected because you let Gun reconnect automatically (it'll do that by default). The second gun_down is not normal though.

In general it's OK to send more requests, they should be postponed until Gun has reconnected. But there's probably lots of corner case issues.

Answer 2 · 2020-10-20T20:16:44.000Z

Ah... So, disabling automatic reconnect will cause the errors to be returned immediately? I have to check....

Answer 3 · 2020-10-20T20:25:45.000Z

Not what I meant, this probably still is an issue. Just responding to the other points at the end of your report.

Answer 4 · 2020-10-20T20:34:37.000Z

Right. Anyway, the important point for us is that we can retry the request as soon as possible on a different connection, so I guess we must use retry => 0. Then, with the same scenario, I only get response headers and body for the first request and then a {gun_down, ConnPid, http2, normal, []}. No error for the 2nd request is sent.

	%% Actual behaviour: We get response for the 1st request first.
	{gun_response, ConnPid, StreamRef1, nofin, 200, _} = receive M1 -> M1 end,
	{gun_data, ConnPid, StreamRef1, fin, <<"hello">>} = receive M2 -> M2 end,

	%% Then, a single gun_down message.
	{gun_down, ConnPid, http2, normal, []} = receive M3 -> M3 end,

	%% No error is sent for the 2nd request.
	no_more_message = receive M4 -> M4 after 0 -> no_more_message end,

Answer 5 · 2020-10-20T22:05:16.000Z

So I suppose in that case the problem is that the request is postponed despite the fact that it won't be handled, when retry is disabled we could produce errors instead of postponing.

Answer 6 · 2020-10-21T15:16:20.000Z

Exactly. Here is a PR: #252.

Answer 7 · 2020-11-12T13:29:54.000Z

Merged, thanks!