Solution for GCP PubSub routine 502's, again
SophisticaSean opened this issue · 7 comments
This PR did not fix this issue, nor does it attempt to retry 502's from google.
I just tested this in production against latest master/main in this repo and the 502's started showing up again.
503's are not retried, but should be:
Unable to fetch events from Cloud Pub/Sub. Reason:
Request to "http://pubsub.googleapis.com/v1/projects/my-project/subscriptions/my-topic:pull" failed with status 503, got:
"upstream connect error or disconnect/reset before headers. reset reason: connection termination"
502's are not retried, but should be:
Unable to fetch events from Cloud Pub/Sub. Reason:
Request to "http://pubsub.googleapis.com/v1/projects/my-project/subscriptions/my_topic:pull" failed with status 502, got:
"<!DOCTYPE html>
<html lang=en>
<meta charset=utf-8>
<meta name=viewport content=\"initial-scale=1, minimum-scale=1, width=device-width\">
<title>Error 502 (Server Error)!!1</title>
<style>
*{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:54px;width:150px}
</style>
<a href=//www.google.com/><span id=logo aria-label=Google></span></a>
<p><b>502.</b> <ins>That’s an error.</ins>
<p>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds. <ins>That’s all we know.</ins>
"
My solution in this PR has been working for several months in production with 0 503 or 502 errors.
Running the current master code in production resulted in about 1-2 of these errors every couple hours. When I was on my fix in production we hadn't had one of these errors since March 3rd 2021.
@wronfim since you wrote the PR I thought you should know.
@SophisticaSean thank you for taking the time to test the master branch and reporting the issue.
I was able to reproduce this with an integration test:
@@ -118,6 +118,9 @@ defmodule BroadwayCloudPubSub.GoogleApiClient do
def acknowledge(ack_ids, opts) do
opts
|> conn!()
+ |> Map.update!(:pre, fn pre ->
+ [{Tesla.Middleware.BaseUrl, :call, ["http://localhost:9999"]} | pre]
+ end)
|> pubsub_projects_subscriptions_acknowledge(
opts.subscription.projects_id,
opts.subscription.subscriptions_id,
diff --git a/mix.exs b/mix.exs
index f874e93..87ba60f 100644
--- a/mix.exs
+++ b/mix.exs
@@ -36,7 +36,8 @@ defmodule BroadwayCloudPubSub.MixProject do
{:google_api_pub_sub, "~> 0.11"},
{:hackney, "~> 1.6"},
{:goth, "~> 1.0", optional: true},
- {:ex_doc, "~> 0.23", only: :docs}
+ {:ex_doc, "~> 0.23", only: :docs},
+ {:bypass, "~> 2.1", only: :test}
]
end
diff --git a/test/broadway_cloud_pub_sub/google_api_client_test.exs b/test/broadway_cloud_pub_sub/google_api_client_test.exs
index 4128d74..f048e66 100644
--- a/test/broadway_cloud_pub_sub/google_api_client_test.exs
+++ b/test/broadway_cloud_pub_sub/google_api_client_test.exs
@@ -340,16 +340,19 @@ defmodule BroadwayCloudPubSub.GoogleApiClientTest do
test_pid = self()
{:ok, counter} = Agent.start_link(fn -> 0 end)
- Tesla.Mock.mock(fn %{method: :post} ->
+ bypass = Bypass.open(port: 9999)
+
+ Bypass.expect(bypass, fn conn ->
if Agent.get_and_update(counter, &{&1, &1 + 1}) < 3 do
send(test_pid, :pong)
- {:error, %Tesla.Env{status: 503}}
+ Plug.Conn.send_resp(conn, 503, "oops")
else
- {:ok, %Tesla.Env{status: 200, body: "{}"}}
+ Plug.Conn.send_resp(conn, 200, "{}")
end
end)
- opts = Keyword.put(base_opts, :retry, max_retries: 3)
+ opts = Keyword.put(base_opts, :__internal_tesla_adapter__, Tesla.Adapter.Httpc)
+ opts = Keyword.put(opts, :retry, max_retries: 3)
{:ok, opts} = GoogleApiClient.init(opts)
assert GoogleApiClient.acknowledge(["1", "2"], opts) == :ok
and you're totally right, the retry logic was not executed for http errors. It would only be triggered for socket errors like :nxdomain and such.
The fix is simply:
diff --git a/lib/broadway_cloud_pub_sub/google_api_client.ex b/lib/broadway_cloud_pub_sub/google_api_client.ex
index d86aa1d..8784e68 100644
--- a/lib/broadway_cloud_pub_sub/google_api_client.ex
+++ b/lib/broadway_cloud_pub_sub/google_api_client.ex
@@ -55,7 +55,7 @@ defmodule BroadwayCloudPubSub.GoogleApiClient do
%{client | adapter: adapter, pre: client.pre ++ pre}
end
- defp should_retry?({:error, %{status: code}}), do: code in @retry_codes
+ defp should_retry?({:ok, %{status: code}}), do: code in @retry_codes
defp should_retry?({:error, _reason}), do: true
defp should_retry?(_other), do: false
@wronfim if you, or anyone else, would like to submit the fix along with the proper test setup with bypass (which doesn't require hardcoding the test url like I did) that'd be very appreciated.
Good catch! I'm on it.
@SophisticaSean meanwhile you can implement your own should_retry?/1
function and pass it to the retry opts.
defp should_retry?({:error, %{status: code}}), do: code in @retry_codes
defp should_retry?({:ok, %{status: code}}), do: code in @retry_codes
defp should_retry?({:error, _reason}), do: true
defp should_retry?(_other), do: false
Thanks for getting to this so quickly! I'm excited to try the new changes soon. :)
Closing in favour of PR #60