dashbitco/broadway_cloud_pub_sub

Solution for GCP PubSub routine 502's, again

SophisticaSean opened this issue · 7 comments

This PR did not fix this issue, nor does it attempt to retry 502's from google.

I just tested this in production against latest master/main in this repo and the 502's started showing up again.

503's are not retried, but should be:

Unable to fetch events from Cloud Pub/Sub. Reason: 
Request to "http://pubsub.googleapis.com/v1/projects/my-project/subscriptions/my-topic:pull" failed with status 503, got:

"upstream connect error or disconnect/reset before headers. reset reason: connection termination"

502's are not retried, but should be:

Unable to fetch events from Cloud Pub/Sub. Reason: 
Request to "http://pubsub.googleapis.com/v1/projects/my-project/subscriptions/my_topic:pull" failed with status 502, got:

"<!DOCTYPE html>
<html lang=en>
  <meta charset=utf-8>
  <meta name=viewport content=\"initial-scale=1, minimum-scale=1, width=device-width\">
  <title>Error 502 (Server Error)!!1</title>
  <style>
    *{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:54px;width:150px}
  </style>
  <a href=//www.google.com/><span id=logo aria-label=Google></span></a>
  <p><b>502.</b> <ins>That’s an error.</ins>
  <p>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds.  <ins>That’s all we know.</ins>
"

My solution in this PR has been working for several months in production with 0 503 or 502 errors.

Running the current master code in production resulted in about 1-2 of these errors every couple hours. When I was on my fix in production we hadn't had one of these errors since March 3rd 2021.

@wronfim since you wrote the PR I thought you should know.

here's a datadog screenshot of these errors in the last month, 0 before I swapped to using master and now we're seeing them routinely:
image

@SophisticaSean thank you for taking the time to test the master branch and reporting the issue.

I was able to reproduce this with an integration test:

@@ -118,6 +118,9 @@ defmodule BroadwayCloudPubSub.GoogleApiClient do
   def acknowledge(ack_ids, opts) do
     opts
     |> conn!()
+    |> Map.update!(:pre, fn pre ->
+      [{Tesla.Middleware.BaseUrl, :call, ["http://localhost:9999"]} | pre]
+    end)
     |> pubsub_projects_subscriptions_acknowledge(
       opts.subscription.projects_id,
       opts.subscription.subscriptions_id,
diff --git a/mix.exs b/mix.exs
index f874e93..87ba60f 100644
--- a/mix.exs
+++ b/mix.exs
@@ -36,7 +36,8 @@ defmodule BroadwayCloudPubSub.MixProject do
       {:google_api_pub_sub, "~> 0.11"},
       {:hackney, "~> 1.6"},
       {:goth, "~> 1.0", optional: true},
-      {:ex_doc, "~> 0.23", only: :docs}
+      {:ex_doc, "~> 0.23", only: :docs},
+      {:bypass, "~> 2.1", only: :test}
     ]
   end
 
diff --git a/test/broadway_cloud_pub_sub/google_api_client_test.exs b/test/broadway_cloud_pub_sub/google_api_client_test.exs
index 4128d74..f048e66 100644
--- a/test/broadway_cloud_pub_sub/google_api_client_test.exs
+++ b/test/broadway_cloud_pub_sub/google_api_client_test.exs
@@ -340,16 +340,19 @@ defmodule BroadwayCloudPubSub.GoogleApiClientTest do
       test_pid = self()
       {:ok, counter} = Agent.start_link(fn -> 0 end)
 
-      Tesla.Mock.mock(fn %{method: :post} ->
+      bypass = Bypass.open(port: 9999)
+
+      Bypass.expect(bypass, fn conn ->
         if Agent.get_and_update(counter, &{&1, &1 + 1}) < 3 do
           send(test_pid, :pong)
-          {:error, %Tesla.Env{status: 503}}
+          Plug.Conn.send_resp(conn, 503, "oops")
         else
-          {:ok, %Tesla.Env{status: 200, body: "{}"}}
+          Plug.Conn.send_resp(conn, 200, "{}")
         end
       end)
 
-      opts = Keyword.put(base_opts, :retry, max_retries: 3)
+      opts = Keyword.put(base_opts, :__internal_tesla_adapter__, Tesla.Adapter.Httpc)
+      opts = Keyword.put(opts, :retry, max_retries: 3)
       {:ok, opts} = GoogleApiClient.init(opts)
       assert GoogleApiClient.acknowledge(["1", "2"], opts) == :ok

and you're totally right, the retry logic was not executed for http errors. It would only be triggered for socket errors like :nxdomain and such.

The fix is simply:

diff --git a/lib/broadway_cloud_pub_sub/google_api_client.ex b/lib/broadway_cloud_pub_sub/google_api_client.ex
index d86aa1d..8784e68 100644
--- a/lib/broadway_cloud_pub_sub/google_api_client.ex
+++ b/lib/broadway_cloud_pub_sub/google_api_client.ex
@@ -55,7 +55,7 @@ defmodule BroadwayCloudPubSub.GoogleApiClient do
     %{client | adapter: adapter, pre: client.pre ++ pre}
   end
 
-  defp should_retry?({:error, %{status: code}}), do: code in @retry_codes
+  defp should_retry?({:ok, %{status: code}}), do: code in @retry_codes
   defp should_retry?({:error, _reason}), do: true
   defp should_retry?(_other), do: false

@wronfim if you, or anyone else, would like to submit the fix along with the proper test setup with bypass (which doesn't require hardcoding the test url like I did) that'd be very appreciated.

Good catch! I'm on it.

@SophisticaSean meanwhile you can implement your own should_retry?/1 function and pass it to the retry opts.

https://github.com/dashbitco/broadway_cloud_pub_sub/blob/master/lib/broadway_cloud_pub_sub/google_api_client.ex#L84-L87

  defp should_retry?({:error, %{status: code}}), do: code in @retry_codes
  defp should_retry?({:ok, %{status: code}}), do: code in @retry_codes
  defp should_retry?({:error, _reason}), do: true
  defp should_retry?(_other), do: false

Thanks for getting to this so quickly! I'm excited to try the new changes soon. :)

Closing in favour of PR #60