GoogleCloudPlatform/cloud-sql-go-connector

Connection failure due to Cloud SQL Admin API per user/minute quota

jweckschmied opened this issue · 18 comments

Hi,

We're experiencing an issue with the Go connector that I just can't make sense of.
At some point, we hit the Cloud SQL Admin API per-user-per-minute quota (180 in our case) for no apparent reason. Of course, once the quota is reached, refresh attempts stop succeeding and no connection to Cloud SQL can be established, because the calls to google.cloud.sql.v1beta4.SqlConnectService.GenerateEphemeralCert and google.cloud.sql.v1beta4.SqlConnectService.GetConnectSettings fail with a 429 error. The only way to resolve this seems to be to kill the deployment for at least one minute so that the quota resets. Then everything is fine again, but who knows for how long.
We use IAM authentication and authorization. The service account is only used for one service, so all the API calls originate from that one specific service using the cloud-sql-go-connector.

Has anyone else experienced this kind of behavior?

Thanks!

Thanks for the question @jweckschmied.

So two things:

  1. If you're hitting Admin API quota issues, that might be a reason to take a step back and look at how you're using the Go connector. Are you using connection pooling, for instance?
  2. As for the actual problem, we do return the underlying API error, so you could use errors.As to check whether the error is a RefreshError and whether it wraps a 429 status code (see the sketch below).

We could possibly make this easier with a concrete error type when a 429 occurs. I'd prefer to rely on our generated API client for those concrete errors, though.
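For example, a rough check might look like this (a minimal sketch, assuming the error chain preserves both the connector's errtype.RefreshError and the underlying googleapi.Error; adjust to whatever your wrap chain actually exposes):

	import (
		"errors"
		"net/http"

		"cloud.google.com/go/cloudsqlconn/errtype"
		"google.golang.org/api/googleapi"
	)

	// isAdminAPIQuotaError reports whether err came from a failed refresh
	// that the SQL Admin API rejected with HTTP 429.
	func isAdminAPIQuotaError(err error) bool {
		var rErr *errtype.RefreshError
		var gErr *googleapi.Error
		return errors.As(err, &rErr) &&
			errors.As(err, &gErr) &&
			gErr.Code == http.StatusTooManyRequests
	}

A caller could then back off (or alert) instead of retrying in a tight loop when this returns true.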

Thanks for the quick answer @enocom

We do use connection pooling; the way we connect is pretty much exactly the same as the sample code from this guide: https://cloud.google.com/sql/docs/postgres/samples/cloud-sql-postgres-databasesql-connect-connector
That's why I can't understand why we're hitting the quota. With about 8-10 connections on average, even if I assume a higher-than-normal refresh rate, we shouldn't be anywhere near the quota of 180/min.

As far as the errors go, I probably wasn't clear enough there. We do log the errors returned by the connector, and they are 429 status code errors (failed to get instance: Refresh error: failed to get instance metadata [...] Quota exceeded for quota metric 'Queries' and limit 'Queries per minute per user' of service 'sqladmin.googleapis.com').

A few follow up questions:

  1. How many instances are you connecting to in your app?
  2. What version of the Go connector are you using?
  3. If you're using GKE, do you see any pods in a crash backoff loop or similar?
  4. What kind of CPU usage do you see when your app hits the quota?

  1. Just one database instance.
  2. We're using v1.1.0 (but the issue already appeared on v1.0.1).
  3. Yes, pods were in a CrashLoopBackOff when they could not connect. Killing the pods one after another fixed the quota usage, but I'm worried it'll happen again because we don't really know the root cause.
  4. CPU usage went up significantly when we hit the quota (about a 7x increase).

The API usage seems to steadily rise for a while until it hits the quota, without a clear trigger/cause.

This sounds like a manifestation of #370.

Have you seen the issue on v1.1.0?

If so, would you mind providing a minimal reproduction?

I have hit exactly the same problem, and I sorted it out myself. The cause in your case may be different, but I want to share it here:

An observation I have is that the sample code in https://cloud.google.com/sql/docs/postgres/samples/cloud-sql-postgres-databasesql-connect-connector (the GCP doc) and the example in https://github.com/GoogleCloudPlatform/cloud-sql-go-connector/blob/main/README.md (GitHub) are different.

GCP version:

        config.DialFunc = func(ctx context.Context, network, instance string) (net.Conn, error) {
                if dbIAMUser != "" {
                        d, err := cloudsqlconn.NewDialer(ctx, cloudsqlconn.WithIAMAuthN())
                        if err != nil {
                                return nil, err
                        }
                        return d.Dial(ctx, instanceConnectionName)
                }
                // ... (non-IAM branch omitted in this excerpt)
        }

The GCP version is wrong, because it creates a brand-new dialer inside DialFunc and therefore calls the SQL Admin API every time a connection is needed; each new dialer starts with an empty certificate cache, so cached results are never reused.

Github version

	config, err := pgxpool.ParseConfig(dsn)
	if err != nil {
		/* handle error */
	}

	// Create a new dialer with any options
	d, err := cloudsqlconn.NewDialer(context.Background())
	if err != nil {
		/* handle error */
	}
	defer d.Close()

	// Tell the driver to use the Cloud SQL Go Connector to create connections
	config.ConnConfig.DialFunc = func(ctx context.Context, _ string, instance string) (net.Conn, error) {
		return d.Dial(ctx, "project:region:instance")
	}

The GitHub version is right, but it misses the fact that many people recommend against pgxpool (since database/sql already provides pooling), so the GitHub example is often ignored.

And adding onto this problem, I also believe the SQL Admin API actually throttles at a per-second rate rather than a per-minute rate, which makes the problem show up even more often.

An ask here: could I request that GCP fix the documentation?

Thanks @yukinying. You're right -- the docs example is wrong. We shouldn't be creating a new dialer inside the DialFunc.

I'll work on getting this fixed in our GCP docs.

Meanwhile, I'd suggest following the example in the README here.

And adding onto this problem, I also believe the SQL Admin API actually throttles at a per-second rate rather than a per-minute rate, which makes the problem show up even more often.

What makes you think it's using a per second rate limit?

What makes you think it's using a per second rate limit?

My experience is that I got throttled within the first couple of seconds (and my request rate never reached the per-minute quota) with the code shown in the GCP docs. I was able to see that via the API dashboard.

following the example in the README here

I would suggest having the README also cover an example that uses pgx, so that people are not limited to pgxpool. In particular, when I see pgxpool referenced, my first reaction is to search Google for an example that uses pgx only, and the GCP doc is the first result that shows up. That's a path you may want to keep people from taking, since the project docs will always be the most up to date, I think.
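For instance, a minimal pgx-only sketch might look like the following (assuming pgx v5 and a single connection; the DSN and instance connection name are placeholders):

	import (
		"context"
		"net"

		"cloud.google.com/go/cloudsqlconn"
		"github.com/jackc/pgx/v5"
	)

	// connect opens one pgx connection through an existing dialer. The dialer
	// should be created once per process and shared, so its certificate cache
	// is reused across connections.
	func connect(ctx context.Context, d *cloudsqlconn.Dialer) (*pgx.Conn, error) {
		// host/port in the DSN are ignored; the connector supplies the transport.
		config, err := pgx.ParseConfig("user=myuser password=mypass dbname=mydb sslmode=disable")
		if err != nil {
			return nil, err
		}
		config.DialFunc = func(ctx context.Context, _, _ string) (net.Conn, error) {
			return d.Dial(ctx, "project:region:instance")
		}
		return pgx.ConnectConfig(ctx, config)
	}

The dialer would be created with cloudsqlconn.NewDialer at startup and closed with d.Close() on shutdown, same as in the pgxpool example above.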

Sorry for hijacking this thread. Here is the evidence that the throttling is done at a per-second rate. Background: my project has a quota of 1,200 req/min (I requested the increase). In theory, as long as we never exceed 20 req/s, we should never be throttled.

Now the interesting observation is that my project starts getting 429s once it reaches a rate of 15 req/s. And if you look at the graph, you can see that we never reached 1,200 req/min; we are far below that.

[Screenshot: SQL Admin API dashboard graph showing 429 responses while the request rate stays below 1,200 req/min]

@yukinying Would you mind opening a new issue about the rate limiting? I'm happy to at least see what I can find from the backend team about this.

I created #415. Thank you.

Thanks everyone! I really appreciate the help :)

Fixed the sample problem in GoogleCloudPlatform/golang-samples#2791.

We can probably make our README clearer.

In general, the recommendation is: create a single dialer (or register the driver once) when your application starts and reuse it for all connections, rather than creating a new dialer inside DialFunc.

Just to add on to this: when using the database/sql example, we've noticed that calling the cleanup function returned by RegisterDriver seems to stop refreshes of the ephemeral certificates for the connection, resulting in a tls: bad certificate error. Sorry for hijacking this issue, but I figured it might also just be a docs issue.

You'll want to call cleanup only when you're done with the connector, because it does stop all background refreshes.
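To make the ordering concrete, here's a minimal sketch (assuming the postgres/pgxv4 driver wrapper from this repo; the driver name and DSN are placeholders). The cleanup function is deferred so it only runs at process shutdown, after the pool is no longer needed:

	import (
		"database/sql"
		"log"

		"cloud.google.com/go/cloudsqlconn/postgres/pgxv4"
	)

	func main() {
		// Register the driver once at startup. cleanup stops the connector's
		// background certificate refreshes, so hold on to it until shutdown.
		cleanup, err := pgxv4.RegisterDriver("cloudsql-postgres")
		if err != nil {
			log.Fatal(err)
		}
		defer cleanup()

		db, err := sql.Open("cloudsql-postgres",
			"host=project:region:instance user=myuser password=mypass dbname=mydb sslmode=disable")
		if err != nil {
			log.Fatal(err)
		}
		defer db.Close()

		// ... use db for the lifetime of the process ...
	}

Calling cleanup earlier (for example, right after sql.Open) is what leads to the tls: bad certificate errors described above.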

Updated the README to make all this clear.