GoogleCloudPlatform/cloud-sql-go-connector

How to avoid API rate limits with a high number of connection pools?

Closed this issue · 9 comments

Question

When I first implemented cloudsqlconn I ran into API rate-limiting issues and had to roll back my changes. I am looking for advice or tips on how best to handle the following setup:

  1. Multi-tenant environment: each tenant is a database on a Postgres Cloud SQL instance. Currently ~150 tenants, but that number is expected to keep growing.
  2. Our primary Cloud Run deployment generally has about 5 instances
  3. PGX Pool per tenant with a minimum connection count of 2

Currently, on application start, we create a new pool for each tenant that connects specifically to its database on the instance using its own specific DB user and password. This means we make at least 150 * 5 = 750 requests on a new deployment (double that, depending on whether the certs are cached). We have additional requests around the same time for one-time actions on deployment, like database migrations. So we quickly blew through the 1,000 requests/min API limit.

I understand that the connector caches the certificate information, etc., but is there any way to share that cache across connector instances/pgxpools? I'm struggling to see how we could use this tool in our current architecture, or how to change our architecture to a better configuration.

Code

No response

Additional Details

I left out any code because our usage is more or less a direct copy paste of the examples in this repo.

A few questions:

  • Can you tell me more about what kind of API rate limiting issues you're seeing?
  • Are you creating one cloudsqlconn.Dialer per process/Cloud Run instance?
  • Can you show me how you're creating pgxpools with the dialer?

Thanks for the questions!

  • The specific error I saw was:
dial error (failed to get instance: Refresh error: failed to get instance metadata 
(connection name = "project:region:instance"): googleapi: Error 429: Quota exceeded for quota metric 
'Connect Queries' and limit 'Connect Queries per minute per user per region' of 
service 'sqladmin.googleapis.com' for consumer 'project_number:123456'
  • Currently we create one dialer per tenant/pgxpool. Would it be reasonable to share the dialer across pools? That seems like it might be the solution here, if it stores the certs by instance and not by DSN (since our DSNs are different).

  • Sure! Here's a slightly sanitized version of the code we use to instantiate our pool:

import (
	"context"
	"database/sql"
	"net"

	"cloud.google.com/go/cloudsqlconn"
	"github.com/jackc/pgx/v5/pgxpool"
	"github.com/jackc/pgx/v5/stdlib"
	"github.com/pkg/errors"
)

// PgxPoolOption mutates the pool config before the pool is created.
type PgxPoolOption func(*pgxpool.Config)

func openDB(ctx context.Context, dsn, connName string, opts ...PgxPoolOption) (*sql.DB, error) {
	// create a pgxpool config and apply options
	cfg, err := pgxpool.ParseConfig(dsn)
	if err != nil {
		return nil, err
	}

	for _, opt := range opts {
		opt(cfg)
	}

	// Create a new dialer with option for private IP
	d, err := cloudsqlconn.NewDialer(ctx, cloudsqlconn.WithDefaultDialOptions(cloudsqlconn.WithPrivateIP()))
	if err != nil {
		return nil, err
	}

	// Tell the driver to use the Cloud SQL Go Connector to create connections
	cfg.ConnConfig.DialFunc = func(ctx context.Context, _ string, _ string) (net.Conn, error) {
		return d.Dial(ctx, connName)
	}

	// instantiate the pgxpool from the config
	connPool, err := pgxpool.NewWithConfig(ctx, cfg)
	if err != nil {
		return nil, errors.Wrap(err, "pgxpool.NewWithConfig failed")
	}

	// retrieve a *sql.DB from the pgxpool
	db := stdlib.OpenDBFromPool(connPool)

	return db, nil
}

This function gets called in a for loop across our tenants to build a map:

	for _, t := range tenants {
		db, err := openDB(ctx, t.DSN, connName)
		if err != nil {
			// handle the error appropriately for your app
			return err
		}
		connectionMap[t.ID] = db
	}

We then use this connection map to select the appropriate connection based on the tenant being accessed.

Currently we create one dialer per tenant/pgxpool. Would it be reasonable to share the dialer across pools? That seems like it might be the solution here, if it stores the certs by instance and not by DSN (since our DSNs are different).

Yes -- that's your best option here. The dialer caches client certificates on a per-instance basis, not a per-database basis, so sharing one dialer will immediately solve your problem. You can continue to configure your pgxpools to call through to d.Dial to keep the pool-per-tenant design.
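For concreteness, here's a minimal sketch of what the shared-dialer version could look like. It reworks openDB to accept the dialer and drops the pool options for brevity; the Tenant type, tenants slice, and connName are illustrative stand-ins for your existing setup, not part of this library:

import (
	"context"
	"database/sql"
	"net"

	"cloud.google.com/go/cloudsqlconn"
	"github.com/jackc/pgx/v5/pgxpool"
	"github.com/jackc/pgx/v5/stdlib"
)

// Tenant is a placeholder for however you model tenants today.
type Tenant struct {
	ID  string
	DSN string
}

// openDB now receives a shared dialer instead of constructing its own.
func openDB(ctx context.Context, d *cloudsqlconn.Dialer, dsn, connName string) (*sql.DB, error) {
	cfg, err := pgxpool.ParseConfig(dsn)
	if err != nil {
		return nil, err
	}

	// Every pool dials through the same dialer. Certificates are cached
	// per instance, so all tenant databases on one instance share one cert.
	cfg.ConnConfig.DialFunc = func(ctx context.Context, _ string, _ string) (net.Conn, error) {
		return d.Dial(ctx, connName)
	}

	pool, err := pgxpool.NewWithConfig(ctx, cfg)
	if err != nil {
		return nil, err
	}
	return stdlib.OpenDBFromPool(pool), nil
}

// buildConnectionMap creates ONE dialer per process and shares it across
// every tenant pool. The dialer lives for the process lifetime.
func buildConnectionMap(ctx context.Context, tenants []Tenant, connName string) (map[string]*sql.DB, error) {
	d, err := cloudsqlconn.NewDialer(ctx,
		cloudsqlconn.WithDefaultDialOptions(cloudsqlconn.WithPrivateIP()),
	)
	if err != nil {
		return nil, err
	}

	connectionMap := make(map[string]*sql.DB, len(tenants))
	for _, t := range tenants {
		db, err := openDB(ctx, d, t.DSN, connName)
		if err != nil {
			return nil, err
		}
		connectionMap[t.ID] = db
	}
	return connectionMap, nil
}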

Ah, wow, love when it's just a super easy solution. I will give this new approach a whirl in the next couple days and come back to confirm success. Thank you!

Let me know how that goes. We can explore additional options if you hit quota issues again.

Sorry, two more tangentially related questions:

  1. I noticed in another open issue someone had deferred closing the dialer, but I don't see that in the examples page. Is that something I should be doing?
  2. I also noticed in the examples sslmode is always set to disable. Is that due to fully relying on this package to handle that?
  1. Yes, but only in your main function or similar (see the sketch after this list). When you close the dialer, it stops accepting new connections and stops the background certificate refresh goroutines.
  2. Yes, exactly. Postgres drivers default to sslmode "prefer", which means they try SSL first and, if that fails, fall back to an unencrypted connection. Since the Dialer is creating the TLS connection for you, there's no reason to make the driver attempt its own SSL negotiation, so it's best to skip straight to "disable".
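Tying both answers together, a minimal sketch that reuses the shared-dialer openDB from above (the DSN values and connection name here are placeholders, not from your setup):

import (
	"context"
	"log"

	"cloud.google.com/go/cloudsqlconn"
)

func main() {
	ctx := context.Background()

	// sslmode=disable stops the Postgres driver from attempting its own TLS
	// handshake; the connector already wraps the connection in TLS.
	dsn := "user=app_user password=secret dbname=tenant_db sslmode=disable"

	// Create the dialer once and close it on shutdown: Close stops new
	// connections and the background certificate refresh goroutines.
	d, err := cloudsqlconn.NewDialer(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer d.Close()

	db, err := openDB(ctx, d, dsn, "project:region:instance")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// ... serve traffic using db ...
}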

We're live with the shared dialer and things are going much more smoothly: down to ~50 requests during a service deploy (across a variety of services), from what I think peaked at a bit over 5,000 before.

Thanks again for the help!

Happy to hear it! 👍