googleapis/python-spanner

Slowness in simple queries

mauricioribeiro opened this issue · 2 comments

Hi everyone,

I'm getting slow responses (30-100ms for simple SELECT queries) from Spanner in my FastAPI app. To confirm that network latency is not the issue, I deployed a Postgres instance (Enterprise Edition, 4 vCPUs, 26 GB memory) in the same location (europe-west1, Belgium) as the Spanner instance (2 nodes) and tested/profiled both from my local app. Here is the comparison:

Postgres (6ms/10 profiles avg):

[screenshot: Postgres profiling results]

Spanner (28ms/9 profiles avg):

[screenshot: Spanner profiling results]

I can confirm the Spanner query is executed in less than 1ms:

[screenshot: Spanner query statistics]

From what I understood while analysing the profiling comparison, the bottleneck seems to be in the execute() function.

I also saw similar slowness with the tutorial example.

Could someone please help me with this? Has anyone experienced something similar?

Hi @mauricioribeiro, are you still running into this?

Digging into the traces above:

I see that the underlying execute_sql call (which is highlighted in the second image above) takes around 1.1 milliseconds. This is the RPC that is actually executing the query on Spanner's backend.

To explain why this is higher than PostgreSQL (and why it's higher than the reported execution time in Spanner): latency with Spanner is generally a little higher than with a dedicated PostgreSQL database. This is a side effect of Spanner's design decision to be, first and foremost, a distributed database. Tracing the network path to your PostgreSQL instance is quite a reasonable thing to do -- your PG instance is a single computer (or VM) in a single datacenter somewhere, with a single IP address associated with it, and so on. But even the smallest 100-PU Spanner database has active-active replication across at least three separate zones, any one of which is equally equipped to service queries, and many Spanner databases have hardware spread around the continent or around the world. Which of those machines is best equipped to answer your query right now? In practice, answering that means more routing logic and more network hops, which adds latency.

The reported execution time above is just the time spent actually running the query inside the Spanner node in question. It omits routing overhead, which matters for performance tuning but is generally irrelevant to, for example, instance sizing.

Now, what about the remaining 20-odd milliseconds?

  • Is your test case establishing the initial connection and session to the database before, or during, the 30-100ms window? PostgreSQL uses an eagerly-established raw TCP connection -- it's relatively lightweight, but it runs into scalability problems if you want to manage hundreds or thousands of sessions from a single client, so customers often end up implementing connection pooling, which adds overhead. Spanner uses gRPC, and Spanner's client has a gRPC connection pool (plus gRPC multiplexing and a Spanner session pool) built in. That one-time setup cost is therefore hidden inside our client. However, all of this is handled by a mostly-C++ library in the background, so the first few issued statements may appear slow while the pool is spinning up.
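
This first-statement effect is easy to see in a benchmark. Here is a minimal, self-contained sketch (no real Spanner connection -- `FakeClient` is a hypothetical stand-in that simulates a client paying a one-time pool-setup cost on its first statement, roughly analogous to the gRPC channel and session-pool initialization described above):

```python
import time


class FakeClient:
    """Hypothetical stand-in for a database client with one-time pool setup.

    In the real Spanner client, the first statement pays for gRPC channel
    creation and session-pool initialization; later statements reuse the
    warm pool. Here that cost is simulated with a sleep.
    """

    def __init__(self):
        self._pool_ready = False

    def execute(self, sql):
        if not self._pool_ready:
            time.sleep(0.1)    # simulate channel + session-pool setup
            self._pool_ready = True
        time.sleep(0.001)      # simulate the query round trip itself

client = FakeClient()

timings = []
for _ in range(5):
    start = time.perf_counter()
    client.execute("SELECT 1")
    timings.append((time.perf_counter() - start) * 1000)

print(f"first call:  {timings[0]:.1f} ms")        # includes one-time setup
print(f"later calls: {min(timings[1:]):.1f} ms")  # warm pool
```

The practical consequence: if your benchmark includes the very first statement, issue a throwaway query at startup (or discard the first few samples) so you measure steady-state latency, not pool spin-up.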

  • How many times are you running the query? Compared to PostgreSQL, Spanner's query planner is fairly expensive. However, it performs query-plan caching, keyed on the raw query string. I would strongly recommend using parameterized (prepared) statements with Spanner for this reason in any production code. I'm not certain which query from the tutorial you're trying, but the first SELECT statement in the tutorial does not use prepared statements, because they're a little more complex and the tutorial tries to keep the first getting-started example as simple as possible. Moreover, for any benchmarking use case, I'd run the query a number of times beforehand in order to warm up Spanner's caches (such as the query-plan cache) in all involved datacenters.
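
Since the plan cache is keyed on the raw query string, string-interpolated queries each become a distinct cache entry, while a parameterized query reuses one. A small pure-Python illustration (the table and column names are made up for the example):

```python
# Each distinct query string is a distinct plan-cache key:
# 100 different ids -> 100 different strings -> 100 plans to compile.
interpolated = {f"SELECT name FROM singers WHERE id = {i}" for i in range(100)}

# A parameterized query is one string -> one plan, compiled once
# and reused for all 100 executions.
parameterized = {"SELECT name FROM singers WHERE id = @id" for _ in range(100)}

print(len(interpolated))   # -> 100
print(len(parameterized))  # -> 1

# With the Python Spanner client, the parameterized form looks roughly
# like this (sketch only -- requires a real database handle):
#
#   with database.snapshot() as snapshot:
#       rows = snapshot.execute_sql(
#           "SELECT name FROM singers WHERE id = @id",
#           params={"id": 1},
#           param_types={"id": spanner.param_types.INT64},
#       )
```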

  • I'm not totally clear from the traces above -- how much time is spent actually waiting on Spanner, versus in CPU on the client side? Python is not the fastest language :-) With a variety of different DBMSes, I've frequently seen a relatively small client-side code change cause the Python interpreter to spend more CPU reading the result of a query over the network than the server spent executing the whole query. Spanner's client has more client-side logic than those of many other database engines (for various reasons, partly including those discussed above), so we have occasionally seen folks running into issues in this space. We also see this in ORMs and other third-party code (which we unfortunately don't always control) that doesn't recognize Spanner and falls back to a slower default implementation. If you have a specific example where this is the problem, we'd be interested in seeing more details.
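
One quick way to separate "waiting on the server" from "burning client CPU" is to compare wall-clock time (`time.perf_counter`) against process CPU time (`time.process_time`) around the call. A self-contained sketch, with the query simulated by a sleep plus some CPU-bound decode work (both are stand-ins, not real Spanner calls):

```python
import time


def fake_query():
    """Simulate one query round trip."""
    time.sleep(0.03)                             # waiting on network/server
    total = sum(i * i for i in range(200_000))   # client-side decode work
    return total

wall_start = time.perf_counter()
cpu_start = time.process_time()
fake_query()
wall_ms = (time.perf_counter() - wall_start) * 1000
cpu_ms = (time.process_time() - cpu_start) * 1000

print(f"wall time: {wall_ms:.1f} ms")  # waiting + client CPU
print(f"cpu time:  {cpu_ms:.1f} ms")   # client-side Python work only
```

A large wall-minus-CPU gap means you're mostly waiting on Spanner; a large CPU time means the client (or an ORM's row-materialization path) is the bottleneck, which profiling tools like cProfile can then pin down to a specific function.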

No updates here in a while -- I'm going to go ahead and close this ticket. Please re-open, or open a new ticket referencing this one, if you're still running into this or you see a similar issue.