will/crystal-pg

Support for query pipelining?

vlazar opened this issue · 13 comments

The recently released Round 17 of the TechEmpower Framework Benchmarks includes an interesting change:

https://www.techempower.com/blog/2018/10/30/framework-benchmarks-round-17/

As you review Round 17 results, you'll notice that Postgres database tests are stratified—there are groups of test implementations that seem implausibly faster than other test implementations.

The underlying cause of this is use of a Postgres protocol feature we have characterized as "query pipelining" because it is conceptually similar to HTTP pipelining. We call it pipelining, but you could also call it multiplexing. It's a feature of the "Extended Query" protocol in Postgres. Query pipelining allows database clients to send multiple queries without needing to wait for each response before sending the next. It's similar to batching but provided invisibly in the driver.
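For a concrete picture of what this means for a driver, here is a minimal sketch of what the extended query protocol permits. The write_*/read_* helpers, socket, and queries are hypothetical placeholders, not any driver's actual API: several Parse/Bind/Execute sequences go out before any result is read, a single Sync delimits the batch, and the responses are then read back in send order.

queries.each do |sql|
  write_parse socket, sql  # Parse: send the statement (hypothetical helper)
  write_bind socket        # Bind: bind parameters into a portal
  write_execute socket     # Execute: run the portal
end
write_sync socket          # Sync: delimit the batch
socket.flush               # one network round trip for the whole batch
queries.each { read_query_result socket }  # responses arrive in send order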

will commented

This would be fantastic. Right now the DB connection is the only thing in my Crystal webapp that feels a bit slow.

Would love to work with someone on this.

While looking for documentation, I found the following explanation and implementation:

Also, has anyone considered binding LibPQ instead? https://www.postgresql.org/docs/9.5/libpq.html

It seems to have solved all of these problems already. We could bind to the library and create a Crystal-DB interface to it. Thoughts?
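For reference, a binding could start with something like the sketch below. The fun declarations map to real libpq functions (PQconnectdb, PQexec, PQgetvalue, PQclear, PQfinish); everything else about the shape of the wrapper is only illustrative.

@[Link("pq")]
lib LibPQ
  type PGconn = Void*
  type PGresult = Void*

  fun connectdb = PQconnectdb(conninfo : LibC::Char*) : PGconn
  fun exec = PQexec(conn : PGconn, command : LibC::Char*) : PGresult
  fun getvalue = PQgetvalue(res : PGresult, row : LibC::Int, col : LibC::Int) : LibC::Char*
  fun clear = PQclear(res : PGresult)
  fun finish = PQfinish(conn : PGconn)
end

conn = LibPQ.connectdb("postgres:///test")
res = LibPQ.exec(conn, "SELECT 42")
puts String.new(LibPQ.getvalue(res, 0, 0))  # => 42
LibPQ.clear(res)
LibPQ.finish(conn)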

will commented

I think the community can benefit from a native Crystal implementation, too.

Just posted #253 to support query pipelining, if y'all could please have a look. I've been evaluating TimescaleDB for a side project and being able to pipeline INSERT queries for datapoints would be fantastic.

@jgaskins Looks nice. Regarding your questions about the API design, perhaps it makes sense to focus on a very low-level API at first. That makes the feature usable, but you need to know what you're doing. Later, a higher-level API could be built on top of it.

For a high-level API, I think it could be useful to combine the query and the result interpretation into a promise object. With that, there's not even a need to wrap everything in a pipeline block; promises could even be interleaved with direct queries on the same connection.

query1 = connection.pipeline.query_one "SELECT 42", as: Int32
query2 = connection.pipeline.query_all "SELECT * FROM posts LIMIT $1", 10, as: Post

answer = query1.result
posts = query2.result

Behind the scenes, each pipeline query call creates a pipeline promise containing the query statement and information about the expected result. As soon as the first result method is called, it triggers a pipeline commit: all queries waiting in the pipeline are sent, then the responses are parsed and the values placed into their promise objects. The second result call returns immediately because its value has already been retrieved.
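Here's a self-contained toy (no real I/O; all names are made up) that illustrates those commit-on-first-result semantics. The blocks stand in for sending a query and parsing its response:

class ToyPipeline
  @queued = [] of Proc(Nil)
  @committed = false

  # Queue a unit of work; stands in for buffering a query.
  def enqueue(&block : ->)
    @queued << block
  end

  # Stand-in for the commit: send every queued query, then parse all
  # responses in order. Only runs once.
  def commit
    return if @committed
    @committed = true
    @queued.each &.call
  end
end

class ToyPromise(T)
  @value : T?

  def initialize(@pipeline : ToyPipeline, &block : -> T)
    @pipeline.enqueue { @value = block.call }
  end

  # The first result call triggers the pipeline commit; later calls
  # return the already-cached value immediately.
  def result : T
    @pipeline.commit
    @value.not_nil!
  end
end

pipeline = ToyPipeline.new
answer = ToyPromise(Int32).new(pipeline) { 42 }
greeting = ToyPromise(String).new(pipeline) { "hello" }
puts answer.result   # commits the whole pipeline
puts greeting.result # already resolved, returns immediately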

@straight-shoota That was the first thing I tried. I wanted the pipeline to use the DB::QueryMethods API. For multiple reasons, it doesn’t work with the Crystal type system.

How so? I would think this should be perfectly representable in Crystal's type system.

I thought it would be, too, until I tried it and realized that returning objects of a specific type is easy but storing them in memory for processing later is not.
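To make that concrete, here's a contrived sketch (names hypothetical, not from #253): each pending query is generic over its result type, so there is no single collection type in which a pipeline can queue a mix of them without erasing T.

struct PendingQuery(T)
  getter sql : String

  def initialize(@sql : String, @as_type : T.class)
  end
end

pending = [] of PendingQuery(Int32)
pending << PendingQuery(Int32).new("SELECT 42", Int32)
# pending << PendingQuery(String).new("SELECT 'hi'", String)
# compile error: can't push a PendingQuery(String) into Array(PendingQuery(Int32))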

Feel free to try it out if you're curious, but it's a moot point anyway. That's not how query pipelines are used in practice, and interleaving pipelined and immediate queries like you suggested would increase cognitive load and be prone to bugs. For example, consider this:

posts = pg.pipeline.query_one "...", as: Post
comments = pg.pipeline.query_all "...", as: Comment
author = pg.query_one "...", as: User
reactions = pg.pipeline.query_all "...", as: Reaction

posts = posts.result
comments = comments.result
reactions = reactions.result

If you send the first two queries before the third, this code breaks. If you don't, it'd be very confusing while debugging why the queries appear out of order. The pipeline block in #253 communicates a clear start and end point for the pipelined queries to anyone reading the code.

Even if the queries are not being interleaved like that, the structure provided by the block is still better than without it:

posts = pg.pipeline.query "..."
comments = pg.pipeline.query "..."
author = pg.pipeline.query "..."
reactions = pg.pipeline.query "..."

posts = posts.read_one(Post)
comments = comments.read_all(Comment)
author = author.read_one(User)
reactions = reactions.read_all(Reaction)

# vs

rs = pg.pipeline do |pipe|
  pipe.query "..."
  pipe.query "..."
  pipe.query "..."
  pipe.query "..."
end

posts = rs.read_one(Post)
comments = rs.read_all(Comment)
author = rs.read_one(User)
reactions = rs.read_all(Reaction)

The only difference between these two is promises vs a block, and the second is still significantly easier to grok.

My concern about a structure where the query and reading its result are completely separate is that you have to make sure both happen in the same order. But this probably doesn't matter too much; it's already great to have a functioning lower-level API. For such an advanced feature, this might actually be good enough. You just have to make sure to use it properly.

My concern about a structure where the query and reading its result are completely separate is that you have to make sure both happen in the same order.

I hear you on this, but a query pipeline is effectively a channel: results come back in the order the queries were sent, so the ordering requirement is inherent to the model. Thinking about the problem in a different way is not a bad thing, but while a bespoke async I/O implementation with promises can be built on top of a pipeline, it is not itself a pipeline.
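The analogy is easy to see in miniature with plain Crystal, no database involved: values come out of a Channel strictly in the order they were sent, which is exactly the ordering contract a pipeline gives you.

channel = Channel(String).new(4)

spawn do
  ["posts", "comments", "reactions"].each { |q| channel.send q }
  channel.close
end

while result = channel.receive?
  puts result  # always prints in send order: posts, comments, reactions
end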

For such an advanced feature, this might actually be good enough

Agreed that it's pretty advanced and will likely rarely be used. I imagine most folks use ORMs for most DB interactions and will only dip into an API like this when they really need the performance.