Suggestion for handling write-heavy tables

Question

Suggestion for handling write-heavy tables

jberkus opened this issue 7 years ago · 7 comments

You can add support for omitting write-heavy tables automatically if you just take a simplified ratio approach:

Have the user supply a ratio, with a default of "3.0:1.0"
Compare the total number of read to the number of writes on the table using pg_stat_user_tables.
If reads:writes is less than the ratio, do not index.

Yes, it would be ideal to use some cool heuristic which compared the magnitude of the speed-up in the read queries to the estimated cost in the write queries, but that would be really hard to develop. Whereas the ratio approach would be easy, and cover 70% of cases.

(BTW, here's an outline of the more sophisticated approach: you calculate out the total estimated time saved by the read query improvement based on the logs. Then you estimate extra time added to write queries based on an estimated "write cost" (complicated, but doable; needs to factor in the number of columns being indexed, whether those had an index before, size of data types, etc.). Then you compare both, and add the index if the savings on reads is 1.5X the write cost or greater.)

Answer 1 · 2017-07-28T04:16:33.000Z

Hey Josh, honored to have your input on a project. Huge thanks for your contributions to Postgres!

I really like the simplicity of this idea. For the case when stats are reset, we can probably wait for a period of time afterwards. I'll do some investigation into practical examples to see how things look.

And while I have your attention: mind if I include your index bloat query in PgHero? It's really helped us at Instacart.

Answer 2 · 2017-08-01T00:58:32.000Z

Go for it, that's why I made it a public GiST. Check for the potential improvements in the comments, though.

Answer 3 · 2017-12-21T17:47:00.000Z

I'm not sure this feature would be helpful. It is easy enough to figure out the difference between reads and writes and say "Therefore, I don't want to index this table".

But that's not nearly the only factor. It's not all about total overall query time (writes + reads), but things like page response time. You may still want really fast read queries even with tons of writes.

Looking at performance at our company both in OLTP and reporting needs, this idea hasn't ever been really significant for us. It's more significant which read queries people are running and how often, and their needs. Usually very write-heavy tables are segregated, and not even read from.

Just FWIW.

Answer 4 · 2017-12-21T19:06:08.000Z

@jfinzel workloads vary. For some users write-heavy tables are segregated. But there's a lot of databases out there which consist, essentially, of One Big Table which is both write and read heavy. Index bloat and the resulting vacuum activity can be a large burden on such databases, so simply ignoring the write activity doesn't help those users.

Answer 5 · 2017-12-21T19:12:57.000Z

I'm not suggesting ignoring write activity, but I am suggesting that I hesitate to ignore read activity. For situations like you describe, I think the right solution is to be very selective about which indexes you build, but to still support all read queries that you have.

If you really don't need super fast read response time, things like BRIN indexes can help you solve both problems perhaps better.

I'm just not sure this is a problem that this tool needs to solve.

Answer 6 · 2017-12-21T19:19:33.000Z

Well, AFAIK, nobody has written any code to this suggestion since I made it, so this is a rather academic discussion.

However, as I understand it, the goal of Dexter is to automate indexing for 80% of users (or 80% of indexes). Ignoring write activity, or making all high-write tables off-limits, doesn't fulfill that goal.

Answer 7 · 2017-12-21T19:47:33.000Z

However, this does lead me to think that there is a 2nd heuristic which could be applied without being more complicated.

for each query we already know the difference in read cost per HypoPG.
compare this difference in cost against the following formula:

rc1 - rc > F * ( wt/rt )

Where:

rc1 = read cost with hypo index
rc = read cost without hypo index
F = write factor, a user-supplied multiple, defaults to 2
wt = table write count (overall)
rt = table read count (overall)

This means that very beneficial indexes would be added, even on high-write tables, but marginally beneficial indexes would not be.