fatkodima/sidekiq-iteration

Using (id = x) or (id < x) is slow compared to (id <= x)

Closed this issue · 17 comments

Hi there!

We have a very large table (over 2,000,000,000 rows) and found the following:

# explain analyze SELECT "paper_trail"."versions"."id" FROM "paper_trail"."versions" WHERE ((id = 278548164) OR (id < 278548164)) ORDER BY "paper_trail"."versions"."id" DESC LIMIT 10000;
                                                                              QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.57..398.87 rows=10000 width=8) (actual time=39187.250..39503.535 rows=10000 loops=1)
   ->  Index Only Scan Backward using versions_pkey on versions  (cost=0.57..9798836.89 rows=246018089 width=8) (actual time=39187.249..39502.581 rows=10000 loops=1)
         Filter: ((id = 278548164) OR (id < 278548164))
         Rows Removed by Filter: 2130806
         Heap Fetches: 425110
 Planning Time: 0.108 ms
 Execution Time: 39504.187 ms
(7 rows)

# explain analyze SELECT "paper_trail"."versions"."id" FROM "paper_trail"."versions" WHERE ((id = 278548164) OR (id < 278548164)) ORDER BY "paper_trail"."versions"."id" DESC LIMIT 10000;
                                                                              QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.57..405.30 rows=10000 width=8) (actual time=46282.322..47514.757 rows=10000 loops=1)
   ->  Index Only Scan Backward using versions_pkey on versions  (cost=0.57..9902842.46 rows=244675452 width=8) (actual time=46282.321..47513.382 rows=10000 loops=1)
         Filter: ((id = 278548164) OR (id < 278548164))
         Rows Removed by Filter: 2132216
         Heap Fetches: 492759
 Planning Time: 0.707 ms
 Execution Time: 47515.678 ms
(7 rows)

However, if I use <= instead, it is quite fast; Postgres pushes the predicate into the Index Cond instead of applying it as a Filter:

# explain analyze SELECT "paper_trail"."versions"."id" FROM "paper_trail"."versions" WHERE (id <= 278548164) ORDER BY "paper_trail"."versions"."id" DESC LIMIT 10000;
                                                                          QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.57..385.22 rows=10000 width=8) (actual time=0.011..5.137 rows=10000 loops=1)
   ->  Index Only Scan Backward using versions_pkey on versions  (cost=0.57..9417311.25 rows=244829902 width=8) (actual time=0.010..4.455 rows=10000 loops=1)
         Index Cond: (id <= 278548164)
         Heap Fetches: 3530
 Planning Time: 0.111 ms
 Execution Time: 5.579 ms
(6 rows)

# explain analyze SELECT "paper_trail"."versions"."id" FROM "paper_trail"."versions" WHERE (id <= 278548164) ORDER BY "paper_trail"."versions"."id" DESC LIMIT 10000;
                                                                          QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.57..385.22 rows=10000 width=8) (actual time=0.010..5.532 rows=10000 loops=1)
   ->  Index Only Scan Backward using versions_pkey on versions  (cost=0.57..9417311.25 rows=244829902 width=8) (actual time=0.009..4.848 rows=10000 loops=1)
         Index Cond: (id <= 278548164)
         Heap Fetches: 3530
 Planning Time: 0.083 ms
 Execution Time: 5.959 ms
(6 rows)

The code calling iteration is quite simple:

    def build_enumerator(cursor:)
      active_record_relations_enumerator(
        MyModel,
        cursor:,
        batch_size: 10_000,
        order: :desc,
        columns: [:id]
      )
    end
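
For context, a minimal sketch of the full job around this method, assuming sidekiq-iteration's documented Iteration module; the class name and the each_iteration body are placeholders, not the real job:

    # Hypothetical job wrapping the build_enumerator above.
    class IterateVersionsJob
      include Sidekiq::Job
      include SidekiqIteration::Iteration

      def build_enumerator(cursor:)
        active_record_relations_enumerator(
          MyModel,
          cursor:,
          batch_size: 10_000,
          order: :desc,
          columns: [:id]
        )
      end

      def each_iteration(relation)
        relation.each(&:touch) # placeholder per-batch work
      end
    end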

Why are we doing (id = x) OR (id < x) instead of (id <= x)?

Forgot to mention: Postgres 15.4.

Current workaround:

    def build_enumerator(cursor:)
      Enumerator.new do |yielder|
        MyModel.in_batches(of: 10_000, order: :desc, start: cursor) do |relation|
          yielder.yield(relation, relation.minimum(:id))
        end
      end
    end

Thank you for such good reports!
This and the other issue are already fixed on master. Can you verify that it works, so I can release a new version soon?
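
For illustration only (this is not the gem's actual internals), the difference in the generated cursor condition amounts to:

    # Before the fix: two-branch predicate, applied by Postgres as a Filter.
    MyModel.where("(id = ?) OR (id < ?)", cursor, cursor).order(id: :desc).limit(10_000)
    # After the fix: single range predicate, usable as an Index Cond.
    MyModel.where("id <= ?", cursor).order(id: :desc).limit(10_000)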

@fatkodima is there a reason to have a custom implementation like this instead of using the Rails' find_each/in_batches/find_in_batches?

The main reason was that Rails batches do not support iterating over non-primary-key columns.

Maybe it's a good idea to add support for iterating over custom columns to Rails itself.
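
For anyone landing here, a hedged sketch of what iterating over a custom column looks like with this gem (the column names are hypothetical):

    def build_enumerator(cursor:)
      active_record_relations_enumerator(
        MyModel,
        cursor:,
        batch_size: 10_000,
        # A non-primary-key column plus an id tiebreaker keeps the cursor unique.
        columns: [:updated_at, :id]
      )
    end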

[Screenshot: iowait graph, showing the drop after deploying the master version]

Definitely working better than Rails' find_each. You can see the drop in iowait; it drops exactly after we deployed the master version.

Nice! Please confirm that the other bug is also fixed, and I will release a new version then.

[Screenshot: CPU usage graph]

CPU usage is high again; not sure why. Maybe it's expected?

By the other bug, do you mean the explicit primary key column? If so, we removed it and it works as expected now.

At this point I think there's something wrong with our database; I'm checking with our provider. The gem itself seems okay now.

Released a new version. Thanks again for the reports!

We have a huge table, and neither find_each nor the sidekiq-iteration implementation was iterating fast enough (the investigation with our Postgres provider is still ongoing), but for reference, we ended up doing this:

    def build_enumerator(cursor:)
      # Start from the sequence's current value when no cursor is given.
      cursor ||= MyModel.connection.select_value(Arel::Table.new(MyModel.sequence_name).project("last_value"))

      Enumerator.new do |yielder|
        # Step backwards from the cursor down to 1, BATCH_SIZE ids at a time.
        cursor.step(1, -BATCH_SIZE) do |max|
          min = [max - BATCH_SIZE + 1, 1].max

          # id: min..max compiles to id BETWEEN min AND max.
          yielder.yield MyModel.where(id: min..max), min
        end
      end
    end

We are iterating backwards, but you could go forwards if needed. We use id BETWEEN [min] AND [max] to achieve better performance.

The gotcha is that each_iteration might get called with no records, or with fewer records than BATCH_SIZE, but we don't care about that in our use case.
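
If the empty batches did matter, a hedged sketch of guarding against them inside the same enumerator (exists? costs one extra index probe per range):

    cursor.step(1, -BATCH_SIZE) do |max|
      min = [max - BATCH_SIZE + 1, 1].max
      relation = MyModel.where(id: min..max)
      # Only yield ranges that actually contain rows.
      yielder.yield(relation, min) if relation.exists?
    end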

You can probably use use_ranges: true from Rails' batch iteration; it should be the fastest and require less custom code.
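
A minimal sketch of that option, assuming Rails >= 7.1 where in_batches accepts use_ranges (the block body is a placeholder):

    MyModel.in_batches(of: 10_000, use_ranges: true) do |relation|
      # With use_ranges, each batch is scoped by a range condition on id
      # rather than an id IN (...) list.
      relation.delete_all # placeholder work
    end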

I will probably add that option to the gem too.

When we use it like this, use_ranges defaults to true and the performance bottleneck still happens.

MyModel.in_batches(of: 10_000, order: :desc, start: cursor) do |relation|
  yielder.yield(relation, relation.minimum(:id))
end

Our provider is investigating why that is, but using BETWEEN has worked the best so far.

Yeah, BETWEEN is the simplest and fastest. The bottleneck may come from the relation.minimum(:id) call on every iteration; rails/rails#51243 may also be relevant.
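
If that per-batch query is the culprit, a hedged sketch (assuming the same Enumerator wrapper as the workaround above) that computes the minimum in memory instead:

    MyModel.in_batches(of: 10_000, order: :desc, start: cursor) do |relation|
      ids = relation.ids # one SELECT of the batch's ids
      yielder.yield(MyModel.where(id: ids), ids.min) # min computed in memory
    end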

@fatkodima just to drop a note, we discussed this in other places too, such as pluck_in_batches, etc.

For Rails, maybe we could have an object like this:

MyModel.in_batches(of: 10_000, order: :desc, start: cursor) do |relation, batch_meta_object|
  batch_meta_object.ids    # ids, from memory
  batch_meta_object.min_id # min id, from memory
  batch_meta_object.max_id # max id, from memory
end

But it's out of scope for this issue here. Let's consider this done at this point.