fatkodima/sidekiq-iteration

Using (id = x) or (id < x) is slow compared to (id <= x)

Closed this issue · 17 comments

Hi there!

We have a very large table (over 2,000,000,000 rows) and found the following:

# explain analyze SELECT "paper_trail"."versions"."id" FROM "paper_trail"."versions" WHERE ((id = 278548164) OR (id < 278548164)) ORDER BY "paper_trail"."versions"."id" DESC LIMIT 10000;
                                                                              QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.57..398.87 rows=10000 width=8) (actual time=39187.250..39503.535 rows=10000 loops=1)
   ->  Index Only Scan Backward using versions_pkey on versions  (cost=0.57..9798836.89 rows=246018089 width=8) (actual time=39187.249..39502.581 rows=10000 loops=1)
         Filter: ((id = 278548164) OR (id < 278548164))
         Rows Removed by Filter: 2130806
         Heap Fetches: 425110
 Planning Time: 0.108 ms
 Execution Time: 39504.187 ms
(7 rows)

# explain analyze SELECT "paper_trail"."versions"."id" FROM "paper_trail"."versions" WHERE ((id = 278548164) OR (id < 278548164)) ORDER BY "paper_trail"."versions"."id" DESC LIMIT 10000;
                                                                              QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.57..405.30 rows=10000 width=8) (actual time=46282.322..47514.757 rows=10000 loops=1)
   ->  Index Only Scan Backward using versions_pkey on versions  (cost=0.57..9902842.46 rows=244675452 width=8) (actual time=46282.321..47513.382 rows=10000 loops=1)
         Filter: ((id = 278548164) OR (id < 278548164))
         Rows Removed by Filter: 2132216
         Heap Fetches: 492759
 Planning Time: 0.707 ms
 Execution Time: 47515.678 ms
(7 rows)

However, if I use <= instead, it is quite fast; Postgres pushes the predicate into the Index Cond instead of applying it as a Filter:

# explain analyze SELECT "paper_trail"."versions"."id" FROM "paper_trail"."versions" WHERE (id <= 278548164) ORDER BY "paper_trail"."versions"."id" DESC LIMIT 10000;
                                                                          QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.57..385.22 rows=10000 width=8) (actual time=0.011..5.137 rows=10000 loops=1)
   ->  Index Only Scan Backward using versions_pkey on versions  (cost=0.57..9417311.25 rows=244829902 width=8) (actual time=0.010..4.455 rows=10000 loops=1)
         Index Cond: (id <= 278548164)
         Heap Fetches: 3530
 Planning Time: 0.111 ms
 Execution Time: 5.579 ms
(6 rows)

# explain analyze SELECT "paper_trail"."versions"."id" FROM "paper_trail"."versions" WHERE (id <= 278548164) ORDER BY "paper_trail"."versions"."id" DESC LIMIT 10000;
                                                                          QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.57..385.22 rows=10000 width=8) (actual time=0.010..5.532 rows=10000 loops=1)
   ->  Index Only Scan Backward using versions_pkey on versions  (cost=0.57..9417311.25 rows=244829902 width=8) (actual time=0.009..4.848 rows=10000 loops=1)
         Index Cond: (id <= 278548164)
         Heap Fetches: 3530
 Planning Time: 0.083 ms
 Execution Time: 5.959 ms
(6 rows)

The code calling iteration is quite simple:

    def build_enumerator(cursor:)
      active_record_relations_enumerator(
        MyModel,
        cursor:,
        batch_size: 10_000,
        order: :desc,
        columns: [:id]
      )
    end
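
For context, a minimal sketch of the full job around this method, assuming sidekiq-iteration's documented Iteration module; the class name and the each_iteration body are placeholders, not the real job:

    # Hypothetical job wrapping the build_enumerator above.
    class IterateVersionsJob
      include Sidekiq::Job
      include SidekiqIteration::Iteration

      def build_enumerator(cursor:)
        active_record_relations_enumerator(
          MyModel,
          cursor:,
          batch_size: 10_000,
          order: :desc,
          columns: [:id]
        )
      end

      def each_iteration(relation)
        relation.each(&:touch) # placeholder per-batch work
      end
    end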

Why are we doing (id = x) OR (id < x) instead of (id <= x)?

Forgot to mention: Postgres 15.4.

Current workaround:

    def build_enumerator(cursor:)
      Enumerator.new do |yielder|
        MyModel.in_batches(of: 10_000, order: :desc, start: cursor) do |relation|
          yielder.yield(relation, relation.minimum(:id))
        end
      end
    end

Thank you for such good reports!
This and the other issue are already fixed on master. Can you verify that it works, so I can release a new version soon?
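
For illustration only (this is not the gem's actual internals), the difference in the generated cursor condition amounts to:

    # Before the fix: two-branch predicate, applied by Postgres as a Filter.
    MyModel.where("(id = ?) OR (id < ?)", cursor, cursor).order(id: :desc).limit(10_000)
    # After the fix: single range predicate, usable as an Index Cond.
    MyModel.where("id <= ?", cursor).order(id: :desc).limit(10_000)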

@fatkodima is there a reason to have a custom implementation like this instead of using the Rails' find_each/in_batches/find_in_batches?

The main reason was that Rails batches do not support iterating over non-primary-key columns.

Maybe it's a good idea to add support for iterating over custom columns to Rails itself.
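
For anyone landing here, a hedged sketch of what iterating over a custom column looks like with this gem (the column names are hypothetical):

    def build_enumerator(cursor:)
      active_record_relations_enumerator(
        MyModel,
        cursor:,
        batch_size: 10_000,
        # A non-primary-key column plus an id tiebreaker keeps the cursor unique.
        columns: [:updated_at, :id]
      )
    end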

[Screenshot: iowait graph, showing the drop after deploying the master version]

Definitely working better than Rails' find_each. You can see the drop in iowait; it drops exactly after we deployed the master version.

Nice! Please confirm that the other bug is also fixed, and I will release a new version then.

[Screenshot: CPU usage graph]

CPU usage is high again; not sure why. Maybe it's expected?

By the other bug, do you mean the explicit primary key column? If so, we removed it and it works as expected now.

At this point I think there's something wrong with our database; I'm checking with our provider. The gem itself seems okay now.

Released a new version. Thanks again for the reports!

We have a huge table, and neither find_each nor the sidekiq-iteration implementation was iterating fast enough (the investigation with our Postgres provider is still ongoing), but for reference, we ended up doing this:

    def build_enumerator(cursor:)
      # Start from the sequence's current value when no cursor is given.
      cursor ||= MyModel.connection.select_value(Arel::Table.new(MyModel.sequence_name).project("last_value"))

      Enumerator.new do |yielder|
        # Step backwards from the cursor down to 1, BATCH_SIZE ids at a time.
        cursor.step(1, -BATCH_SIZE) do |max|
          min = [max - BATCH_SIZE + 1, 1].max

          # id: min..max compiles to id BETWEEN min AND max.
          yielder.yield MyModel.where(id: min..max), min
        end
      end
    end

We are iterating backwards, but you could go forwards if needed. We use id BETWEEN [min] AND [max] to achieve better performance.

The gotcha is that each_iteration might get called with no records, or with fewer records than BATCH_SIZE, but we don't care about that in our use case.
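
If the empty batches did matter, a hedged sketch of guarding against them inside the same enumerator (exists? costs one extra index probe per range):

    cursor.step(1, -BATCH_SIZE) do |max|
      min = [max - BATCH_SIZE + 1, 1].max
      relation = MyModel.where(id: min..max)
      # Only yield ranges that actually contain rows.
      yielder.yield(relation, min) if relation.exists?
    end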

You can probably use use_ranges: true from Rails' batch iteration; it should be the fastest and require less custom code.
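
A minimal sketch of that option, assuming Rails >= 7.1 where in_batches accepts use_ranges (the block body is a placeholder):

    MyModel.in_batches(of: 10_000, use_ranges: true) do |relation|
      # With use_ranges, each batch is scoped by a range condition on id
      # rather than an id IN (...) list.
      relation.delete_all # placeholder work
    end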

I will probably add that option to the gem too.

When we use it like this, use_ranges defaults to true and the performance bottleneck still happens.

MyModel.in_batches(of: 10_000, order: :desc, start: cursor) do |relation|
  yielder.yield(relation, relation.minimum(:id))
end

Our provider is investigating why that is, but using BETWEEN has worked the best so far.

Yeah, BETWEEN is the simplest and fastest. The bottleneck may come from the relation.minimum(:id) call on every iteration; rails/rails#51243 may also be relevant.
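
If that per-batch query is the culprit, a hedged sketch (assuming the same Enumerator wrapper as the workaround above) that computes the minimum in memory instead:

    MyModel.in_batches(of: 10_000, order: :desc, start: cursor) do |relation|
      ids = relation.ids # one SELECT of the batch's ids
      yielder.yield(MyModel.where(id: ids), ids.min) # min computed in memory
    end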

@fatkodima just to drop a note, we discussed this in other places too, such as pluck_in_batches, etc.

For Rails, maybe we could have an object like this:

MyModel.in_batches(of: 10_000, order: :desc, start: cursor) do |relation, batch_meta_object|
  batch_meta_object.ids    # ids, from memory
  batch_meta_object.min_id # min id, from memory
  batch_meta_object.max_id # max id, from memory
end

But it's out of scope for this issue here. Let's consider this done at this point.