toluaina/pgsync

Redis bulk issue

erry-takumi opened this issue · 2 comments

PGSync version:
2.5.0

Postgres version:
PostgreSQL 13.10 on aarch64-unknown-linux-gnu, compiled by gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-6), 64-bit (AWS RDS)

Elasticsearch version: 6.8.0

Redis version:
6.2.4

Python version: python:3.8.16-alpine3.17

Problem Description:

When pgsync processes the queue, it starts by popping the number of items specified by REDIS_READ_CHUNK_SIZE. If there are more items left in the queue after this initial chunk, it should continue processing in chunks of REDIS_READ_CHUNK_SIZE until the queue is empty, but it doesn't. Instead, it waits until the queue grows by an additional REDIS_WRITE_CHUNK_SIZE before processing again, and at that point it still processes only the newest 1000 items before going back to sleep.

In simpler terms: the program should keep processing items in chunks until the queue is empty. Instead, if items are left after the first chunk, it waits for even more items to be added before it processes again.
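To make the difference concrete, here is a minimal sketch of the observed behaviour versus the expected behaviour, written against redis-py directly. The queue key, constants, and trigger logic are my guesses at a model of what we see, not pgsync's actual code.

```python
import time

import redis

# All names and values are illustrative assumptions, not pgsync internals.
REDIS_READ_CHUNK_SIZE = 1000
REDIS_WRITE_CHUNK_SIZE = 1000
REDIS_POLL_INTERVAL = 0.1
QUEUE_KEY = "pgsync:queue"  # hypothetical key name

r = redis.Redis()


def pop_chunk():
    """Pop up to REDIS_READ_CHUNK_SIZE items (LPOP with a count needs Redis >= 6.2)."""
    return r.lpop(QUEUE_KEY, REDIS_READ_CHUNK_SIZE) or []


def poll_observed(process):
    """What we observe: only wake up once the queue has grown by another
    REDIS_WRITE_CHUNK_SIZE since the last read, then pop a single chunk.
    A 120K backlog with no new writes never shrinks."""
    baseline = r.llen(QUEUE_KEY)
    while True:
        if r.llen(QUEUE_KEY) - baseline >= REDIS_WRITE_CHUNK_SIZE:
            process(pop_chunk())
            baseline = r.llen(QUEUE_KEY)
        time.sleep(REDIS_POLL_INTERVAL)


def poll_expected(process):
    """What we expected: keep popping chunks until the queue is empty,
    then sleep until new items arrive."""
    while True:
        items = pop_chunk()
        if items:
            process(items)
        else:
            time.sleep(REDIS_POLL_INTERVAL)
```

With the first variant, roughly 1000 items are read per 1000 new writes, so a large backlog can only ever stay the same size or grow.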

We have some cron jobs that can edit a lot of records in one go, so there can be thousands of changes all at once. This resulted in our Redis queue exceeding 120K records.

However, the process did not fully catch up: it processed 1000 records even when there were 10,000 changes pending, then just waited for an additional 1000 changes.

I've addressed this by setting REDIS_READ_CHUNK_SIZE to a value large enough that it consumed the whole queue at once, which didn't seem to be a problem in terms of system resources, but I think we can set it back now. As for REDIS_WRITE_CHUNK_SIZE, I've set it to 1 for now.

For a visual example:

  • Queue is 0
  • Do a bulk transaction that adds 100k records to the database
  • Observe that only the last 1000 of them are processed; the rest hang in the queue until the next batch of 1000 records is added!

We also see messages like this:

 Xlog: [132] => Db: [19,084] => Redis: [total = 9,000 pending = 116,532] => Elasticsearch: [2,950,833]...

These messages keep repeating. The Redis queue does not get smaller until another 1000 changes hit the database, so it can never catch up.
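If anyone wants to confirm the backlog independently of that log line, the pending count can be read straight out of Redis. The snippet below just lists every list-type key and its length, since the exact key pgsync uses will depend on your setup:

```python
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# The queue key pgsync uses is deployment-specific, so list all list-type
# keys; the one whose length matches the "pending" figure in the log line
# above is the sync queue.
for key in r.scan_iter(count=100):
    if r.type(key) == b"list":
        print(key.decode(), r.llen(key))
```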

Basically, steps to reproduce:

  • Leave REDIS_WRITE_CHUNK_SIZE and REDIS_READ_CHUNK_SIZE at their defaults
  • Get your Redis queue filled up with considerably more than REDIS_WRITE_CHUNK_SIZE records. Ours had about 120K records.
  • I guess this happens when there are a lot of inserts/updates to Postgres in bulk?
  • Start the process; 1000 records are processed.
  • Observe that the queue still has 119K records waiting.
  • Nothing pulls from the queue until another 1000 changes are made.
  • If another 1000 updates happen, another 1000 records will be processed, but the 119K 'debt' will still be there. If we keep making bulk updates with more than 1000 operations at a time, we will never catch up on the debt.

I hope this makes sense. A rough sketch of the kind of bulk change that triggers it is below; let me know if you want a fuller reproduction script, as I am fairly certain of how it can happen.
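For reference, the kind of bulk change that triggers it on our side looks roughly like this; the connection string, table, and column are placeholders, and any single statement touching far more rows than REDIS_WRITE_CHUNK_SIZE should behave the same way:

```python
import psycopg2

# Placeholders: point this at any table covered by the pgsync schema.
conn = psycopg2.connect("postgresql://user:password@localhost:5432/mydb")
with conn, conn.cursor() as cur:
    # One statement touching ~100k rows pushes ~100k change events into the
    # Redis queue at once, far more than the default chunk size of 1000.
    cur.execute("UPDATE my_table SET updated_at = now() WHERE id <= 100000")
conn.close()
```

After a run like that, the log line above shows the pending count stuck in the six figures until new writes trickle in.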

Error Message (if any):

None, but pgsync 'falls behind' the database.

Thanks for the very detailed analysis. This gives me enough to understand what is going on.
I will take a look at this right away.

Same thing happening to us!

@erry-takumi thank you for the write-up and the helpful temporary workaround