long2ice/meilisync

RDS DB replication slot not released by meilisync

iurie-bogdanovici opened this issue · 16 comments

Hello,
I configured meilisync with meilisearch and Postgres RDS DB. Everything works ok, except that the replication slot on RDS is growing continuosely. Each 5 minutes it will increase 64MB, which is about 18GB per day. If I understand correctly, meilisync should consume data from DB replication slot, and empty it, but this doesn't happen.
I configured a maximum size value for the replication slot of 3GB (max_slot_wal_keep_size). When the replication slot reached this size it changed its status from active=true to active=false and meilisync app stopped syncing.
Could you please tell if there are some additional configurations that should be made in order to solve this issue?

@long2ice Could it be because meilisync maintains its own progress instead of advancing progress position with pg_replication_slot_advance? I could not understand the reason for progress file/redis if Postgres provides this already.

@long2ice Could you please at least confirm if meilisync is compatible/has been tested with RDS Postgres?

Sorry I don't make fully test on postgres or rds postgres, I only use it in MySQL myself.

@iurie-bogdanovici,

while attempting to configure MeiliSync with MeiliSearch and a PostgreSQL RDS DB. I'm stuck on the step of enabling logical replication.

I've come to understand that to proceed, I need to enable the wal2json extension and change the wal_level parameter to logical. However, I've encountered a roadblock as RDS does not permit direct modification of the wal_level parameter within the Parameter groups.

I'm aware that the wal2json extension is supported on platforms where PostgreSQL is installed, but I'm uncertain about the process of installing this extension on RDS! and making the necessary configuration changes.

Could you please provide guidance or share your experience on how to enable the wal2json extension and adjust the wal_level parameter to logical in a PostgreSQL RDS environment?

Your insights and assistance would be greatly appreciated.

@TEAMMEDIO
You should set rds.logical_replication parameter to 1.

These are the RDS parameters that I configured for the dev environment using Terraform:
parameters = [
{
name = "rds.force_ssl"
value = "0"
apply_method = "pending-reboot"
},
{
name = "rds.logical_replication"
value = "1"
apply_method = "pending-reboot"
},
{
name = "max_replication_slots"
value = "20"
apply_method = "pending-reboot"
},
{
name = "max_wal_senders"
value = "20"
apply_method = "pending-reboot"
},
{
name = "max_slot_wal_keep_size"
value = "3000"
apply_method = "immediate"
}
]

@iurie-bogdanovici,
how did you overcome the installation of wal2json extension in RDS?

@TEAMMEDIO I used only those parameters specified in my previous message, and then I configured meilisearch and meilisync using default settings from meilisearch website.
When I run inside the db "SELECT * FROM pg_replication_slots" command I get in the column plugin, value "wal2json". So I suppose wal2json plugin is enabled by default.
Regarding the error that you received on the mylisync side, try to use another meilisync docker image. Which image do you use? I also got some errors, but later I tried docker image "long2ice/meilisync:dev" and haven't got any errors since.

@TEAMMEDIO and you don't have issues with the RDS database replication slot growing continuousely?

@iurie-bogdanovici,
sure, I got an error indicating that the slot has exceeded its maximum reserved size, rendering it unable to receive changes.
While we're actively monitoring the slot's size for anomalies, we're considering implementing a temporary solution. This involves creating a cron job on the EC2 instance to shut down Meilisync, delete the replication slot from the database, and then restart Meilisync!!!

Anyone has got it fully working yet?

I believe this is because meilisync is not confirming every processed message by calling send_feedback() method on the corresponding replication cursor.
See the docs for consume_stream here where they warn:

Warning: When using replication with slots, failure to constantly consume and report success to the server appropriately can eventually lead to “disk full” condition on the server, because the server retains all the WAL segments that might be needed to stream the changes via all of the currently open replication slots.

@MattExact Thanks. Now the issue is partially solved. Each time a message is processed the replication slot is consumed. The only issue is that if for example there is no activity for the RDS db (in our case is like this for dev env), the replication slot will grow. Is it possible to consume replication slot periodically (once an hour) if there is no activity?
For example for another similar application called "pgsync" this is implemented as an environment variable REPLICATION_SLOT_CLEANUP_INTERVAL (controls the interval in which the replication slot is cleaned up. The default value is every hour).

@MattExact Thanks. Now the issue is partially solved. Each time a message is processed the replication slot is consumed. The only issue is that if for example there is no activity for the RDS db (in our case is like this for dev env), the replication slot will grow.

@iurie-bogdanovici Ah that's interesting, from my understanding of the psycopg2 documentation, clearing at an interval is not necessary.
But I wonder if the problem is the early return if there are no changes, meaning there could be many replication messages that are not being acknowledged on an idle database.

If we change

changes = payload.get("change")
if not changes:
return

to something like the following (and fix the silly indentation error I made for msg.cursor.send_feedback) then I think it might fully solve the problem.

        changes = payload.get("change", [])

@MattExact Big thanks to you. Now is working as expected, the replication slot on RDS is not increasing in size even if there is no activity on the db.