supabase/realtime

Not works with Amazon Aurora

mats16 opened this issue · 16 comments

Bug report

Describe the bug

The supabase-realtime server does not work with Amazon Aurora PostgreSQL.

To Reproduce

  1. Change parameters for replications
      parameters: {
        'rds.logical_replication': '1',
        'max_replication_slots': '5',
        'max_wal_senders': '10',
        'max_logical_replication_workers': '2',
        'max_slot_wal_keep_size': '1024',
      },
  1. Create PostgreSQL Cluster (v14.3)

  2. Launch supabase-realtime server and connect DB

  3. The following log is displayed

2022-08-02 00:41:28.315 [info] Running RealtimeWeb.Endpoint with cowboy 2.8.0 at :::4000 (http)

2022-08-02 00:41:28.316 [info] Access RealtimeWeb.Endpoint at http://localhost:4000

2022-08-02 00:42:28.341 [error] %RuntimeError{message: "operation :rollback is manually rolling back, which is not supported by Ecto.Multi"}

2022-08-02 00:42:28.377 [error] %Postgrex.Error{connection_id: 16419, message: nil, postgres: %{code: :undefined_object, file: "slot.c", line: "418", message: "replication slot \"realtime_rls\" does not exist", pg_code: "42704", routine: "ReplicationSlotAcquire", severity: "ERROR", unknown: "ERROR"}, query: nil}

Expected behavior

It works well, because it does not rely on extensions.

Screenshots

N/A

System information

  • Version of supabase-realtime: v0.22.7
  • Version of PostgreSQL: 14.3

Additional context

Add any other context about the problem here.

Thanks! I already checked it...

I had created supabase_admin as rds_superuser, but it does not work.

@mats16 can you please share your Realtime server config, if possible?

Thanks.
This is my settings.

ParameterGroup (for Amazon Aurora PostgreSQL 14):

    const parameterGroup = new rds.ParameterGroup(scope, 'ParameterGroup', {
      engine: rds.DatabaseClusterEngine.auroraPostgres({ version: rds.AuroraPostgresEngineVersion.of('14.3', '14') }),
      parameters: {
        'rds.logical_replication': '1',
        'max_replication_slots': '5', // Default Aurora:20, Supabase:5
        'max_wal_senders': '10', // Default Aurora:20, Supabase:10
        'max_logical_replication_workers': '2',
        'autovacuum_max_workers': 'GREATEST({DBInstanceClassMemory/64371566592},2)', // Default: GREATEST({DBInstanceClassMemory/64371566592},3)
        'max_parallel_workers': '4', // Default: GREATEST(${DBInstanceVCPU/2},8)

        'max_slot_wal_keep_size': '1024', // https://github.com/supabase/realtime
      },
    });

Realtime server config as docker-compose.yml:

version: '3'
services:
  realtime:
    image: supabase/realtime:v0.22.7
    ports:
      - 4000:4000
    environment:
      DB_HOST: "xxxxx.cluster-xxxxx.us-west-2.rds.amazonaws.com"
      DB_NAME: postgres
      DB_USER: supabase_admin # it is rds_superuser
      DB_PASSWORD: "xxxx"
      DB_PORT: 5432
      DB_SSL: "false"
      PORT: 4000
      JWT_SECRET: "xxx"
      REPLICATION_MODE: RLS
      REPLICATION_POLL_INTERVAL: 100
      SECURE_CHANNELS: "false"
      SLOT_NAME: realtime_rls
      TEMPORARY_SLOT: "true"
      MAX_REPLICATION_LAG_MB: 1000
    command: >
      bash -c "./prod/rel/realtime/bin/realtime eval Realtime.Release.migrate
      && ./prod/rel/realtime/bin/realtime start"

Test CDK project: https://github.com/mats16/supabase-on-aws

@mats16 can you take a look at https://github.com/supabase/realtime/blob/master/dev/postgres/mnt/00-setup.sql and make sure you have everything like a realtime schema, a publication called supabase_realtime, etc.?

@w3b6x9 My referrence is https://github.com/supabase/supabase/blob/master/docker/volumes/db/init/00-initial-schema.sql, so I have already created schema and publication.

I noticed that logs published at intervals of the value of SUBSCRIPTION_SYNC_INTERVAL.

Logs about SUBSCRIPTION_SYNC_INTERVAL: 10000

environment-realtime-1  | 2022-08-07 07:25:39.301 [info] Migrations already up
environment-realtime-1  | 2022-08-07 07:25:46.679 [info] Running RealtimeWeb.Endpoint with cowboy 2.8.0 at :::4000 (http)
environment-realtime-1  | 2022-08-07 07:25:46.680 [info] Access RealtimeWeb.Endpoint at http://localhost:4000
environment-realtime-1  | 2022-08-07 07:25:49.829 [info] tzdata release in place is from a file last modified Wed, 21 Oct 2020 18:40:20 GMT. Release file on server was last modified Wed, 16 Mar 2022 13:36:02 GMT.
environment-realtime-1  | 2022-08-07 07:25:51.029 [info] Tzdata has updated the release from 2020d to 2022a
environment-realtime-1  | 2022-08-07 07:25:56.696 [error] %RuntimeError{message: "operation :rollback is manually rolling back, which is not supported by Ecto.Multi"}
environment-realtime-1  | 2022-08-07 07:25:56.939 [error] %Postgrex.Error{connection_id: 8062, message: nil, postgres: %{code: :undefined_object, file: "slot.c", line: "418", message: "replication slot \"cloud9_realtime_rls\" does not exist", pg_code: "42704", routine: "ReplicationSlotAcquire", severity: "ERROR", unknown: "ERROR"}, query: nil}
environment-realtime-1  | 2022-08-07 07:26:06.711 [error] %RuntimeError{message: "operation :rollback is manually rolling back, which is not supported by Ecto.Multi"}
environment-realtime-1  | 2022-08-07 07:26:07.276 [error] %Postgrex.Error{connection_id: 8062, message: nil, postgres: %{code: :undefined_object, file: "slot.c", line: "418", message: "replication slot \"cloud9_realtime_rls\" does not exist", pg_code: "42704", routine: "ReplicationSlotAcquire", severity: "ERROR", unknown: "ERROR"}, query: nil}
environment-realtime-1  | 2022-08-07 07:26:16.727 [error] %RuntimeError{message: "operation :rollback is manually rolling back, which is not supported by Ecto.Multi"}
environment-realtime-1  | 2022-08-07 07:26:17.667 [error] %Postgrex.Error{connection_id: 8062, message: nil, postgres: %{code: :undefined_object, file: "slot.c", line: "418", message: "replication slot \"cloud9_realtime_rls\" does not exist", pg_code: "42704", routine: "ReplicationSlotAcquire", severity: "ERROR", unknown: "ERROR"}, query: nil}
environment-realtime-1  | 2022-08-07 07:26:26.744 [error] %RuntimeError{message: "operation :rollback is manually rolling back, which is not supported by Ecto.Multi"}
environment-realtime-1  | 2022-08-07 07:26:27.078 [error] %Postgrex.Error{connection_id: 8062, message: nil, postgres: %{code: :undefined_object, file: "slot.c", line: "418", message: "replication slot \"cloud9_realtime_rls\" does not exist", pg_code: "42704", routine: "ReplicationSlotAcquire", severity: "ERROR", unknown: "ERROR"}, query: nil}
environment-realtime-1  | 2022-08-07 07:26:36.760 [error] %RuntimeError{message: "operation :rollback is manually rolling back, which is not supported by Ecto.Multi"}
environment-realtime-1  | 2022-08-07 07:26:37.447 [error] %Postgrex.Error{connection_id: 8062, message: nil, postgres: %{code: :undefined_object, file: "slot.c", line: "418", message: "replication slot \"cloud9_realtime_rls\" does not exist", pg_code: "42704", routine: "ReplicationSlotAcquire", severity: "ERROR", unknown: "ERROR"}, query: nil}
environment-realtime-1  | 2022-08-07 07:26:46.774 [error] %RuntimeError{message: "operation :rollback is manually rolling back, which is not supported by Ecto.Multi"}
environment-realtime-1  | 2022-08-07 07:26:46.858 [error] %Postgrex.Error{connection_id: 8062, message: nil, postgres: %{code: :undefined_object, file: "slot.c", line: "418", message: "replication slot \"cloud9_realtime_rls\" does not exist", pg_code: "42704", routine: "ReplicationSlotAcquire", severity: "ERROR", unknown: "ERROR"}, query: nil}
environment-realtime-1  | 2022-08-07 07:26:56.791 [error] %RuntimeError{message: "operation :rollback is manually rolling back, which is not supported by Ecto.Multi"}
environment-realtime-1  | 2022-08-07 07:26:57.220 [error] %Postgrex.Error{connection_id: 8062, message: nil, postgres: %{code: :undefined_object, file: "slot.c", line: "418", message: "replication slot \"cloud9_realtime_rls\" does not exist", pg_code: "42704", routine: "ReplicationSlotAcquire", severity: "ERROR", unknown: "ERROR"}, query: nil}
environment-realtime-1  | 2022-08-07 07:27:06.806 [error] %RuntimeError{message: "operation :rollback is manually rolling back, which is not supported by Ecto.Multi"}
environment-realtime-1  | 2022-08-07 07:27:07.555 [error] %Postgrex.Error{connection_id: 8062, message: nil, postgres: %{code: :undefined_object, file: "slot.c", line: "418", message: "replication slot \"cloud9_realtime_rls\" does not exist", pg_code: "42704", routine: "ReplicationSlotAcquire", severity: "ERROR", unknown: "ERROR"}, query: nil}

Which query is to confirm connected subscribers and the tables?
I could not find it, also I am not good with Elixir.

@mats16 do you see a realtime.schema_migrations table in your database? If you do, what's the latest migration you see?

@w3b6x9 My latest migration version is 20220712093339.

@mats16 This should be the culprit:

def sync_subscriptions(params_list) do
.

I do want to back up and mention that SECURE_CHANNELS is required to be true when REPLICATION_MODE is RLS. If you don't care about RLS and just want to get it working for the time being you should try it with REPLICATION_MODE set to STREAM.

Otherwise, make sure that you set SECURE_CHANNELS to true, sign a JWT with your JWT secret and make sure there's a role key in your claims. The role points to any existing db role (e.g. supabase_admin). Have you added any tables to the supabase_realtime publication yet? If not, you should do that as soon as possible.

@w3b6x9 thanks.
I connected in stream mode and it worked.

I also noticed the following error about web socket connection and no record existed in realtime.subscription table.

{
  "event":"phx_reply",
  "payload":{
    "response":{
      "reason":"error occurred when joining realtime:public:messages with user token"
    },
    "status":"error"
  },
  "ref":"1",
  "topic":"realtime:public:messages"
}

It is probably error related pgjwt or permissions.

We cannot use pgjwt on RDS, so we need to install it via SQL. If functions such as verify work in extensions schema, there is no problem, right?

(Also, sorry for creating the issue as a bug.)

@mats16 ah, that's my bad. We've had a few versions of Realtime and the latest one no longer respects the SECURE_CHANNELS flag. You will have to set it to true, generate a JWT (you can use an online tool for convenience) with your JWT_SECRET (whatever xxx actually is), pass that in, and you should be fine.

Realtime doesn't use the pgjwt extension. JWT verification and validation is done on the Elixir side.

Also, sorry for creating the issue as a bug.

No worries, just want to make sure you're unblocked as soon as possible.

@w3b6x9 Thanks

I think I have found the root cause.

I confirmed that Aurora had received a query select rolname from pg_authid .
https://github.com/supabase/realtime/blob/master/server/lib/realtime/rls/subscriptions/subscriptions.ex#L52

But we can not have permission to pg_authid table on Amazon Aurora/RDS, because rds_superuser role is not real SUPERUSER role.

Is it possible to change using pg_authid to pg_roles (or pg_user)?
I feel it is appropriate for security purposes not only for Amazon Aurora. Because pg_authid contains a password. (although it is hashed)

Is it possible to change using pg_authid to pg_roles (or pg_user)?

@mats16 you're right, switching to pg_roles is better for security as it blanks the password. Would you like to do the honors and submit a PR? It should be as simple as replacing pg_authid for pg_roles.

@w3b6x9 I tried it and confirmed it works with replacing pg_authid for pg_roles. I'll create PR.

@mats16 thanks for the PR!