postgresml/pgcat

select nextval() goes to read replica

Opened this issue · 12 comments

Describe the bug
Hi the nextval() increase the value in the psql sequence. But the thing is, eve within transaction, built in query parser handle the select nextval() and forward it to read replica. Can we make query whitelist/blacklist like pgpool, or can we fix the nextval should be redirect to write replica under read replica?

If I want to add additional logic in query parser, which file should I look into?

Thank you

To Reproduce
Steps to reproduce the behavior:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots
If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):

  • OS: [e.g. iOS]
  • Browser [e.g. chrome, safari]
  • Version [e.g. 22]

Smartphone (please complete the following information):

  • Device: [e.g. iPhone6]
  • OS: [e.g. iOS8.1]
  • Browser [e.g. stock browser, safari]
  • Version [e.g. 22]

Additional context
Add any other context about the problem here.

levkk commented

Take a look at the query router. It parses the SQL into an AST using the sqlparser crate. You should be able to identify the called functions and adjust the desired server role accordingly.

Having a list of queries is not going to work long term since just adding a new query to the app will break the behavior. I'm open to other ideas though.

If you want a quick fix, open a transaction before executing your primary-only query. All transactions are routed to the primary automatically.

Take a look at the query router. It parses the SQL into an AST using the sqlparser crate. You should be able to identify the called functions and adjust the desired server role accordingly.

Having a list of queries is not going to work long term since just adding a new query to the app will break the behavior. I'm open to other ideas though.

If you want a quick fix, open a transaction before executing your primary-only query. All transactions are routed to the primary automatically.

Hi levkk,
I already open a transaction but it still pointing to replica somehow (if randomly get the connection from read replica).

pgcat log

root@doh-pgcat-1:~/pgcat# [2023-06-25T00:25:21.781783Z INFO  pgcat::client] Client 103.171.147.127:31380 connected (plain)
[2023-06-25T00:25:21.838753Z WARN  pgcat::server] Server returned with session state altered, discarding state (SET: true, PREPARE: false) for application pgcat
[2023-06-25T00:25:21.871063Z INFO  pgcat::plugins::query_logger] [pool: postgres][user: doh-prod] SET application_name = 'PostgreSQL JDBC Driver'
[2023-06-25T00:25:21.871856Z WARN  pgcat::server] Server returned with session state altered, discarding state (SET: true, PREPARE: false) for application pgcat
[2023-06-25T00:25:21.931223Z INFO  pgcat::plugins::query_logger] [pool: postgres][user: doh-prod] SET SESSION CHARACTERISTICS AS TRANSACTION ISOLATION LEVEL READ COMMITTED
[2023-06-25T00:25:21.932030Z WARN  pgcat::server] Server returned with session state altered, discarding state (SET: true, PREPARE: false) for application pgcat
[2023-06-25T00:25:21.972311Z INFO  pgcat::plugins::query_logger] [pool: postgres][user: doh-prod] START TRANSACTION
[2023-06-25T00:25:21.972463Z INFO  pgcat::plugins::query_logger] [pool: postgres][user: doh-prod] SELECT id, created_by, created_on, nama_lengkap, nip, ktp, nrp, jenis, email, password, encrypted_username_ppk, jabatan, alamat, golongan, nomor_telepon, update_by, update_on, username, role, id_satker, is_deleted, nosk, id_kldi, pa, username_lpse, is_verified, is_email_valid, id_client, role_2, id_lpse, peg_id FROM pengguna WHERE username = $1
[2023-06-25T00:25:22.028404Z INFO  pgcat::plugins::query_logger] [pool: postgres][user: doh-prod] SELECT nextval('pengguna_id_seq')
[2023-06-25T00:25:22.534638Z INFO  pgcat::plugins::query_logger] [pool: postgres][user: doh-prod] ROLLBACK
[2023-06-25T00:25:22.561718Z INFO  pgcat] Client 103.171.147.127:31380 disconnected, session duration: 0d 00:00:00.837

pgcat.toml:

[general]
host = "0.0.0.0"
port = 5432
admin_username = "doh"
admin_password = "doh"
server_lifetime = 3600000
prepared_statements = false
log_client_connections = false
log_client_disconnections = false
worker_threads = 4
[pools.doh-prod]
query_parser_enabled = true
pool_mode = "transaction"
load_balancing_mode = "random"
default_role = "any"
primary_reads_enabled = true
idle_timeout = 40000
connect_timeout = 3000
[plugins]
[plugins.query_logger]
enabled = false
[pools.doh-prod.plugins.query_logger]
enabled = false
shutdown_timeout = 5000 # milliseconds
[pools.doh-prod.users.0]
username = "doh"
password = "doh"
pool_size = 50
min_pool_size = 5
pool_mode = "transaction"
server_lifetime = 60000
statement_timeout = 0

[pools.doh-prod.shards.0]
servers = [
    [ "xx.xx.xx.xx", 5432, "primary" ],
]
database = "doh-prod"

I didnt use any app connection pool whatsoever in my java application. I only use java.sql.Driver

this.driverProperties.put("user", this.driverProperties.getProperty("user", username));
this.driverProperties.put("password", this.driverProperties.getProperty("password", password));
  
    public Connection getConnection() throws SQLException {
        return this.driver.connect(this.jdbcUrl, this.driverProperties);
    }

But if somehow it randomly fetch the connection from the primary, it wont be an issue.
Is there something I missed in the configuration?

I put the logs here
https://github.com/haposan06/logs/blob/main/pgcat.log

Its really weird, on line 76, when it "START TRANSACTION" it already fetch from the correct pool which is from primary.
But on line 254, while the query_parser_enabled set to true,
it fetched from the read replica on line 258.

Is there any bug related to this one?

levkk commented

Something weird is definitely going on. Do you think you could submit a PR with a simple Java app that I can use to reproduce this? We can use that as part of our test suite to fix this and also to make sure we don't hit this bug in the future in case we change something.

You can put the code for this in tests/java.

My first guess is the Java client is doing something different from the Ruby/Python/Rust clients we've tested so far. My second guess is we have a bug, but I'm not sure where it could be just yet without a reproducible test case.

Something weird is definitely going on. Do you think you could submit a PR with a simple Java app that I can use to reproduce this? We can use that as part of our test suite to fix this and also to make sure we don't hit this bug in the future in case we change something.

You can put the code for this in tests/java.

My first guess is the Java client is doing something different from the Ruby/Python/Rust clients we've tested so far. My second guess is we have a bug, but I'm not sure where it could be just yet without a reproducible test case.

Noted. Will submit PR once I finished it.
By the way, can we put little tutorial on readme.md on how to update the query_parser to allow some query to specific instance?

levkk commented

We can but this functionally doesn't exist, yet.

@levkk
i notice that pgcat has something like this

/* shard_id: 5 */ SELECT * FROM foo WHERE id = 1234;

does the "with comments" feature support to select query redirection based on comments?

/* server: primary */ SELECT * FROM foo WHERE id = 1234;

levkk commented

Not at the moment, but we can add that in if needed: https://github.com/postgresml/pgcat/blob/main/src/query_router.rs#L149

After did some quick debugging, it seems the query_router.infer() seems never been called in client.rs. So the query_router.active_role still holds the default value which is None. So when it executes
let connection = match pool.get(query_router.shard(), query_router.role(), &self.stats)
its randomly pick the available connection from any pool. I also checked the self.transaction_mode already true.

Also, the message buffer is quite strange. The first loop in client.rs it read D, E and S. Because there is no P, or Q evaluated for the "Select nextval()", the query_route.active_role is not changed. Then in the second loop in client.rs also has S which I didnt understand where does that came from.

[2023-07-02T18:26:33.732243Z DEBUG pgcat::client] debug 1st loop - message[0] as char - value D
[2023-07-02T18:26:33.732253Z DEBUG pgcat::client] Anonymous describe
[2023-07-02T18:26:33.732268Z DEBUG pgcat::client] debug 1st loop - message[0] as char - value E
[2023-07-02T18:26:33.732275Z DEBUG pgcat::client] debug 1st loop - message[0] as char - value S
[2023-07-02T18:26:33.732286Z DEBUG pgcat::client] Waiting for connection from pool
[2023-07-02T18:26:33.732294Z DEBUG pgcat::client] debug check role None
[2023-07-02T18:26:33.732985Z DEBUG pgcat::client] Got connection from pool
[2023-07-02T18:26:33.733009Z DEBUG pgcat::client] Client 114.125.88.132:51882 talking to server Address { id: 0, host: "10.10.96.5", port: 5432, shard: 0, database: "sirup-lat", role: Primary, replica_number: 0, address_index: 0, username: "postgres", pool_name: "sirup-prod", mirrors: [], stats: AddressStats { total: AddressStatFields { xact_count: 17, query_count: 21, bytes_received: 12938, bytes_sent: 4866, xact_time: 0, query_time: 296, wait_time: 4447, errors: 0 }, current: AddressStatFields { xact_count: 2, query_count: 2, bytes_received: 231, bytes_sent: 281, xact_time: 0, query_time: 0, wait_time: 3858, errors: 0 }, averages: AddressStatFields { xact_count: 0, query_count: 0, bytes_received: 0, bytes_sent: 0, xact_time: 0, query_time: 0, wait_time: 0, errors: 0 }, averages_updated: false } }
[2023-07-02T18:26:33.733073Z DEBUG pgcat::client] Prepared statement active: None
[2023-07-02T18:26:33.733098Z DEBUG pgcat::client] debug 2nd loop - code / message[0] - value S
[2023-07-02T18:26:33.733110Z DEBUG pgcat::client] Sending query to server
[2023-07-02T18:26:33.733119Z DEBUG pgcat::client] Sending S to server
[2023-07-02T18:26:33.735358Z DEBUG pgcat::server] Server in transaction: true
[2023-07-02T18:26:33.735378Z DEBUG pgcat::client] Prepared statement active: None
[2023-07-02T18:26:33.818311Z DEBUG pgcat::client] debug 2nd loop - code / message[0] - value P
[2023-07-02T18:26:33.818335Z DEBUG pgcat::query_router] Using pool settings, query_parser_enabled: true
[2023-07-02T18:26:33.818339Z DEBUG pgcat::client] debuggJOP1 true - true
[2023-07-02T18:26:33.818349Z DEBUG pgcat::client] Anonymous prepared statement
[2023-07-02T18:26:33.818368Z DEBUG pgcat::query_router] Using pool settings, query_parser_enabled: true
[2023-07-02T18:26:33.818380Z DEBUG pgcat::query_router] Prepared statement: ' SELECT nextval('pengguna_id_seq') '
[2023-07-02T18:26:33.818392Z DEBUG sqlparser::parser] Parsing sql ' SELECT nextval('pengguna_id_seq') '...
[2023-07-02T18:26:33.818426Z DEBUG sqlparser::parser] parsing expr
[2023-07-02T18:26:33.818444Z DEBUG sqlparser::parser] parsing expr
[2023-07-02T18:26:33.818458Z DEBUG sqlparser::parser] prefix: Value(SingleQuotedString("pengguna_id_seq"))
[2023-07-02T18:26:33.818481Z DEBUG sqlparser::parser] get_next_precedence() TokenWithLocation { token: RParen, location: Location { line: 0, column: 0 } }

When it looped to "S", it ending the first loop with default match

Or is there any configuration I can set to change the default query_router.active_role to Primary? I already set pools.[db].default_role to "primary" but it seems no effect (the default value for active_role is still None)

mbanck commented

Regarding the initial issue (SELECT nextval() being routed to a standby), note that pgpool checks if a function that is called in a SELECT query is has Volatility=volatile or not and decides whether to route the query to the primary or a standby based on that:
https://git.postgresql.org/gitweb/?p=pgpool2.git;a=blob;f=src/utils/pool_select_walker.c;h=27e745ba473d19a50f91f92aff17b4eb5c2aa94c;hb=HEAD#l385
https://pgpool.net/mediawiki/index.php/Pgpool-II_4.2_development
https://www.pgpool.net/pipermail/pgpool-hackers/2019-September/003438.html

After did some quick debugging, it seems the query_router.infer() seems never been called in client.rs.

This won't help for the nextval(), but if queries aren't getting "infered", make sure you have the query_parser_read_write_splitting = true config entry in your pgcat.toml file.

i'm facing the same issue guys, any update on this ? or quickwin can help us ?
Thank you :)