ClusterLabs/PAF

Stale DB instance being promoted leads to data loss

gavinThinking opened this issue · 12 comments

Hi,
I found a bug in recent testing and I've provided a suggested fix. Could you help me assess whether this solution is viable? Thank you.

Test environment

3-node cluster with DB synchronous replication enabled.

Full List of Resources:
Clone Set: pgsql-ha(test-db)  (promotable):
    * test-db    (ocf::***:pgsqlms):      Slave server_01
    * test-db    (ocf::***:pgsqlms):      Master server_02
    * test-db    (ocf::***:pgsqlms):      Slave server_03
Node Attributes:
  * Node: server_03:     
    * master-score           	: 1000      
  * Node: server_01:   
    * master-score          	: 990       
  * Node: server_02:  
    * master-score          	: 1001      
	synchronous_standby_names                                             
---------------------------------------------------------------------------------------------------------------------
 ANY 1 (server_01,server_03)

Test step

  1. Kill the server_01 database instance. The cluster can still process read and write requests without any issues.

  2. Send a write request, such as creating a new user.
    The user creation was successful, and we can retrieve the record from the database.

  3. Kill the remaining two database instances Master server_02 and Slave server_03

  4. When these 3 database instances recover, during the master election process, I noticed that all three instances have the same "lsn_location" value, which is the starting point of the last segment in the local pg_wal directory.

    $node_lsn = _get_last_received_lsn( 'in decimal' );

  5. In our logic, we added a rule that when all three "lsn_location" values are the same, we choose the database instance with the best performance as the master (a rough sketch of this tie-breaker is shown right after this list). So if server_01 has the best performance, it is elected as the new master. At this point, we noticed that the user data we created earlier is lost.
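For illustration, here is a rough Perl sketch of the tie-breaker we added. This is illustrative only, not the actual pgsqlms code; the performance metric is our own and is represented here by a plain number.

# Illustrative sketch of our custom tie-breaker, NOT the pgsqlms code.
# Each node is described by its name, its LSN (in decimal) and a
# hypothetical performance metric of our own.
sub pick_candidate {
    my @nodes = @_;    # e.g. { name => 'server_01', lsn => 4362076160, perf => 98 }

    my @sorted = sort {
           $b->{lsn}  <=> $a->{lsn}     # highest LSN first
        || $b->{perf} <=> $a->{perf}    # then highest performance metric
    } @nodes;

    return $sorted[0]{name};
}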

Test logs

Master server_02
Instance "test-db" controldata indicates a running primary instance, the instance has probably crashed

Oct 08 09:14:53.390237 server_02 pgsqlms(test-db)[56347]: INFO: pgsql_monitor: instance "test-db" is not listening
Oct 08 09:14:53.439523 server_02 pgsqlms(test-db)[56368]: INFO: _confirm_stopped: no postmaster process found for instance "test-db"
Oct 08 09:14:53.498487 server_02 pgsqlms(test-db)[56387]: INFO: _controldata: instance "test-db" state is "in production"
Oct 08 09:14:53.509602 server_02 pgsqlms(test-db)[56392]: ERROR: Instance "test-db" controldata indicates a running primary instance, the instance has probably crashed

Setting lsn_location
All three instances have the same "lsn_location" value, which is the starting point of the last segment in the local pg_wal directory: hexadecimal 104000000, decimal 4362076160.
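(A small Perl sketch of that conversion, assuming the usual high<<32 + low interpretation of an LSN; this is an illustration, not the pgsqlms conversion code.)

use strict;
use warnings;

# Convert a pg_lsn string such as '1/04000000' to the decimal form seen in
# the lsn_location attribute above.
my $lsn = '1/04000000';
my ( $high, $low ) = split '/', $lsn;
my $decimal = ( hex($high) << 32 ) + hex($low);
print "$decimal\n";    # 4362076160, i.e. hexadecimal 104000000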

Oct 08 09:16:22.002567 server_02 pgsqlms(test-db)[64293]: INFO: Promoting instance on node "server_02"
Oct 08 09:16:22.123181 server_02 pgsqlms(test-db)[64314]: INFO: Current node TL#LSN: 6#4362076160
Oct 08 09:16:22.129258 server_02 pacemaker-attrd[21903]:  notice: Setting lsn_location-test-db[server_02]: (unset) -> 6#4362076160
Oct 08 09:16:22.141162 server_02 pacemaker-attrd[21903]:  notice: Setting nodes-test-db[server_02]: (unset) -> server_01 server_02 server_03
Oct 08 09:16:22.158226 server_02 pacemaker-attrd[21903]:  notice: Setting lsn_location-test-db[server_01]: (unset) -> 6#4362076160
Oct 08 09:16:22.426240 server_02 pacemaker-attrd[21903]:  notice: Setting lsn_location-test-db[server_03]: (unset) -> 6#4362076160
Oct 08 09:16:22.568704 server_02 pgsqlms(test-db)[64354]: INFO: Action: "promote"
Oct 08 09:16:22.668577 server_02 pgsqlms(test-db)[64370]: WARNING: _confirm_role: secondary not streaming wal from primary
Oct 08 09:16:22.670723 server_02 pgsqlms(test-db)[64371]: INFO: pgsql_promote: "test-db" currently running as a standby
Oct 08 09:16:22.686690 server_02 pgsqlms(test-db)[64374]: INFO: pgsql_promote: checking if current node is the best candidate for promotion
Oct 08 09:16:22.704388 server_02 pgsqlms(test-db)[64377]: INFO: pgsql_promote: current node TL#LSN location: 6#4362076160
Oct 08 09:16:22.714073 server_02 pgsqlms(test-db)[64379]: INFO: pgsql_promote: current node score: 90
Oct 08 09:16:22.994833 server_02 pgsqlms(test-db)[64416]: INFO: pgsql_promote: comparing with "server_01": TL#LSN is 6#4362076160
Oct 08 09:16:23.037057 server_02 pgsqlms(test-db)[64426]: INFO: pgsql_promote: "server_01" has a matching TL#LSN, also checking node score
Oct 08 09:16:23.038658 server_02 pgsqlms(test-db)[64427]: INFO: pgsql_promote: comparing with "server_01": node score is 100
Oct 08 09:16:23.041612 server_02 pgsqlms(test-db)[64429]: INFO: pgsql_promote: "server_01" is a better candidate to promote (node score > server_02)
Oct 08 09:16:23.375242 server_02 pgsqlms(test-db)[64513]: INFO: pgsql_promote: comparing with "server_03": TL#LSN is 6#4362076160
Oct 08 09:16:23.384625 server_02 pgsqlms(test-db)[64515]: INFO: pgsql_promote: "server_03" has a matching TL#LSN, also checking node score
Oct 08 09:16:23.386169 server_02 pgsqlms(test-db)[64516]: INFO: pgsql_promote: comparing with "server_03": node score is 80
Oct 08 09:16:23.395775 server_02 pgsqlms(test-db)[64518]: ERROR: server_01 is the best candidate to promote, aborting current promotion
Oct 08 09:16:23.586478 server_02 pgsqlms(test-db)[64548]: INFO: pgsql_promote: move master role to server_01

Root cause analysis

The root cause of this issue is the incorrect use of pg_last_wal_receive_lsn() for elections, which leads to a wrong election result and subsequent data loss.
When a database instance starts as a standby, pg_last_wal_receive_lsn() is initialized to the starting point of the last segment in the local pg_wal folder.
https://github.com/postgres/postgres/blob/e7689190b3d58404abbafe2d3312c3268a51cca3/src/backend/access/transam/xlogfuncs.c#L343
https://github.com/postgres/postgres/blob/e7689190b3d58404abbafe2d3312c3268a51cca3/src/backend/replication/walreceiverfuncs.c#L332
https://github.com/postgres/postgres/blob/e7689190b3d58404abbafe2d3312c3268a51cca3/src/backend/replication/walreceiverfuncs.c#L302
If this is the first startup of walreceiver (on this timeline), initialize flushedUpto and latestChunkStart to the starting point.

/*
 * Report the last WAL receive location (same format as pg_backup_start etc)
 *
 * This is useful for determining how much of WAL is guaranteed to be received
 * and synced to disk by walreceiver.
 */
Datum
pg_last_wal_receive_lsn(PG_FUNCTION_ARGS)
{
	XLogRecPtr	recptr;

	recptr = GetWalRcvFlushRecPtr(NULL, NULL);

	if (recptr == 0)
		PG_RETURN_NULL();

	PG_RETURN_LSN(recptr);
}

/*
 * Returns the last+1 byte position that walreceiver has flushed.
 *
 * Optionally, returns the previous chunk start, that is the first byte
 * written in the most recent walreceiver flush cycle.  Callers not
 * interested in that value may pass NULL for latestChunkStart. Same for
 * receiveTLI.
 */
XLogRecPtr
GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
{
	WalRcvData *walrcv = WalRcv;
	XLogRecPtr	recptr;

	SpinLockAcquire(&walrcv->mutex);
	recptr = walrcv->flushedUpto;  /* <-- value returned by pg_last_wal_receive_lsn() */
	if (latestChunkStart)
		*latestChunkStart = walrcv->latestChunkStart;
	if (receiveTLI)
		*receiveTLI = walrcv->receivedTLI;
	SpinLockRelease(&walrcv->mutex);

	return recptr;
}

	/*
	 * If this is the first startup of walreceiver (on this timeline),
	 * initialize flushedUpto and latestChunkStart to the starting point.
	 */
	if (walrcv->receiveStart == 0 || walrcv->receivedTLI != tli)
	{
		walrcv->flushedUpto = recptr;   
		walrcv->receivedTLI = tli;
		walrcv->latestChunkStart = recptr;
	}

Question: In what scenario does pg_last_wal_receive_lsn() return an incorrect value? In other words, when is pg_last_wal_receive_lsn() not what we need?

Answer: When Postgres restarts as a standby, the WAL receiver process is started and pg_last_wal_receive_lsn() is initialized to the starting point of the last segment in the local WAL directory.

For example, if the last segment in the local pg_wal directory is "000000070000000000000058", then pg_last_wal_receive_lsn() will be initialized to 0x58000000 (i.e. 0/58000000).
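(A small Perl sketch of that initial value, assuming the default 16 MB wal_segment_size; the computation is illustrative and not part of pgsqlms.)

use strict;
use warnings;

# Compute the starting LSN of a WAL segment from its 24-character file name,
# e.g. '000000070000000000000058' -> '0/58000000'.
my $walfile = '000000070000000000000058';
my ( $tli, $log, $seg ) = unpack '(A8)3', $walfile;    # timeline, high 32 bits, segment number
my $seg_size  = 16 * 1024 * 1024;                      # assumes default wal_segment_size
my $start_lsn = sprintf '%X/%08X', hex($log), hex($seg) * $seg_size;
print "$start_lsn\n";    # 0/58000000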

Proposed solution

Use pg_last_wal_replay_lsn() instead of pg_last_wal_receive_lsn() to get the lsn_location.
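A minimal sketch of what that change means at the query level, using plain DBI for illustration (connection parameters below are placeholders; the actual pgsqlms helper and connection handling differ):

use strict;
use warnings;
use DBI;

# Query the replay LSN (in decimal) instead of the receive LSN.
# Connection parameters are placeholders for this sketch.
my $dbh = DBI->connect( 'dbi:Pg:dbname=postgres;host=/tmp;port=5432',
    'postgres', '', { RaiseError => 1 } );

my ($replay_lsn_dec) = $dbh->selectrow_array(
    q{SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(), '0/0')}
);

print "replay LSN (decimal): $replay_lsn_dec\n";
$dbh->disconnect;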

My questions

Why did you choose to use pg_last_wal_receive_lsn() instead of pg_last_wal_replay_lsn() in the lsn_location design? Was there a particular reason for this choice? Thanks in advance.

Best Regards
Gavin

ioguix commented

Hi @gavinThinking,

Thank you for your report and analysis.

We chose pg_last_wal_receive_lsn() because a standby can lag behind on replaying WAL, e.g. because of its read-only activity. That means the standby that received more data from the primary than the others might have replayed less of it at the time of the monitor or promote action.

During the promotion, we really want to choose the standby that received the most data from the primary. Even if it has not replayed all of it yet, the promotion will actually wait for it to replay everything before returning.
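(For illustration only, that waiting behaviour boils down to polling until the replay LSN catches up with the received LSN, roughly as below; this is a simplified sketch, not the actual pgsqlms implementation. $dbh is assumed to be an open DBI handle to the local standby.)

use Time::HiRes qw(sleep);

# Simplified sketch of "wait until everything received is replayed".
sub wait_for_replay {
    my ($dbh) = @_;

    while (1) {
        my ($lag) = $dbh->selectrow_array(q{
            SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                                   pg_last_wal_replay_lsn())
        });
        last if defined $lag and $lag <= 0;    # all received WAL has been replayed
        sleep 0.5;
    }
}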

In your scenario, you actually have multiple failures:

  1. you have one failing standby node (1st failure)
  2. you have some more writes
  3. then the two other nodes fail as well (2nd failure)
  4. all of them come back and the cluster chooses to promote the first node that failed, based on a performance assumption.

I understand this could happen in the wild, but it is a somewhat artificial, multi-failure scenario.

Moreover, there's something fishy I don't understand: your server_01 shouldn't have been promoted in the first place. Even if all your nodes were killed, one of them was killed as a primary and the two others as standbys. PAF is able to detect that, at the very least from the promotion scores kept on the cluster side.

So, no matter the performance assumptions, the promotion scores should rule the promotion and choose the previous primary node when you start all your nodes again.

@ioguix Thank you very much for your quick response.
I collected more logs from one of the database instances before the promotion, after an immediate shutdown and restart.

select pg_is_in_recovery(),pg_is_wal_replay_paused(), pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp(), pg_wal_lsn_diff(pg_last_wal_receive_lsn(), '0/0'), pg_wal_lsn_diff(pg_last_wal_replay_lsn(), '0/0');

 pg_is_in_recovery()                               | t
 pg_is_wal_replay_paused()                         | f
 pg_last_wal_receive_lsn()                         | 1/86000000
 pg_last_wal_replay_lsn()                          | 1/862B9CC0
 pg_last_xact_replay_timestamp()                   | 2023-10-13 07:54:21.50669+0
 pg_wal_lsn_diff(pg_last_wal_receive_lsn(), '0/0') | 6543114240
 pg_wal_lsn_diff(pg_last_wal_replay_lsn(), '0/0')  | 6545972416

The last segment in the pg_wal directory is:

Oct 13 08:02 0000000F0000000100000086

By examining the content of 0000000F0000000100000086, we can determine that the starting LSN of the last WAL record is 1/862B96F8 and the ending LSN is 1/862B9CC0.

$ pg_waldump 0000000F0000000100000086 > f86.txt
pg_waldump: fatal: error in WAL record at 1/862B96F8: invalid record length at 1/862B9CC0: wanted 24, got 0
$ vim f86.txt
rmgr: Transaction len (rec/tot): 1473/ 1473, tx: 1804333, lsn: 1/862B96F8, prev 1/862B96A8, desc: COMMIT 2023-10-13 07:54:21.506690 UTC;

When postgres starts in standby mode, it undergoes a recovery process and eventually reaches a consistent state, allowing it to accept read-only connections.

Oct 13 07:55:34 server_02 postgres[122568]: [1-1]  LOG:  starting PostgreSQL 12.15 
Oct 13 07:55:34 server_02 postgres[122582]: [8-1]  LOG:  database system was interrupted; last known up at 2023-10-13 07:50:38 UTC
Oct 13 07:55:34 server_02 postgres[122582]: [9-1]  LOG:  entering standby mode
Oct 13 07:55:34 server_02 postgres[122582]: [10-1]  LOG:  database system was not properly shut down; automatic recovery in progress
Oct 13 07:55:34 server_02 postgres[122582]: [11-1]  LOG:  redo starts at 1/860E4188
Oct 13 07:55:34 server_02 postgres[122582]: [12-1]  LOG:  invalid record length at 1/862B9CC0: wanted 24, got 0
Oct 13 07:55:34 server_02 postgres[122582]: [13-1]  LOG:  consistent recovery state reached at 1/862B9CC0
Oct 13 07:55:34 server_02 postgres[122568]: [8-1]  LOG:  database system is ready to accept read only connections

So in this scenario there is a significant gap between the value returned by pg_last_wal_receive_lsn() and the actual LSN of the last record in the local pg_wal directory.
Additionally, in the same scenario, we can see that the value returned by pg_last_wal_replay_lsn() is correct.
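For the record, the gap can be quantified from the two decimal values above (plain arithmetic, nothing pgsqlms-specific):

use strict;
use warnings;

my $receive = 6543114240;    # pg_wal_lsn_diff(pg_last_wal_receive_lsn(), '0/0')
my $replay  = 6545972416;    # pg_wal_lsn_diff(pg_last_wal_replay_lsn(),  '0/0')
printf "gap: %d bytes (0x%X)\n", $replay - $receive, $replay - $receive;
# gap: 2858176 bytes (0x2B9CC0)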

@ioguix Do you have any good advice for this issue? Thanks.

Regarding the following question, let me explain further.

Moreover, there's something fishy I don't understand: your server_01 shouldn't have been promoted in the first place. Even if all your nodes were killed, one of them was killed as a primary and the two others as standbys. PAF is able to detect that, at the very least from the promotion scores kept on the cluster side.
So, no matter the performance assumptions, the promotion scores should rule the promotion and choose the previous primary node when you start all your nodes again.

Indeed, PAF does attempt to promote the failed master server_02 first because it has the highest master score (1001).

Oct 08 09:16:22.002567 server_02 pgsqlms(test-db)[64293]: INFO: Promoting instance on node "server_02"
Oct 08 09:16:22.123181 server_02 pgsqlms(test-db)[64314]: INFO: Current node TL#LSN: 6#4362076160
Oct 08 09:16:22.129258 server_02 pacemaker-attrd[21903]:  notice: Setting lsn_location-test-db[server_02]: (unset) -> 6#4362076160
Oct 08 09:16:22.141162 server_02 pacemaker-attrd[21903]:  notice: Setting nodes-test-db[server_02]: (unset) -> server_01 server_02 server_03
Oct 08 09:16:22.158226 server_02 pacemaker-attrd[21903]:  notice: Setting lsn_location-test-db[server_01]: (unset) -> 6#4362076160
Oct 08 09:16:22.426240 server_02 pacemaker-attrd[21903]:  notice: Setting lsn_location-test-db[server_03]: (unset) -> 6#4362076160

When the lsn_location values are the same, the condition we added results in the node with the best performance being chosen as the new master.

Oct 08 09:16:22.568704 server_02 pgsqlms(test-db)[64354]: INFO: Action: "promote"
Oct 08 09:16:22.668577 server_02 pgsqlms(test-db)[64370]: WARNING: _confirm_role: secondary not streaming wal from primary
Oct 08 09:16:22.670723 server_02 pgsqlms(test-db)[64371]: INFO: pgsql_promote: "test-db" currently running as a standby
Oct 08 09:16:22.686690 server_02 pgsqlms(test-db)[64374]: INFO: pgsql_promote: checking if current node is the best candidate for promotion
Oct 08 09:16:22.704388 server_02 pgsqlms(test-db)[64377]: INFO: pgsql_promote: current node TL#LSN location: 6#4362076160
Oct 08 09:16:22.714073 server_02 pgsqlms(test-db)[64379]: INFO: pgsql_promote: current node score: 90
Oct 08 09:16:22.994833 server_02 pgsqlms(test-db)[64416]: INFO: pgsql_promote: comparing with "server_01": TL#LSN is 6#4362076160
Oct 08 09:16:23.037057 server_02 pgsqlms(test-db)[64426]: INFO: pgsql_promote: "server_01" has a matching TL#LSN, also checking node score
Oct 08 09:16:23.038658 server_02 pgsqlms(test-db)[64427]: INFO: pgsql_promote: comparing with "server_01": node score is 100
Oct 08 09:16:23.041612 server_02 pgsqlms(test-db)[64429]: INFO: pgsql_promote: "server_01" is a better candidate to promote (node score > server_02)
Oct 08 09:16:23.375242 server_02 pgsqlms(test-db)[64513]: INFO: pgsql_promote: comparing with "server_03": TL#LSN is 6#4362076160
Oct 08 09:16:23.384625 server_02 pgsqlms(test-db)[64515]: INFO: pgsql_promote: "server_03" has a matching TL#LSN, also checking node score
Oct 08 09:16:23.386169 server_02 pgsqlms(test-db)[64516]: INFO: pgsql_promote: comparing with "server_03": node score is 80
Oct 08 09:16:23.395775 server_02 pgsqlms(test-db)[64518]: ERROR: server_01 is the best candidate to promote, aborting current promotion
Oct 08 09:16:23.586478 server_02 pgsqlms(test-db)[64548]: INFO: pgsql_promote: move master role to server_01

Now, another scenario can also potentially lead to data loss.
If all three nodes stop simultaneously (for instance, during a cluster shutdown operation) and the failed master server_02 cannot start due to a fault, we will elect a new master from the other two standbys.
As per my understanding, the master score is updated in pgsql_monitor at 30-second intervals. Because of this long update interval, the master score may not accurately reflect the actual situation.
So, in this scenario, the node with the worst actual LSN might be elected as the new master, which is not the desired outcome.

ioguix commented

As per my understanding, the master score is updated in pgsql_monitor at 30-second intervals. Because of this long update interval, the master score may not accurately reflect the actual situation.
So, in this scenario, the node with the worst actual LSN might be elected as the new master, which is not the desired outcome.

Yes, the situation could change between two monitor actions executed on the primary, so the score might be wrong, and PAF actually handles that. See: https://github.com/ClusterLabs/PAF/blob/master/script/pgsqlms#L1868

These scores on standbys are just recommendations. During the pre-promote action, all nodes share their current LSN. During the promote action, the node-to-promote actually checks that it has the best LSN of all known nodes. If another node has a higher LSN, the promotion scores are updated and the current promotion transition is canceled, leading to a new one.

So, to wrap this up:

  • promotion scores updated during the monitor action are recommendations
  • during the actual promote action, a real election occurs
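A very rough Perl sketch of the election check described above, only to illustrate the control flow (this is NOT the actual pgsqlms code, which also handles scores, attribute updates and transition aborts):

# %lsn_location maps node name to the "TL#LSN" value published during
# the pre-promote action, e.g. 'server_02' => '6#4362076160'.
sub best_candidate {
    my ( $myself, %lsn_location ) = @_;

    # Compare two "TL#LSN" strings numerically: timeline first, then LSN.
    my $cmp = sub {
        my ( $a_tl, $a_lsn ) = split '#', $_[0];
        my ( $b_tl, $b_lsn ) = split '#', $_[1];
        return $a_tl <=> $b_tl || $a_lsn <=> $b_lsn;
    };

    foreach my $node ( keys %lsn_location ) {
        next if $node eq $myself;
        # Another node with a strictly higher TL#LSN means this promotion
        # must be aborted so a new transition can promote that node instead.
        return $node
            if $cmp->( $lsn_location{$node}, $lsn_location{$myself} ) > 0;
    }

    return $myself;    # this node holds the highest TL#LSN, safe to promote
}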
ioguix commented

Oct 08 09:16:23.037057 server_02 pgsqlms(test-db)[64426]: INFO: pgsql_promote: "server_01" has a matching TL#LSN, also checking node score
Oct 08 09:16:23.038658 server_02 pgsqlms(test-db)[64427]: INFO: pgsql_promote: comparing with "server_01": node score is 100

These log messages look dubious to me. Either this is not pristine pgsqlms or my memory is failing hard...

In our logic, we added a rule that when all three "lsn_location" values are the same, we choose the database instance with the best performance as the master.

OK, when reading your first message I did not understand that you were using a patched version of pgsqlms, and it seems your problem boils down to your actual changes.

I wonder if there could be a better way to achieve what you are trying to do without patching pgsqlms. Maybe you should try to play with location scores or node-health-* cluster parameters (I never did) or some colocations with a custom resource agent?

@ioguix Thank you very much for the summary of the master score update scenario.

  • promotion scores updated during the monitor action are recommendations
  • during the actual promote action, a real election occurs

Regarding your suggestion about location scores or node-health-* cluster parameters, I will attempt to replace my patch with one of those approaches and then test it. Thank you.

@ioguix For the issue of pg_last_wal_receive_lsn() not returning the LSN of the last record in the local pg_wal folder when the standby has just started, do you have any suggestion on how to fix it? Thanks in advance.

select pg_is_in_recovery(),pg_is_wal_replay_paused(), pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp(), pg_wal_lsn_diff(pg_last_wal_receive_lsn(), '0/0'), pg_wal_lsn_diff(pg_last_wal_replay_lsn(), '0/0');

 pg_is_in_recovery()                               | t
 pg_is_wal_replay_paused()                         | f
 pg_last_wal_receive_lsn()                         | 1/86000000
 pg_last_wal_replay_lsn()                          | 1/862B9CC0
 pg_last_xact_replay_timestamp()                   | 2023-10-13 07:54:21.50669+0
 pg_wal_lsn_diff(pg_last_wal_receive_lsn(), '0/0') | 6543114240
 pg_wal_lsn_diff(pg_last_wal_replay_lsn(), '0/0')  | 6545972416
ioguix commented

@ioguix For the issue of pg_last_wal_receive_lsn() not returning the LSN of the last record in the local pg_wal folder when the standby has just started, do you have any suggestion on how to fix it? Thanks in advance

No, sorry. This is not an actual issue in the current pristine pgsqlms code, and I would highly recommend you keep things as simple as possible to avoid stacking complexity and bugs on top of each other.

If you really want to pursue the patching route, and again I do not recommend it, you could try to use pg_last_wal_replay_lsn() as long as the standby is NOT connected to a primary, then switch back to pg_last_wal_receive_lsn().
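A minimal sketch of that fallback idea, using pg_stat_wal_receiver to tell whether a walreceiver is currently streaming (illustrative only, not the pgsqlms implementation; $dbh is assumed to be an open DBI handle to the local standby):

# Use pg_last_wal_receive_lsn() while the standby streams from a primary,
# otherwise fall back to pg_last_wal_replay_lsn().
sub get_election_lsn {
    my ($dbh) = @_;

    my ($streaming) = $dbh->selectrow_array(q{
        SELECT count(*) FROM pg_stat_wal_receiver WHERE status = 'streaming'
    });

    my $func = $streaming ? 'pg_last_wal_receive_lsn()'
                          : 'pg_last_wal_replay_lsn()';

    my ($lsn_dec) = $dbh->selectrow_array(
        "SELECT pg_wal_lsn_diff($func, '0/0')"
    );

    return $lsn_dec;
}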

But really, the best you could do is to spend more time looking into Pacemaker-only setup solutions, without patching the resource agent, even if it means moving the primary role to another node without warning, disrupting your production randomly. Mind you, Pacemaker rules could even allow moving the primary role to a better node only during a planned time of the day/week/whatever...

@ioguix, thank you very much for your advice. I also understand your concern about switching to the alternative pg_last_wal_replay_lsn() approach.
I will carefully consider your suggestions, make the necessary adjustments, and conduct testing.
Since you've clarified that this is not a real issue in pgsqlms, I will close this ticket.
Thank you for your help.

ioguix commented

Thanks @gavinThinking.

One last piece of advice: join the ClusterLabs users mailing list and try to discuss your need there. I feel there might be a solution involving only the Pacemaker setup, but I never had to set up such a rule myself. I'm sure some Pacemaker devs / experts might be able to answer this.

Regards,

@ioguix, your advice is greatly appreciated. If I have any further questions regarding Pacemaker rules, I will seek assistance on the ClusterLabs mailing list.
Have a good day.
Best Regards
Gavin

ioguix commented

If I have any further questions regarding Pacemaker rules, I will seek assistance on the ClusterLabs mailing list.

I'll follow the discussion there with interest.

Regards,