
When the node restarts, pg_last_wal_replay_lsn() is used as the LSN location for election #228

Open
gavinThinking opened this issue Apr 9, 2024 · 0 comments

gavinThinking commented Apr 9, 2024

Test environment

A 3-node cluster with synchronous database replication enabled.

Full List of Resources:
Clone Set: pgsql-ha(test-db)  (promotable):
    * test-db    (ocf::***:pgsqlms):      Slave server_01
    * test-db    (ocf::***:pgsqlms):      Master server_02
    * test-db    (ocf::***:pgsqlms):      Slave server_03
Node Attributes:
  * Node: server_03:
    * master-score : 1000
  * Node: server_01:
    * master-score : 990
  * Node: server_02:
    * master-score : 1001

The streaming replication configuration for the two standby servers is set to "ANY 1", meaning that one of them may hold outdated data.

 synchronous_standby_names
------------------------------
 ANY 1 (server_01,server_03)
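
For context, which standbys are currently synchronous candidates can be checked on the primary. A minimal illustrative query against pg_stat_replication (shown only as a sketch; values will differ per setup):

-- On the primary: standbys and their sync state under "ANY 1"
SELECT application_name, state, sync_state, replay_lsn
FROM pg_stat_replication;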

Problem Statement

When the cluster is shut down and then restarted, during the pre-promote stage the LSN location obtained for each node is the starting point of the last segment in its local pg_wal folder, which does not match the actual end of WAL present in that node's pg_wal folder.
The standby's master score is only updated at a certain interval, which creates a time window during which the master score cannot reflect the true lag between a standby and the primary. When the cluster then starts, if the old primary does not come up promptly, the other two nodes take part in the election. Among the nodes with a master score of 1000, the one with the lowest node name (alphanumeric sort) is promoted first. However, at this point the reported LSN location does not reflect the true LSN location, so a stale copy may be elected as the new master, leading to data loss.

Oct 08 09:16:22.002567 server_02 pgsqlms(test-db)[64293]: INFO: Promoting instance on node "server_02"
Oct 08 09:16:22.123181 server_02 pgsqlms(test-db)[64314]: INFO: Current node TL#LSN: 6#4362076160    <<<<<<< 0x104000000
Oct 08 09:16:22.129258 server_02 pacemaker-attrd[21903]:  notice: Setting lsn_location-test-db[server_02]: (unset) -> 6#4362076160  
Oct 08 09:16:22.141162 server_02 pacemaker-attrd[21903]:  notice: Setting nodes-test-db[server_02]: (unset) -> server_01 server_02 server_03
Oct 08 09:16:22.158226 server_02 pacemaker-attrd[21903]:  notice: Setting lsn_location-test-db[server_01]: (unset) -> 6#4362076160     <<<<<<< 0x104000000
Oct 08 09:16:22.426240 server_02 pacemaker-attrd[21903]:  notice: Setting lsn_location-test-db[server_03]: (unset) -> 6#4362076160     <<<<<<< 0x104000000
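
For reference, the lsn_location attribute above stores the LSN part in decimal form, and decoding it shows it sits exactly on a segment boundary. A quick psql illustration (assuming the default 16 MB wal_segment_size):

-- 4362076160 = 0x104000000, i.e. LSN 1/4000000, exactly the start of a
-- 16 MB WAL segment (the low six hex digits are zero)
SELECT to_hex(4362076160::bigint) AS lsn_hex;
--  lsn_hex
-- -----------
--  104000000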

Root cause analysis

If the standby instance has no primary instance to synchronize with, pg_last_wal_receive_lsn() keeps returning its initial value, which is the starting point of the last segment in the local pg_wal folder.
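
This can be observed directly on such a standby. A sketch of the check (assuming the default 16 MB wal_segment_size, so a segment-aligned LSN has its low six hex digits equal to zero):

-- On a standby that has not reconnected to any primary since its restart
SELECT pg_last_wal_receive_lsn() AS receive_lsn,
       (pg_last_wal_receive_lsn() - '0/0'::pg_lsn) % 16777216 = 0 AS at_segment_start,
       pg_last_wal_replay_lsn() AS replay_lsn;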

Solution

In most cases, pg_last_wal_receive_lsn() accurately reports the last received LSN location.

We choose pg_last_wal_receive_lsn() because a standby can lag in replaying WAL, e.g. due to its read-only activity. That means the standby that received more data from the primary than the others might have replayed less of it during the monitor or promote action.
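
To illustrate that normal case, both values can be compared on a standby; under read-only load the replayed position typically trails the received one (illustrative query only):

SELECT pg_last_wal_receive_lsn() AS receive_lsn,
       pg_last_wal_replay_lsn()  AS replay_lsn,
       pg_last_wal_receive_lsn() - pg_last_wal_replay_lsn() AS replay_lag_bytes;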

Therefore, we still use pg_last_wal_receive_lsn() to obtain the LSN location in most situations.
In the scenario where the entire cluster has just restarted: if the last three bytes (six hexadecimal digits) of the last received LSN are zero, the LSN is the starting point of the last WAL segment in the local pg_wal folder and therefore does not reflect the node's real position.
In this case, we also query the last replayed LSN and compare it with the last received LSN. If the last replayed LSN is greater than the last received LSN, we use the last replayed LSN as the LSN location.
Note: the scope of the change is limited to the cluster-restart scenario.

When the standby is restarted, it must replay the transaction log to bring the database back to a consistent state, so in this scenario the last replayed LSN is accurate.

 pg_is_in_recovery() | pg_last_wal_receive_lsn() | pg_last_wal_replay_lsn()
---------------------+---------------------------+--------------------------
 t                   | 1/86000000                | 1/862B9CC0
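
The proposed selection rule can be sketched as a single query. This is only an illustration of the logic described above, not the actual change in the pgsqlms agent, and it assumes the default 16 MB wal_segment_size:

-- If the received LSN sits exactly on a segment boundary and the replayed
-- LSN is ahead of it, fall back to the replayed LSN for the election.
SELECT CASE
         WHEN (pg_last_wal_receive_lsn() - '0/0'::pg_lsn) % 16777216 = 0
          AND pg_last_wal_replay_lsn() > pg_last_wal_receive_lsn()
         THEN pg_last_wal_replay_lsn()
         ELSE pg_last_wal_receive_lsn()
       END AS election_lsn;

Applied to the values above, 1/86000000 is segment-aligned while 1/862B9CC0 is ahead of it, so the replayed LSN would be used.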

@ioguix We discussed this issue before in #225, and I have now limited the change solely to the cluster startup scenario. Please take another look at my PR #227; I really appreciate it.
