Automatic failover with repmgrd

repmgrd is a management and monitoring daemon which runs on each node in a replication cluster. It can automate actions such as failover and updating standbys to follow the new primary, as well as providing monitoring information about the state of each standby.

Using a witness server with repmgrd

In a situation caused e.g. by a network interruption between two data centres, it's important to avoid a "split-brain" situation where both sides of the network assume they are the active segment, and the side without an active primary unilaterally promotes one of its standbys.

To prevent this happening, it's essential to ensure that one network segment has a "voting majority", so other segments will know they're in the minority and not attempt to promote a new primary. Where an odd number of servers exists, this is not an issue. However, if each network segment has an even number of nodes, it's necessary to provide some way of ensuring a majority, which is where the witness server becomes useful.

The witness is not a fully-fledged standby node and is not integrated into replication, but it effectively represents the "casting vote" when deciding which network segment has a majority. A witness server can be set up using repmgr witness register; see also the section Using a witness server. It only makes sense to create a witness server in conjunction with running repmgrd; the witness server will require its own repmgrd instance. (A registration sketch is shown at the end of the following section.)

Handling network splits with repmgrd

A common pattern for replication cluster setups is to spread servers over more than one datacentre. This can provide benefits such as geographically distributed read replicas and disaster recovery (DR) capability. However this also means there is a risk of disconnection at network level between datacentre locations, which would result in a split-brain scenario if servers in a secondary datacentre were no longer able to see the primary in the main datacentre and promoted a standby among themselves.

repmgr enables provision of a "witness server" to artificially create a quorum of servers in a particular location, ensuring that nodes in another location will not elect a new primary if they are unable to see the majority of nodes. However this approach does not scale well, particularly with more complex replication setups, e.g. where the majority of nodes are located outside of the primary datacentre. It also means the witness node needs to be managed as an extra PostgreSQL instance outside of the main replication cluster, which adds administrative and programming complexity.

repmgr 4 introduces the concept of location: each node is associated with an arbitrary location string (default is "default"); this is set in repmgr.conf, e.g.:

    node_id=1
    node_name=node1
    conninfo='host=node1 user=repmgr dbname=repmgr connect_timeout=2'
    data_directory='/var/lib/postgresql/data'
    location='dc1'

In a failover situation, repmgrd will check if any servers in the same location as the current primary node are visible. If not, repmgrd will assume a network interruption and not promote any node in any other location (it will however enter degraded monitoring mode until a primary becomes visible).
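To illustrate how locations and a witness fit together, here is a minimal sketch: a standby in a second datacentre and a witness registered alongside the primary. Hostnames, node IDs and paths are illustrative; consult the repmgr witness register documentation for the exact options available in your version.

    # repmgr.conf on a standby in the second datacentre: same format as the
    # example above, but with a different location string
    node_id=2
    node_name=node2
    conninfo='host=node2 user=repmgr dbname=repmgr connect_timeout=2'
    data_directory='/var/lib/postgresql/data'
    location='dc2'

    # On the witness server (placed in "dc1" alongside the primary), register
    # the witness by pointing it at the current primary ("-h node1" here is the
    # primary's host), then start the witness's own repmgrd instance
    repmgr -f /etc/repmgr.conf witness register -h node1
    repmgrd -f /etc/repmgr.conf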
Failover validation

From repmgr 4.3, repmgr makes it possible to provide a script to repmgrd which, in a failover situation, will be executed by the promotion candidate (the node which has been selected to be the new primary) to confirm whether the node should actually be promoted.

To use this, set failover_validation_command in repmgr.conf to a script executable by the postgres system user, e.g.:

    failover_validation_command=/path/to/script.sh %n %a

The %n parameter will be replaced with the node ID, and the %a parameter with the node name, when the script is executed. This script must return an exit code of 0 to indicate the node should promote itself. Any other value will result in the promotion being aborted and the election rerun. There is a pause of election_rerun_interval seconds before the election is rerun. (A minimal sketch of such a script is shown at the end of this chapter.)

Sample repmgrd log output from a failover event in which the failover validation script rejects the proposed promotion candidate:

    [2019-03-13 21:01:30] [INFO] visible nodes: 2; total nodes: 2; no nodes have seen the primary within the last 4 seconds
    [2019-03-13 21:01:30] [NOTICE] promotion candidate is "node2" (ID: 2)
    [2019-03-13 21:01:30] [NOTICE] executing "failover_validation_command"
    [2019-03-13 21:01:30] [DETAIL] /usr/local/bin/failover-validation.sh 2
    [2019-03-13 21:01:30] [INFO] output returned by failover validation command: Node ID: 2
    [2019-03-13 21:01:30] [NOTICE] failover validation command returned a non-zero value: "1"
    [2019-03-13 21:01:30] [NOTICE] promotion candidate election will be rerun
    [2019-03-13 21:01:30] [INFO] 1 followers to notify
    [2019-03-13 21:01:30] [NOTICE] notifying node "node3" (node ID: 3) to rerun promotion candidate selection
    INFO: node 3 received notification to rerun promotion candidate election
    [2019-03-13 21:01:30] [NOTICE] rerunning election after 15 seconds ("election_rerun_interval")

repmgrd and cascading replication

Cascading replication - where a standby can connect to an upstream node rather than the primary server itself - was introduced in PostgreSQL 9.2. repmgr and repmgrd support cascading replication by keeping track of the relationship between standby servers - each node record is stored with the node ID of its upstream ("parent") server (except of course the primary server).

In a failover situation where the primary node fails and a top-level standby is promoted, a standby connected to another standby will not be affected and will continue working as normal (even if the upstream standby it's connected to becomes the primary node). If however the node's direct upstream fails, the "cascaded standby" will attempt to reconnect to that node's parent (unless failover is set to manual in repmgr.conf).
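As an illustration of the failover_validation_command interface described above, here is a minimal sketch of a validation script. The maintenance flag file is a hypothetical local policy, chosen only to show the mechanism; the contract repmgrd imposes is the exit code (0 means promote, anything else means abort and rerun the election).

    #!/bin/sh
    # Hypothetical failover validation script, invoked by repmgrd as
    #   /path/to/script.sh <node id> <node name>    (via %n and %a)
    NODE_ID="$1"
    NODE_NAME="$2"

    echo "Node ID: ${NODE_ID}"

    # Example local policy: refuse promotion while a maintenance flag exists
    # (the flag file path is illustrative)
    if [ -f /var/tmp/repmgr-block-promotion ]; then
        echo "promotion of \"${NODE_NAME}\" blocked by maintenance flag"
        exit 1
    fi

    # Exit 0: repmgrd will proceed with promoting this node
    exit 0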
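The upstream ("parent") relationship that repmgr tracks for cascading replication is established when a standby is cloned and registered. A sketch, assuming node2 (node ID 2) is the intended upstream; hostnames, paths and connection options are illustrative:

    # Clone a new standby from node2 and record node2 (ID 2) as its upstream,
    # then register it so the repmgr metadata reflects the cascaded topology
    repmgr -h node2 -U repmgr -d repmgr -f /etc/repmgr.conf \
        standby clone --upstream-node-id=2
    pg_ctl -D /var/lib/postgresql/data start
    repmgr -f /etc/repmgr.conf standby register --upstream-node-id=2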