Automatic failover with repmgrd

repmgrd automatic failover Automatic failover with repmgrd repmgrd is a management and monitoring daemon which runs on each node in a replication cluster. It can automate actions such as failover and updating standbys to follow the new primary, as well as providing monitoring information about the state of each standby. repmgrd witness server witness server repmgrd Using a witness server A is a normal PostgreSQL instance which is not part of the streaming replication cluster; its purpose is, if a failover situation occurs, to provide proof that it is the primary server itself which is unavailable, rather than e.g. a network split between different physical locations. A typical use case for a witness server is a two-node streaming replication setup, where the primary and standby are in different locations (data centres). By creating a witness server in the same location (data centre) as the primary, if the primary becomes unavailable it's possible for the standby to decide whether it can promote itself without risking a "split brain" scenario: if it can't see either the witness or the primary server, it's likely there's a network-level interruption and it should not promote itself. If it can see the witness but not the primary, this proves there is no network interruption and the primary itself is unavailable, and it can therefore promote itself (and ideally take action to fence the former primary). Never install a witness server on the same physical host as another node in the replication cluster managed by &repmgr; - it's essential the witness is not affected in any way by failure of another node. For more complex replication scenarios,e.g. with multiple datacentres, it may be preferable to use location-based failover, which ensures that only nodes in the same location as the primary will ever be promotion candidates; see for more details. A witness server will only be useful if repmgrd is in use. Creating a witness server To create a witness server, set up a normal PostgreSQL instance on a server in the same physical location as the cluster's primary server. This instance should not be on the same physical host as the primary server, as otherwise if the primary server fails due to hardware issues, the witness server will be lost too. &repmgr; 3.3 and earlier provided a repmgr create witness command, which would automatically create a PostgreSQL instance. However this often resulted in an unsatisfactory, hard-to-customise instance. The witness server should be configured in the same way as a normal &repmgr; node; see section . Register the witness server with . This will create the &repmgr; extension on the witness server, and make a copy of the &repmgr; metadata. As the witness server is not part of the replication cluster, further changes to the &repmgr; metadata will be synchronised by repmgrd. Once the witness server has been configured, repmgrd should be started. To unregister a witness server, use . repmgrd network splits network splits Handling network splits with repmgrd A common pattern for replication cluster setups is to spread servers over more than one datacentre. This can provide benefits such as geographically- distributed read replicas and DR (disaster recovery capability). However this also means there is a risk of disconnection at network level between datacentre locations, which would result in a split-brain scenario if servers in a secondary data centre were no longer able to see the primary in the main data centre and promoted a standby among themselves. &repmgr; enables provision of "" to artificially create a quorum of servers in a particular location, ensuring that nodes in another location will not elect a new primary if they are unable to see the majority of nodes. However this approach does not scale well, particularly with more complex replication setups, e.g. where the majority of nodes are located outside of the primary datacentre. It also means the witness node needs to be managed as an extra PostgreSQL instance outside of the main replication cluster, which adds administrative and programming complexity. repmgr4 introduces the concept of location: each node is associated with an arbitrary location string (default is default); this is set in repmgr.conf, e.g.: node_id=1 node_name=node1 conninfo='host=node1 user=repmgr dbname=repmgr connect_timeout=2' data_directory='/var/lib/postgresql/data' location='dc1' In a failover situation, repmgrd will check if any servers in the same location as the current primary node are visible. If not, repmgrd will assume a network interruption and not promote any node in any other location (it will however enter degraded monitoring mode until a primary becomes visible). repmgrd standby disconnection on failover standby disconnection on failover Standby disconnection on failover If is set to true in repmgr.conf, in a failover situation repmgrd will forcibly disconnect the local node's WAL receiver before making a failover decision. is available from PostgreSQL 9.5 and later. Additionally this requires that the repmgr database user is a superuser. By doing this, it's possible to ensure that, at the point the failover decision is made, no nodes are receiving data from the primary and their LSN location will be static. must be set to the same value on all nodes. Note that when using there will be a delay of 5 seconds plus however many seconds it takes to confirm the WAL receiver is disconnected before repmgrd proceeds with the failover decision. Following the failover operation, no matter what the outcome, each node will reconnect its WAL receiver. repmgrd failover validation failover validation Failover validation From repmgr 4.3, &repmgr; makes it possible to provide a script to repmgrd which, in a failover situation, will be executed by the promotion candidate (the node which has been selected to be the new primary) to confirm whether the node should actually be promoted. To use this, in repmgr.conf to a script executable by the postgres system user, e.g.: failover_validation_command=/path/to/script.sh %n %a The %n parameter will be replaced with the node ID, and the %a parameter will be replaced by the node name when the script is executed. This script must return an exit code of 0 to indicate the node should promote itself. Any other value will result in the promotion being aborted and the election rerun. There is a pause of seconds before the election is rerun. Sample repmgrd log file output during which the failover validation script rejects the proposed promotion candidate: [2019-03-13 21:01:30] [INFO] visible nodes: 2; total nodes: 2; no nodes have seen the primary within the last 4 seconds [2019-03-13 21:01:30] [NOTICE] promotion candidate is "node2" (ID: 2) [2019-03-13 21:01:30] [NOTICE] executing "failover_validation_command" [2019-03-13 21:01:30] [DETAIL] /usr/local/bin/failover-validation.sh 2 [2019-03-13 21:01:30] [INFO] output returned by failover validation command: Node ID: 2 [2019-03-13 21:01:30] [NOTICE] failover validation command returned a non-zero value: "1" [2019-03-13 21:01:30] [NOTICE] promotion candidate election will be rerun [2019-03-13 21:01:30] [INFO] 1 followers to notify [2019-03-13 21:01:30] [NOTICE] notifying node "node3" (node ID: 3) to rerun promotion candidate selection INFO: node 3 received notification to rerun promotion candidate election [2019-03-13 21:01:30] [NOTICE] rerunning election after 15 seconds ("election_rerun_interval") repmgrd cascading replication cascading replication repmgrd repmgrd and cascading replication Cascading replication - where a standby can connect to an upstream node and not the primary server itself - was introduced in PostgreSQL 9.2. &repmgr; and repmgrd support cascading replication by keeping track of the relationship between standby servers - each node record is stored with the node id of its upstream ("parent") server (except of course the primary server). In a failover situation where the primary node fails and a top-level standby is promoted, a standby connected to another standby will not be affected and continue working as normal (even if the upstream standby it's connected to becomes the primary node). If however the node's direct upstream fails, the "cascaded standby" will attempt to reconnect to that node's parent (unless failover is set to manual in repmgr.conf).