Automatic failover with repmgrd
&repmgrd; is a management and monitoring daemon which runs
on each node in a replication cluster. It can automate actions such as
failover and updating standbys to follow the new primary, as well as
providing monitoring information about the state of each standby.
Using a witness server
A witness server is a normal PostgreSQL instance which
is not part of the streaming replication cluster; its purpose is, if a
failover situation occurs, to provide proof that it is the primary server
itself which is unavailable, rather than e.g. a network split between
different physical locations.
A typical use case for a witness server is a two-node streaming replication
setup, where the primary and standby are in different locations (data centres).
By creating a witness server in the same location (data centre) as the primary,
if the primary becomes unavailable it's possible for the standby to decide whether
it can promote itself without risking a "split brain" scenario: if it can't see either the
witness or the primary server, it's likely there's a network-level interruption
and it should not promote itself. If it can see the witness but not the primary,
this proves there is no network interruption and the primary itself is unavailable,
and it can therefore promote itself (and ideally take action to fence the
former primary).
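The decision logic described above can be sketched as follows (a simplified illustration only; repmgrd implements this internally, and the function name here is invented):

```shell
# decide_promotion: given whether the standby can currently reach the
# primary and the witness, print the action the standby should take.
decide_promotion() {
    can_see_primary="$1"   # "yes" or "no"
    can_see_witness="$2"   # "yes" or "no"
    if [ "$can_see_primary" = "yes" ]; then
        echo "primary visible: no failover required"
    elif [ "$can_see_witness" = "yes" ]; then
        echo "primary down but witness visible: promote standby"
    else
        echo "neither visible: possible network split, do not promote"
    fi
}

decide_promotion no yes
```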
Never install a witness server on the same physical host
as another node in the replication cluster managed by &repmgr; - it's essential
the witness is not affected in any way by failure of another node.
For more complex replication scenarios, e.g. with multiple datacentres, it may
be preferable to use location-based failover, which ensures that only nodes
in the same location as the primary will ever be promotion candidates;
see the section on handling network splits below for more details.
A witness server will only be useful if &repmgrd;
is in use.
Creating a witness server
To create a witness server, set up a normal PostgreSQL instance on a server
in the same physical location as the cluster's primary server.
This instance should not be on the same physical host as the primary server,
as otherwise if the primary server fails due to hardware issues, the witness
server will be lost too.
A PostgreSQL instance can only accommodate a single witness server.
If you are planning to use a single server to support more than one
witness server, a separate PostgreSQL instance is required for each
witness server in use.
The witness server should be configured in the same way as a normal
&repmgr; node; see the configuration section for details.
Register the witness server with repmgr witness register.
This will create the &repmgr; extension on the witness server, and make
a copy of the &repmgr; metadata.
As the witness server is not part of the replication cluster, further
changes to the &repmgr; metadata will be synchronised by
&repmgrd;.
Once the witness server has been configured, &repmgrd;
should be started.
To unregister a witness server, use repmgr witness unregister.
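Assuming repmgr.conf is already in place on the witness host, the steps above might look like this (hostnames and paths are illustrative, and a live cluster is required):

```shell
# Register the witness, supplying the connection details of the current
# primary (here assumed to be reachable as "node1"):
repmgr -f /etc/repmgr.conf witness register -h node1

# Start the daemon on the witness:
repmgrd -f /etc/repmgr.conf

# Later, to unregister the witness:
repmgr -f /etc/repmgr.conf witness unregister
```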
Handling network splits with repmgrd
A common pattern for replication cluster setups is to spread servers over
more than one datacentre. This can provide benefits such as geographically
distributed read replicas and DR (disaster recovery) capability. However
this also means there is a risk of disconnection at network level between
datacentre locations, which would result in a split-brain scenario if
servers in a secondary data centre were no longer able to see the primary
in the main data centre and promoted a standby among themselves.
&repmgr; enables provision of a "witness server" to
artificially create a quorum of servers in a particular location, ensuring
that nodes in another location will not elect a new primary if they
are unable to see the majority of nodes. However this approach does not
scale well, particularly with more complex replication setups, e.g.
where the majority of nodes are located outside of the primary datacentre.
It also means the witness node needs to be managed as an
extra PostgreSQL instance outside of the main replication cluster, which
adds administrative and programming complexity.
&repmgr; 4 introduces the concept of location:
each node is associated with an arbitrary location string (default is
default); this is set in repmgr.conf, e.g.:
node_id=1
node_name=node1
conninfo='host=node1 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/postgresql/data'
location='dc1'
In a failover situation, &repmgrd; will check if any servers in the
same location as the current primary node are visible. If not, &repmgrd;
will assume a network interruption and not promote any node in any
other location (it will however enter degraded monitoring
mode until a primary becomes visible).
Primary visibility consensus
In more complex replication setups, particularly where replication occurs between
multiple datacentres, it's possible that some but not all standbys get cut off from the
primary (but not from the other standbys).
In this situation, normally it's not desirable for any of the standbys which have been
cut off to initiate a failover, as the primary is still functioning and standbys are
connected. Beginning with &repmgr; 4.4,
it is possible for the affected standbys to build a consensus about whether
the primary is still available to some standbys ("primary visibility consensus").
This is done by polling each standby (and the witness, if present) for the time it last saw the
primary; if any have seen the primary very recently, it's reasonable
to infer that the primary is still available and a failover should not be started.
The time the primary was last seen by each node can be checked by executing
repmgr service status
(&repmgr; 4.2 - 4.4: repmgr daemon status)
which includes this in its output, e.g.:
$ repmgr -f /etc/repmgr.conf service status
ID | Name | Role | Status | Upstream | repmgrd | PID | Paused? | Upstream last seen
----+-------+---------+-----------+----------+---------+-------+---------+--------------------
1 | node1 | primary | * running | | running | 27259 | no | n/a
2 | node2 | standby | running | node1 | running | 27272 | no | 1 second(s) ago
3 | node3 | standby | running | node1 | running | 27282 | no | 0 second(s) ago
4 | node4 | witness | * running | node1 | running | 27298 | no | 1 second(s) ago
To enable this functionality, in repmgr.conf set:
primary_visibility_consensus=true
This parameter must be set to true on all nodes for it to be effective.
The following sample &repmgrd; log output demonstrates the behaviour in a situation
where one of three standbys is no longer able to connect to the primary, but can
connect to the two other standbys ("sibling nodes"):
[2019-05-17 05:36:12] [WARNING] unable to reconnect to node 1 after 3 attempts
[2019-05-17 05:36:12] [INFO] 2 active sibling nodes registered
[2019-05-17 05:36:12] [INFO] local node's last receive lsn: 0/7006E58
[2019-05-17 05:36:12] [INFO] checking state of sibling node "node3" (ID: 3)
[2019-05-17 05:36:12] [INFO] node "node3" (ID: 3) reports its upstream is node 1, last seen 1 second(s) ago
[2019-05-17 05:36:12] [NOTICE] node 3 last saw primary node 1 second(s) ago, considering primary still visible
[2019-05-17 05:36:12] [INFO] last receive LSN for sibling node "node3" (ID: 3) is: 0/7006E58
[2019-05-17 05:36:12] [INFO] node "node3" (ID: 3) has same LSN as current candidate "node2" (ID: 2)
[2019-05-17 05:36:12] [INFO] checking state of sibling node "node4" (ID: 4)
[2019-05-17 05:36:12] [INFO] node "node4" (ID: 4) reports its upstream is node 1, last seen 0 second(s) ago
[2019-05-17 05:36:12] [NOTICE] node 4 last saw primary node 0 second(s) ago, considering primary still visible
[2019-05-17 05:36:12] [INFO] last receive LSN for sibling node "node4" (ID: 4) is: 0/7006E58
[2019-05-17 05:36:12] [INFO] node "node4" (ID: 4) has same LSN as current candidate "node2" (ID: 2)
[2019-05-17 05:36:12] [INFO] 2 nodes can see the primary
[2019-05-17 05:36:12] [DETAIL] following nodes can see the primary:
- node "node3" (ID: 3): 1 second(s) ago
- node "node4" (ID: 4): 0 second(s) ago
[2019-05-17 05:36:12] [NOTICE] cancelling failover as some nodes can still see the primary
[2019-05-17 05:36:12] [NOTICE] election cancelled
[2019-05-17 05:36:14] [INFO] node "node2" (ID: 2) monitoring upstream node "node1" (ID: 1) in degraded state
In this situation it will cancel the failover and enter degraded monitoring mode,
waiting for the primary to reappear.
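In outline, the consensus check reduces to polling each sibling for when it last saw the primary and cancelling the failover if any sighting is recent. A simplified sketch (the function and the threshold value are illustrative, not repmgr internals):

```shell
# primary_recently_seen: succeed (exit 0) if any sibling reports having
# seen the primary within the threshold number of seconds.
primary_recently_seen() {
    threshold="$1"; shift
    for last_seen in "$@"; do
        if [ "$last_seen" -le "$threshold" ]; then
            return 0
        fi
    done
    return 1
}

# siblings last saw the primary 1 and 0 seconds ago:
if primary_recently_seen 5 1 0; then
    echo "cancelling failover as some nodes can still see the primary"
fi
```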
Standby disconnection on failover
If standby_disconnect_on_failover is set to true in
repmgr.conf, in a failover situation &repmgrd; will forcibly disconnect
the local node's WAL receiver, and wait for the WAL receiver on all sibling nodes to be
disconnected, before making a failover decision.
standby_disconnect_on_failover is available with PostgreSQL 9.5 and later.
Until PostgreSQL 14 this requires that the repmgr database user is a superuser.
From PostgreSQL 15 a specific ALTER SYSTEM privilege can be granted to the repmgr database
user with e.g. GRANT ALTER SYSTEM ON PARAMETER wal_retrieve_retry_interval TO repmgr.
By doing this, it's possible to ensure that, at the point the failover decision is made, no nodes
are receiving data from the primary and their LSN location will be static.
standby_disconnect_on_failover must be set to the same value on
all nodes.
Note that when using standby_disconnect_on_failover there will be a delay of 5 seconds
plus however many seconds it takes to confirm the WAL receiver is disconnected before
&repmgrd; proceeds with the failover decision.
&repmgrd; will wait up to sibling_nodes_disconnect_timeout seconds (default:
30) to confirm that the WAL receiver on all sibling nodes has been
disconnected before proceeding with the failover operation. If the timeout is reached, the
failover operation will go ahead anyway.
Following the failover operation, no matter what the outcome, each node will reconnect its WAL receiver.
If using standby_disconnect_on_failover, we recommend that the
primary_visibility_consensus option is also used.
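Taken together, a repmgr.conf fragment enabling this behaviour might look like the following (the timeout shown is the default):

sibling_nodes_disconnect_timeout=30
primary_visibility_consensus=true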
Failover validation
From repmgr 4.3, &repmgr; makes it possible to provide a script
to &repmgrd; which, in a failover situation,
will be executed by the promotion candidate (the node which has been selected
to be the new primary) to confirm whether the node should actually be promoted.
To use this, in repmgr.conf set failover_validation_command
to a script executable by the postgres system user, e.g.:
failover_validation_command=/path/to/script.sh %n
The %n parameter will be replaced with the node ID when the script is
executed. A number of other parameters are also available; see the
configuration file reference for details.
This script must return an exit code of 0 to indicate the node should promote itself.
Any other value will result in the promotion being aborted and the election rerun.
There is a pause of election_rerun_interval seconds (default: 15) before the election is rerun.
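A minimal failover_validation_command script might look like the following sketch; the flag-file check stands in for whatever site-specific promotion criteria you need, and the flag path is illustrative:

```shell
# Receives the candidate node ID (%n) as its first argument.
validate_promotion() {
    node_id="$1"
    echo "Node ID: ${node_id}"
    # Refuse promotion while a maintenance flag file is present:
    if [ -f "${NO_PROMOTE_FLAG:-/tmp/repmgr_no_promote}" ]; then
        return 1    # non-zero: promotion aborted, election rerun
    fi
    return 0        # zero: the candidate may promote itself
}

validate_promotion 2
```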
Sample &repmgrd; log file output during which the failover validation
script rejects the proposed promotion candidate:
[2019-03-13 21:01:30] [INFO] visible nodes: 2; total nodes: 2; no nodes have seen the primary within the last 4 seconds
[2019-03-13 21:01:30] [NOTICE] promotion candidate is "node2" (ID: 2)
[2019-03-13 21:01:30] [NOTICE] executing "failover_validation_command"
[2019-03-13 21:01:30] [DETAIL] /usr/local/bin/failover-validation.sh 2
[2019-03-13 21:01:30] [INFO] output returned by failover validation command:
Node ID: 2
[2019-03-13 21:01:30] [NOTICE] failover validation command returned a non-zero value: "1"
[2019-03-13 21:01:30] [NOTICE] promotion candidate election will be rerun
[2019-03-13 21:01:30] [INFO] 1 followers to notify
[2019-03-13 21:01:30] [NOTICE] notifying node "node3" (ID: 3) to rerun promotion candidate selection
INFO: node 3 received notification to rerun promotion candidate election
[2019-03-13 21:01:30] [NOTICE] rerunning election after 15 seconds ("election_rerun_interval")
repmgrd and cascading replication
Cascading replication - where a standby can connect to an upstream node and not
the primary server itself - was introduced in PostgreSQL 9.2. &repmgr; and
&repmgrd; support cascading replication by keeping track of the relationship
between standby servers - each node record is stored with the node id of its
upstream ("parent") server (except of course the primary server).
In a failover situation where the primary node fails and a top-level standby
is promoted, a standby connected to another standby will not be affected
and will continue working as normal (even if the upstream standby it's connected
to becomes the primary node). If however the node's direct upstream fails,
the "cascaded standby" will attempt to reconnect to that node's parent
(unless failover is set to manual in
repmgr.conf).
Monitoring standby disconnections on the primary node
This functionality is available in &repmgr; 4.4 and later.
When running on the primary node, &repmgrd; can
monitor connections and in particular disconnections by its attached
child nodes (standbys, and if in use, the witness server), and optionally
execute a custom command if certain criteria are met (such as the number of
attached nodes falling to zero following a failover to a new primary); this
command can be used for example to "fence" the node and ensure it
is isolated from any applications attempting to access the replication cluster.
Currently &repmgrd; can only detect disconnections
of streaming replication standbys and cannot determine whether a standby
has disconnected and fallen back to archive recovery.
See the caveats section below.
Standby disconnections monitoring process and criteria
&repmgrd; monitors attached child nodes and decides
whether to invoke the user-defined command based on the following process
and criteria:
Every few seconds (defined by the configuration parameter child_nodes_check_interval;
default: 5 seconds; a value of 0 disables this check altogether), &repmgrd; queries
the pg_stat_replication system view and compares
the nodes present there against the list of nodes registered with &repmgr; which
should be attached to the primary.
If a witness server is in use, &repmgrd; connects to it and checks which upstream node
it is following.
If a child node (standby) is no longer present in pg_stat_replication,
&repmgrd; notes the time it detected the node's absence, and additionally generates a
child_node_disconnect event.
If a witness server is in use, and it is no longer following the primary, or not
reachable at all, &repmgrd; notes the time it detected the node's absence, and additionally generates a
child_node_disconnect event.
If a child node (standby) which was absent from pg_stat_replication reappears,
&repmgrd; clears the time it detected the node's absence, and additionally generates a
child_node_reconnect event.
If a witness server is in use which was previously unreachable or not following the
primary node, and it becomes reachable and follows the primary node again, &repmgrd; clears the
time it detected the node's absence, and additionally generates a
child_node_reconnect event.
If an entirely new child node (standby or witness) is detected, &repmgrd; adds it to its internal list
and additionally generates a child_node_new_connect event.
If the child_nodes_disconnect_command parameter is set in
repmgr.conf, &repmgrd; will then loop through all child nodes.
If it determines that insufficient child nodes are connected, and a
minimum of child_nodes_disconnect_timeout seconds (default: 30)
has elapsed since the last node became disconnected, &repmgrd; will then execute the
child_nodes_disconnect_command script.
By default, the child_nodes_disconnect_command will only be executed
if all child nodes are disconnected. If child_nodes_connected_min_count
is set, the child_nodes_disconnect_command script will be triggered
if the number of connected child nodes falls below the specified value (e.g.
if set to 2, the script will be triggered if only one child node
is connected). Alternatively, if child_nodes_disconnect_min_count is set
and more than that number of child nodes disconnect, the script will be triggered.
By default, a witness node, if in use, will not be counted as a
child node for the purposes of determining whether to execute
child_nodes_disconnect_command.
To enable the witness node to be counted as a child node, set
child_nodes_connected_include_witness in repmgr.conf
to true
(and reload the configuration if &repmgrd;
is running).
Note that child nodes which are not attached when &repmgrd;
starts will not be considered as missing, as &repmgrd;
cannot know why they are not attached.
Standby disconnections monitoring process example
This example shows typical &repmgrd; log output from a three-node cluster
(primary and two child nodes), with child_nodes_connected_min_count
set to 2.
&repmgrd; on the primary has started up, while two child
nodes are being provisioned:
[2019-04-24 15:25:33] [INFO] monitoring primary node "node1" (ID: 1) in normal state
[2019-04-24 15:25:35] [NOTICE] new node "node2" (ID: 2) has connected
[2019-04-24 15:25:35] [NOTICE] 1 (of 1) child nodes are connected, but at least 2 child nodes required
[2019-04-24 15:25:35] [INFO] no child nodes have detached since repmgrd startup
(...)
[2019-04-24 15:25:44] [NOTICE] new node "node3" (ID: 3) has connected
[2019-04-24 15:25:46] [INFO] monitoring primary node "node1" (ID: 1) in normal state
(...)
One of the child nodes has disconnected; &repmgrd;
is now waiting child_nodes_disconnect_timeout seconds
before executing child_nodes_disconnect_command:
[2019-04-24 15:28:11] [INFO] monitoring primary node "node1" (ID: 1) in normal state
[2019-04-24 15:28:17] [INFO] monitoring primary node "node1" (ID: 1) in normal state
[2019-04-24 15:28:19] [NOTICE] node "node3" (ID: 3) has disconnected
[2019-04-24 15:28:19] [NOTICE] 1 (of 2) child nodes are connected, but at least 2 child nodes required
[2019-04-24 15:28:19] [INFO] most recently detached child node was 3 (ca. 0 seconds ago), not triggering "child_nodes_disconnect_command"
[2019-04-24 15:28:19] [DETAIL] "child_nodes_disconnect_timeout" set to 30 seconds
(...)
child_nodes_disconnect_command is executed once:
[2019-04-24 15:28:49] [INFO] most recently detached child node was 3 (ca. 30 seconds ago), triggering "child_nodes_disconnect_command"
[2019-04-24 15:28:49] [INFO] "child_nodes_disconnect_command" is:
"/usr/bin/fence-all-the-things.sh"
[2019-04-24 15:28:51] [NOTICE] 1 (of 2) child nodes are connected, but at least 2 child nodes required
[2019-04-24 15:28:51] [INFO] "child_nodes_disconnect_command" was previously executed, taking no action
Standby disconnections monitoring caveats
The following caveats should be considered if you are intending to use this functionality.
If a child node is configured to use archive recovery, it's possible that
the child node will disconnect from the primary node and fall back to
archive recovery. In this case &repmgrd;
will nevertheless register a node disconnection.
&repmgr; relies on application_name in the child node's
primary_conninfo string to be the same as the node name
defined in the node's repmgr.conf file. Furthermore,
this application_name must be unique across the replication
cluster.
If a custom application_name is used, or the
application_name is not unique across the replication
cluster, &repmgr; will not be able to reliably monitor child node connections.
Standby disconnections monitoring process configuration
The following parameters, set in repmgr.conf,
control how child node disconnection monitoring operates.
child_nodes_check_interval
Interval (in seconds) after which &repmgrd; queries the
pg_stat_replication system view and compares the nodes present
there against the list of nodes registered with repmgr which should be attached to the primary.
Default is 5 seconds, a value of 0 disables this check
altogether.
child_nodes_disconnect_command
User-definable script to be executed when &repmgrd;
determines that an insufficient number of child nodes are connected. By default
the script is executed when no child nodes are connected, but the execution
threshold can be modified by setting one of child_nodes_connected_min_count
or child_nodes_disconnect_min_count (see below).
The child_nodes_disconnect_command script can be
any user-defined script or program. It must be able
to be executed by the system user under which the PostgreSQL server itself
runs (usually postgres).
If child_nodes_disconnect_command is not set, no action
will be taken.
When executing child_nodes_disconnect_command, a format placeholder, if specified,
will be substituted with the ID of the node executing the script.
The child_nodes_disconnect_command script will only be executed once
while the criteria for its execution are met. If the criteria for its execution are no longer
met (i.e. some child nodes have reconnected), it will be executed again if
the criteria for its execution are met again.
The child_nodes_disconnect_command script will not be executed if
&repmgrd; is paused.
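A child_nodes_disconnect_command script could be as simple as the following sketch, which records a fencing flag that external tooling (a load balancer health check, for example) can act on; the mechanism and paths are illustrative:

```shell
# fence_primary: write a flag file that health checks can observe, so
# application traffic stops being routed to this (possibly stale) primary.
fence_primary() {
    flag="$1"
    touch "$flag" && echo "primary fenced: $flag"
}

fence_primary "$(mktemp -u /tmp/primary_fenced.XXXXXX)"
```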
child_nodes_disconnect_timeout
If &repmgrd; determines that an insufficient number of
child nodes are connected, it will wait for the specified number of seconds
to execute the child_nodes_disconnect_command.
Default: 30 seconds.
child_nodes_connected_min_count
If the number of child nodes connected falls below the number specified in
this parameter, the child_nodes_disconnect_command script
will be executed.
For example, if child_nodes_connected_min_count is set
to 2, the child_nodes_disconnect_command
script will be executed if one or no child nodes are connected.
Note that child_nodes_connected_min_count overrides any value
set in child_nodes_disconnect_min_count.
If neither child_nodes_connected_min_count nor
child_nodes_disconnect_min_count is set,
the child_nodes_disconnect_command script
will be executed when no child nodes are connected.
A witness node, if in use, will not be counted as a child node unless
child_nodes_connected_include_witness is set to true.
child_nodes_disconnect_min_count
If the number of disconnected child nodes exceeds the number specified in
this parameter, the child_nodes_disconnect_command script
will be executed.
For example, if child_nodes_disconnect_min_count is set
to 2, the child_nodes_disconnect_command
script will be executed if more than two child nodes are disconnected.
Note that any value set in child_nodes_disconnect_min_count
will be overridden by child_nodes_connected_min_count.
If neither child_nodes_connected_min_count nor
child_nodes_disconnect_min_count is set,
the child_nodes_disconnect_command script
will be executed when no child nodes are connected.
A witness node, if in use, will not be counted as a child node unless
child_nodes_connected_include_witness is set to true.
child_nodes_connected_include_witness
Whether to count the witness node (if in use) as a child node when
determining whether to execute child_nodes_disconnect_command.
Defaults to false.
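For reference, a repmgr.conf fragment combining the parameters described above (the script path is illustrative):

child_nodes_check_interval=5
child_nodes_disconnect_command='/path/to/fence-script.sh'
child_nodes_disconnect_timeout=30
child_nodes_connected_min_count=2
child_nodes_connected_include_witness=true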
Standby disconnections monitoring process event notifications
The following event notifications may be generated:
child_node_disconnect
This event is generated after &repmgrd;
detects that a child node is no longer streaming from the primary node.
Example:
$ repmgr cluster event --event=child_node_disconnect
Node ID | Name | Event | OK | Timestamp | Details
---------+-------+-----------------------+----+---------------------+--------------------------------------------
1 | node1 | child_node_disconnect | t  | 2019-04-24 12:41:36 | node "node3" (ID: 3) has disconnected
child_node_reconnect
This event is generated after &repmgrd;
detects that a child node has resumed streaming from the primary node.
Example:
$ repmgr cluster event --event=child_node_reconnect
Node ID | Name | Event | OK | Timestamp | Details
---------+-------+----------------------+----+---------------------+------------------------------------------------------------
1 | node1 | child_node_reconnect | t  | 2019-04-24 12:42:19 | node "node3" (ID: 3) has reconnected after 42 seconds
child_node_new_connect
This event is generated after &repmgrd;
detects that a new child node has been registered with &repmgr; and has
connected to the primary.
Example:
$ repmgr cluster event --event=child_node_new_connect
Node ID | Name | Event | OK | Timestamp | Details
---------+-------+------------------------+----+---------------------+---------------------------------------------
1 | node1 | child_node_new_connect | t  | 2019-04-24 12:41:30 | new node "node3" (ID: 3) has connected
child_nodes_disconnect_command
This event is generated after &repmgrd; detects
that sufficient child nodes have been disconnected for a sufficient amount
of time to trigger execution of the child_nodes_disconnect_command.
Example:
$ repmgr cluster event --event=child_nodes_disconnect_command
Node ID | Name | Event | OK | Timestamp | Details
---------+-------+--------------------------------+----+---------------------+--------------------------------------------------------
1 | node1 | child_nodes_disconnect_command | t | 2019-04-24 13:08:17 | "child_nodes_disconnect_command" successfully executed