Automatic failover with repmgrd
&repmgrd; is a management and monitoring daemon which runs
on each node in a replication cluster. It can automate actions such as
failover and updating standbys to follow the new primary, as well as
providing monitoring information about the state of each standby.
Using a witness server
A witness server is a normal PostgreSQL instance which
is not part of the streaming replication cluster; its purpose is, if a
failover situation occurs, to provide proof that it is the primary server
itself which is unavailable, rather than e.g. a network split between
different physical locations.
A typical use case for a witness server is a two-node streaming replication
setup, where the primary and standby are in different locations (data centres).
By creating a witness server in the same location (data centre) as the primary,
if the primary becomes unavailable it's possible for the standby to decide whether
it can promote itself without risking a "split brain" scenario: if it can't see either the
witness or the primary server, it's likely there's a network-level interruption
and it should not promote itself. If it can see the witness but not the primary,
this proves there is no network interruption and the primary itself is unavailable,
and it can therefore promote itself (and ideally take action to fence the
former primary).
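The decision logic described above can be sketched as follows (a simplified illustration only; repmgrd implements this internally, and the function name here is invented):

```shell
# decide_promotion: given whether the standby can currently reach the
# primary and the witness, print the action the standby should take.
decide_promotion() {
    can_see_primary="$1"   # "yes" or "no"
    can_see_witness="$2"   # "yes" or "no"
    if [ "$can_see_primary" = "yes" ]; then
        echo "primary visible: no failover required"
    elif [ "$can_see_witness" = "yes" ]; then
        echo "primary down but witness visible: promote standby"
    else
        echo "neither visible: possible network split, do not promote"
    fi
}

decide_promotion no yes
```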
Never install a witness server on the same physical host
as another node in the replication cluster managed by &repmgr; - it's essential
the witness is not affected in any way by failure of another node.
For more complex replication scenarios, e.g. with multiple datacentres, it may
be preferable to use location-based failover, which ensures that only nodes
in the same location as the primary will ever be promotion candidates;
see the section on handling network splits below for more details.
A witness server will only be useful if &repmgrd;
is in use.
Creating a witness server
To create a witness server, set up a normal PostgreSQL instance on a server
in the same physical location as the cluster's primary server.
This instance should not be on the same physical host as the primary server,
as otherwise if the primary server fails due to hardware issues, the witness
server will be lost too.
A PostgreSQL instance can only accommodate a single witness server.
If you are planning to use a single server to support more than one
witness server, a separate PostgreSQL instance is required for each
witness server in use.
The witness server should be configured in the same way as a normal
&repmgr; node; see the configuration section for details.
Register the witness server with repmgr witness register.
This will create the &repmgr; extension on the witness server, and make
a copy of the &repmgr; metadata.
As the witness server is not part of the replication cluster, further
changes to the &repmgr; metadata will be synchronised by
&repmgrd;.
Once the witness server has been configured, &repmgrd;
should be started.
To unregister a witness server, use repmgr witness unregister.
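Assuming repmgr.conf is already in place on the witness host, the steps above might look like this (hostnames and paths are illustrative, and a live cluster is required):

```shell
# Register the witness, supplying the connection details of the current
# primary (here assumed to be reachable as "node1"):
repmgr -f /etc/repmgr.conf witness register -h node1

# Start the daemon on the witness:
repmgrd -f /etc/repmgr.conf

# Later, to unregister the witness:
repmgr -f /etc/repmgr.conf witness unregister
```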
Handling network splits with repmgrd
A common pattern for replication cluster setups is to spread servers over
more than one datacentre. This can provide benefits such as geographically
distributed read replicas and DR (disaster recovery) capability. However
this also means there is a risk of disconnection at network level between
datacentre locations, which would result in a split-brain scenario if
servers in a secondary data centre were no longer able to see the primary
in the main data centre and promoted a standby among themselves.
&repmgr; enables provision of a "witness server" to
artificially create a quorum of servers in a particular location, ensuring
that nodes in another location will not elect a new primary if they
are unable to see the majority of nodes. However this approach does not
scale well, particularly with more complex replication setups, e.g.
where the majority of nodes are located outside of the primary datacentre.
It also means the witness node needs to be managed as an
extra PostgreSQL instance outside of the main replication cluster, which
adds administrative and programming complexity.
&repmgr; 4 introduces the concept of location:
each node is associated with an arbitrary location string (default is
default); this is set in repmgr.conf, e.g.:
node_id=1
node_name=node1
conninfo='host=node1 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/postgresql/data'
location='dc1'
In a failover situation, &repmgrd; will check if any servers in the
same location as the current primary node are visible. If not, &repmgrd;
will assume a network interruption and not promote any node in any
other location (it will however enter degraded monitoring
mode until a primary becomes visible).
Primary visibility consensus
In more complex replication setups, particularly where replication occurs between
multiple datacentres, it's possible that some but not all standbys get cut off from the
primary (but not from the other standbys).
In this situation, normally it's not desirable for any of the standbys which have been
cut off to initiate a failover, as the primary is still functioning and standbys are
connected. Beginning with &repmgr; 4.4,
it is possible for the affected standbys to build a consensus about whether
the primary is still available to some standbys ("primary visibility consensus").
This is done by polling each standby (and the witness, if present) for the time it last saw the
primary; if any have seen the primary very recently, it's reasonable
to infer that the primary is still available and a failover should not be started.
The time the primary was last seen by each node can be checked by executing
repmgr service status
(&repmgr; 4.2 - 4.4: repmgr daemon status)
which includes this in its output, e.g.:
$ repmgr -f /etc/repmgr.conf service status
ID | Name | Role | Status | Upstream | repmgrd | PID | Paused? | Upstream last seen
----+-------+---------+-----------+----------+---------+-------+---------+--------------------
1 | node1 | primary | * running | | running | 27259 | no | n/a
2 | node2 | standby | running | node1 | running | 27272 | no | 1 second(s) ago
3 | node3 | standby | running | node1 | running | 27282 | no | 0 second(s) ago
4 | node4 | witness | * running | node1 | running | 27298 | no | 1 second(s) ago
To enable this functionality, in repmgr.conf set:
primary_visibility_consensus=true
This parameter must be set to true on all nodes for it to be effective.
The following sample &repmgrd; log output demonstrates the behaviour in a situation
where one of three standbys is no longer able to connect to the primary, but can
connect to the two other standbys ("sibling nodes"):
[2019-05-17 05:36:12] [WARNING] unable to reconnect to node 1 after 3 attempts
[2019-05-17 05:36:12] [INFO] 2 active sibling nodes registered
[2019-05-17 05:36:12] [INFO] local node's last receive lsn: 0/7006E58
[2019-05-17 05:36:12] [INFO] checking state of sibling node "node3" (ID: 3)
[2019-05-17 05:36:12] [INFO] node "node3" (ID: 3) reports its upstream is node 1, last seen 1 second(s) ago
[2019-05-17 05:36:12] [NOTICE] node 3 last saw primary node 1 second(s) ago, considering primary still visible
[2019-05-17 05:36:12] [INFO] last receive LSN for sibling node "node3" (ID: 3) is: 0/7006E58
[2019-05-17 05:36:12] [INFO] node "node3" (ID: 3) has same LSN as current candidate "node2" (ID: 2)
[2019-05-17 05:36:12] [INFO] checking state of sibling node "node4" (ID: 4)
[2019-05-17 05:36:12] [INFO] node "node4" (ID: 4) reports its upstream is node 1, last seen 0 second(s) ago
[2019-05-17 05:36:12] [NOTICE] node 4 last saw primary node 0 second(s) ago, considering primary still visible
[2019-05-17 05:36:12] [INFO] last receive LSN for sibling node "node4" (ID: 4) is: 0/7006E58
[2019-05-17 05:36:12] [INFO] node "node4" (ID: 4) has same LSN as current candidate "node2" (ID: 2)
[2019-05-17 05:36:12] [INFO] 2 nodes can see the primary
[2019-05-17 05:36:12] [DETAIL] following nodes can see the primary:
- node "node3" (ID: 3): 1 second(s) ago
- node "node4" (ID: 4): 0 second(s) ago
[2019-05-17 05:36:12] [NOTICE] cancelling failover as some nodes can still see the primary
[2019-05-17 05:36:12] [NOTICE] election cancelled
[2019-05-17 05:36:14] [INFO] node "node2" (ID: 2) monitoring upstream node "node1" (ID: 1) in degraded state
In this situation it will cancel the failover and enter degraded monitoring mode,
waiting for the primary to reappear.
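In outline, the consensus check reduces to polling each sibling for when it last saw the primary and cancelling the failover if any sighting is recent. A simplified sketch (the function and the threshold value are illustrative, not repmgr internals):

```shell
# primary_recently_seen: succeed (exit 0) if any sibling reports having
# seen the primary within the threshold number of seconds.
primary_recently_seen() {
    threshold="$1"; shift
    for last_seen in "$@"; do
        if [ "$last_seen" -le "$threshold" ]; then
            return 0
        fi
    done
    return 1
}

# siblings last saw the primary 1 and 0 seconds ago:
if primary_recently_seen 5 1 0; then
    echo "cancelling failover as some nodes can still see the primary"
fi
```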
Standby disconnection on failover
If standby_disconnect_on_failover is set to true in
repmgr.conf, in a failover situation &repmgrd; will forcibly disconnect
the local node's WAL receiver, and wait for the WAL receiver on all sibling nodes to be
disconnected, before making a failover decision.
standby_disconnect_on_failover is available with PostgreSQL 9.5 and later.
Until PostgreSQL 14 this requires that the repmgr database user is a superuser.
From PostgreSQL 15 a specific ALTER SYSTEM privilege can be granted to the repmgr database
user with e.g. GRANT ALTER SYSTEM ON PARAMETER wal_retrieve_retry_interval TO repmgr.
By doing this, it's possible to ensure that, at the point the failover decision is made, no nodes
are receiving data from the primary and their LSN location will be static.
standby_disconnect_on_failover must be set to the same value on
all nodes.
Note that when using standby_disconnect_on_failover there will be a delay of 5 seconds
plus however many seconds it takes to confirm the WAL receiver is disconnected before
&repmgrd; proceeds with the failover decision.
&repmgrd; will wait up to sibling_nodes_disconnect_timeout seconds (default:
30) to confirm that the WAL receiver on all sibling nodes has been
disconnected before proceeding with the failover operation. If the timeout is reached, the
failover operation will go ahead anyway.
Following the failover operation, no matter what the outcome, each node will reconnect its WAL receiver.
If using standby_disconnect_on_failover, we recommend that the
primary_visibility_consensus option is also used.
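Taken together, a repmgr.conf fragment enabling this behaviour might look like the following (the timeout shown is the default):

sibling_nodes_disconnect_timeout=30
primary_visibility_consensus=true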
Failover validation
From repmgr 4.3, &repmgr; makes it possible to provide a script
to &repmgrd; which, in a failover situation,
will be executed by the promotion candidate (the node which has been selected
to be the new primary) to confirm whether the node should actually be promoted.
To use this, in repmgr.conf set failover_validation_command
to a script executable by the postgres system user, e.g.:
failover_validation_command=/path/to/script.sh %n
The %n parameter will be replaced with the node ID when the script is
executed. A number of other parameters are also available; see the
configuration file reference for details.
This script must return an exit code of 0 to indicate the node should promote itself.
Any other value will result in the promotion being aborted and the election rerun.
There is a pause of election_rerun_interval seconds (default: 15) before the election is rerun.
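A minimal failover_validation_command script might look like the following sketch; the flag-file check stands in for whatever site-specific promotion criteria you need, and the flag path is illustrative:

```shell
# Receives the candidate node ID (%n) as its first argument.
validate_promotion() {
    node_id="$1"
    echo "Node ID: ${node_id}"
    # Refuse promotion while a maintenance flag file is present:
    if [ -f "${NO_PROMOTE_FLAG:-/tmp/repmgr_no_promote}" ]; then
        return 1    # non-zero: promotion aborted, election rerun
    fi
    return 0        # zero: the candidate may promote itself
}

validate_promotion 2
```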
Sample &repmgrd; log file output during which the failover validation
script rejects the proposed promotion candidate:
[2019-03-13 21:01:30] [INFO] visible nodes: 2; total nodes: 2; no nodes have seen the primary within the last 4 seconds
[2019-03-13 21:01:30] [NOTICE] promotion candidate is "node2" (ID: 2)
[2019-03-13 21:01:30] [NOTICE] executing "failover_validation_command"
[2019-03-13 21:01:30] [DETAIL] /usr/local/bin/failover-validation.sh 2
[2019-03-13 21:01:30] [INFO] output returned by failover validation command:
Node ID: 2
[2019-03-13 21:01:30] [NOTICE] failover validation command returned a non-zero value: "1"
[2019-03-13 21:01:30] [NOTICE] promotion candidate election will be rerun
[2019-03-13 21:01:30] [INFO] 1 followers to notify
[2019-03-13 21:01:30] [NOTICE] notifying node "node3" (ID: 3) to rerun promotion candidate selection
INFO: node 3 received notification to rerun promotion candidate election
[2019-03-13 21:01:30] [NOTICE] rerunning election after 15 seconds ("election_rerun_interval")
repmgrd and cascading replication
Cascading replication - where a standby can connect to an upstream node and not
the primary server itself - was introduced in PostgreSQL 9.2. &repmgr; and
&repmgrd; support cascading replication by keeping track of the relationship
between standby servers - each node record is stored with the node id of its
upstream ("parent") server (except of course the primary server).
In a failover situation where the primary node fails and a top-level standby
is promoted, a standby connected to another standby will not be affected
and will continue working as normal (even if the upstream standby it's connected
to becomes the primary node). If however the node's direct upstream fails,
the "cascaded standby" will attempt to reconnect to that node's parent
(unless failover is set to manual in
repmgr.conf).
Monitoring standby disconnections on the primary node
This functionality is available in &repmgr; 4.4 and later.
When running on the primary node, &repmgrd; can
monitor connections and in particular disconnections by its attached
child nodes (standbys, and if in use, the witness server), and optionally
execute a custom command if certain criteria are met (such as the number of
attached nodes falling to zero following a failover to a new primary); this
command can be used for example to "fence" the node and ensure it
is isolated from any applications attempting to access the replication cluster.
Currently &repmgrd; can only detect disconnections
of streaming replication standbys and cannot determine whether a standby
has disconnected and fallen back to archive recovery.
See the caveats section below.
Standby disconnections monitoring process and criteria
&repmgrd; monitors attached child nodes and decides
whether to invoke the user-defined command based on the following process
and criteria:
Every few seconds (defined by the configuration parameter child_nodes_check_interval;
default: 5 seconds; a value of 0 disables this check altogether), &repmgrd; queries
the pg_stat_replication system view and compares
the nodes present there against the list of nodes registered with &repmgr; which
should be attached to the primary.
If a witness server is in use, &repmgrd; connects to it and checks which upstream node
it is following.
If a child node (standby) is no longer present in pg_stat_replication,
&repmgrd; notes the time it detected the node's absence, and additionally generates a
child_node_disconnect event.
If a witness server is in use, and it is no longer following the primary, or not
reachable at all, &repmgrd; notes the time it detected the node's absence, and additionally generates a
child_node_disconnect event.
If a child node (standby) which was absent from pg_stat_replication reappears,
&repmgrd; clears the time it detected the node's absence, and additionally generates a
child_node_reconnect event.
If a witness server is in use which was previously unreachable or not following the
primary node, and it becomes reachable and follows the primary node again, &repmgrd; clears the
time it detected the node's absence, and additionally generates a
child_node_reconnect event.
If an entirely new child node (standby or witness) is detected, &repmgrd; adds it to its internal list
and additionally generates a child_node_new_connect event.
If the child_nodes_disconnect_command parameter is set in
repmgr.conf, &repmgrd; will then loop through all child nodes.
If it determines that insufficient child nodes are connected, and a
minimum of child_nodes_disconnect_timeout seconds (default: 30)
has elapsed since the last node became disconnected, &repmgrd; will then execute the
child_nodes_disconnect_command script.
By default, the child_nodes_disconnect_command will only be executed
if all child nodes are disconnected. If child_nodes_connected_min_count
is set, the child_nodes_disconnect_command script will be triggered
if the number of connected child nodes falls below the specified value (e.g.
if set to 2, the script will be triggered if only one child node
is connected). Alternatively, if child_nodes_disconnect_min_count is set
and more than that number of child nodes disconnect, the script will be triggered.
By default, a witness node, if in use, will not be counted as a
child node for the purposes of determining whether to execute
child_nodes_disconnect_command.
To enable the witness node to be counted as a child node, set
child_nodes_connected_include_witness in repmgr.conf
to true
(and reload the configuration if &repmgrd;
is running).
Note that child nodes which are not attached when &repmgrd;
starts will not be considered as missing, as &repmgrd;
cannot know why they are not attached.
Standby disconnections monitoring process example
This example shows typical &repmgrd; log output from a three-node cluster
(primary and two child nodes), with child_nodes_connected_min_count
set to 2.
&repmgrd; on the primary has started up, while two child
nodes are being provisioned:
[2019-04-24 15:25:33] [INFO] monitoring primary node "node1" (ID: 1) in normal state
[2019-04-24 15:25:35] [NOTICE] new node "node2" (ID: 2) has connected
[2019-04-24 15:25:35] [NOTICE] 1 (of 1) child nodes are connected, but at least 2 child nodes required
[2019-04-24 15:25:35] [INFO] no child nodes have detached since repmgrd startup
(...)
[2019-04-24 15:25:44] [NOTICE] new node "node3" (ID: 3) has connected
[2019-04-24 15:25:46] [INFO] monitoring primary node "node1" (ID: 1) in normal state
(...)
One of the child nodes has disconnected; &repmgrd;
is now waiting child_nodes_disconnect_timeout seconds
before executing child_nodes_disconnect_command:
[2019-04-24 15:28:11] [INFO] monitoring primary node "node1" (ID: 1) in normal state
[2019-04-24 15:28:17] [INFO] monitoring primary node "node1" (ID: 1) in normal state
[2019-04-24 15:28:19] [NOTICE] node "node3" (ID: 3) has disconnected
[2019-04-24 15:28:19] [NOTICE] 1 (of 2) child nodes are connected, but at least 2 child nodes required
[2019-04-24 15:28:19] [INFO] most recently detached child node was 3 (ca. 0 seconds ago), not triggering "child_nodes_disconnect_command"
[2019-04-24 15:28:19] [DETAIL] "child_nodes_disconnect_timeout" set to 30 seconds
(...)
child_nodes_disconnect_command is executed once:
[2019-04-24 15:28:49] [INFO] most recently detached child node was 3 (ca. 30 seconds ago), triggering "child_nodes_disconnect_command"
[2019-04-24 15:28:49] [INFO] "child_nodes_disconnect_command" is:
"/usr/bin/fence-all-the-things.sh"
[2019-04-24 15:28:51] [NOTICE] 1 (of 2) child nodes are connected, but at least 2 child nodes required
[2019-04-24 15:28:51] [INFO] "child_nodes_disconnect_command" was previously executed, taking no action
Standby disconnections monitoring caveats
The following caveats should be considered if you are intending to use this functionality.
If a child node is configured to use archive recovery, it's possible that
the child node will disconnect from the primary node and fall back to
archive recovery. In this case &repmgrd;
will nevertheless register a node disconnection.
&repmgr; relies on application_name in the child node's
primary_conninfo string to be the same as the node name
defined in the node's repmgr.conf file. Furthermore,
this application_name must be unique across the replication
cluster.
If a custom application_name is used, or the
application_name is not unique across the replication
cluster, &repmgr; will not be able to reliably monitor child node connections.
Standby disconnections monitoring process configuration
The following parameters, set in repmgr.conf,
control how child node disconnection monitoring operates.
child_nodes_check_interval
Interval (in seconds) after which &repmgrd; queries the
pg_stat_replication system view and compares the nodes present
there against the list of nodes registered with repmgr which should be attached to the primary.
Default is 5 seconds, a value of 0 disables this check
altogether.
child_nodes_disconnect_command
User-definable script to be executed when &repmgrd;
determines that an insufficient number of child nodes are connected. By default
the script is executed when no child nodes are connected, but the execution
threshold can be modified by setting one of child_nodes_connected_min_count
or child_nodes_disconnect_min_count (see below).
The child_nodes_disconnect_command script can be
any user-defined script or program. It must be able
to be executed by the system user under which the PostgreSQL server itself
runs (usually postgres).
If child_nodes_disconnect_command is not set, no action
will be taken.
When executing child_nodes_disconnect_command, a format placeholder, if specified,
will be substituted with the ID of the node executing the script.
The child_nodes_disconnect_command script will only be executed once
while the criteria for its execution are met. If the criteria for its execution are no longer
met (i.e. some child nodes have reconnected), it will be executed again if
the criteria for its execution are met again.
The child_nodes_disconnect_command script will not be executed if
&repmgrd; is paused.
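A child_nodes_disconnect_command script could be as simple as the following sketch, which records a fencing flag that external tooling (a load balancer health check, for example) can act on; the mechanism and paths are illustrative:

```shell
# fence_primary: write a flag file that health checks can observe, so
# application traffic stops being routed to this (possibly stale) primary.
fence_primary() {
    flag="$1"
    touch "$flag" && echo "primary fenced: $flag"
}

fence_primary "$(mktemp -u /tmp/primary_fenced.XXXXXX)"
```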
child_nodes_disconnect_timeout
If &repmgrd; determines that an insufficient number of
child nodes are connected, it will wait for the specified number of seconds
to execute the child_nodes_disconnect_command.
Default: 30 seconds.
child_nodes_connected_min_count
If the number of child nodes connected falls below the number specified in
this parameter, the child_nodes_disconnect_command script
will be executed.
For example, if child_nodes_connected_min_count is set
to 2, the child_nodes_disconnect_command
script will be executed if one or no child nodes are connected.
Note that child_nodes_connected_min_count overrides any value
set in child_nodes_disconnect_min_count.
If neither child_nodes_connected_min_count nor
child_nodes_disconnect_min_count is set,
the child_nodes_disconnect_command script
will be executed when no child nodes are connected.
A witness node, if in use, will not be counted as a child node unless
child_nodes_connected_include_witness is set to true.
child_nodes_disconnect_min_count
If the number of disconnected child nodes exceeds the number specified in
this parameter, the child_nodes_disconnect_command script
will be executed.
For example, if child_nodes_disconnect_min_count is set
to 2, the child_nodes_disconnect_command
script will be executed if more than two child nodes are disconnected.
Note that any value set in child_nodes_disconnect_min_count
will be overridden by child_nodes_connected_min_count.
If neither child_nodes_connected_min_count nor
child_nodes_disconnect_min_count is set,
the child_nodes_disconnect_command script
will be executed when no child nodes are connected.
A witness node, if in use, will not be counted as a child node unless
child_nodes_connected_include_witness is set to true.
child_nodes_connected_include_witness
Whether to count the witness node (if in use) as a child node when
determining whether to execute child_nodes_disconnect_command.
Defaults to false.
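For reference, a repmgr.conf fragment combining the parameters described above (the script path is illustrative):

child_nodes_check_interval=5
child_nodes_disconnect_command='/path/to/fence-script.sh'
child_nodes_disconnect_timeout=30
child_nodes_connected_min_count=2
child_nodes_connected_include_witness=true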
Standby disconnections monitoring process event notifications
The following event notifications may be generated:
child_node_disconnect
This event is generated after &repmgrd;
detects that a child node is no longer streaming from the primary node.
Example:
$ repmgr cluster event --event=child_node_disconnect
Node ID | Name | Event | OK | Timestamp | Details
---------+-------+-----------------------+----+---------------------+--------------------------------------------
1 | node1 | child_node_disconnect | t  | 2019-04-24 12:41:36 | node "node3" (ID: 3) has disconnected
child_node_reconnect
This event is generated after &repmgrd;
detects that a child node has resumed streaming from the primary node.
Example:
$ repmgr cluster event --event=child_node_reconnect
Node ID | Name | Event | OK | Timestamp | Details
---------+-------+----------------------+----+---------------------+------------------------------------------------------------
1 | node1 | child_node_reconnect | t  | 2019-04-24 12:42:19 | node "node3" (ID: 3) has reconnected after 42 seconds
child_node_new_connect
This event is generated after &repmgrd;
detects that a new child node has been registered with &repmgr; and has
connected to the primary.
Example:
$ repmgr cluster event --event=child_node_new_connect
Node ID | Name | Event | OK | Timestamp | Details
---------+-------+------------------------+----+---------------------+---------------------------------------------
1 | node1 | child_node_new_connect | t  | 2019-04-24 12:41:30 | new node "node3" (ID: 3) has connected
child_nodes_disconnect_command
This event is generated after &repmgrd; detects
that sufficient child nodes have been disconnected for a sufficient amount
of time to trigger execution of the child_nodes_disconnect_command.
Example:
$ repmgr cluster event --event=child_nodes_disconnect_command
Node ID | Name | Event | OK | Timestamp | Details
---------+-------+--------------------------------+----+---------------------+--------------------------------------------------------
1 | node1 | child_nodes_disconnect_command | t | 2019-04-24 13:08:17 | "child_nodes_disconnect_command" successfully executed