Automatic failover with repmgrd
repmgrd is a management and monitoring daemon which runs
on each node in a replication cluster. It can automate actions such as
failover and updating standbys to follow the new primary, as well as
providing monitoring information about the state of each standby.
Using a witness server
A witness server is a normal PostgreSQL instance which
is not part of the streaming replication cluster; its purpose is, if a
failover situation occurs, to provide proof that it is the primary server
itself which is unavailable, rather than e.g. a network split between
different physical locations.
A typical use case for a witness server is a two-node streaming replication
setup, where the primary and standby are in different locations (data centres).
By creating a witness server in the same location (data centre) as the primary,
if the primary becomes unavailable it's possible for the standby to decide whether
it can promote itself without risking a "split brain" scenario: if it can't see either the
witness or the primary server, it's likely there's a network-level interruption
and it should not promote itself. If it can see the witness but not the primary,
this proves there is no network interruption and the primary itself is unavailable,
and it can therefore promote itself (and ideally take action to fence the
former primary).
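The reasoning above can be sketched as a small shell function. This is only an illustration of the decision rule, not how repmgrd is implemented; the function name `witness_decision` and its 0/1 arguments are hypothetical.

```shell
#!/bin/sh
# Illustrative sketch only: repmgrd implements this logic internally.
# Hypothetical arguments: 1 = node is visible, 0 = node is not visible.
witness_decision() {
    primary_visible=$1
    witness_visible=$2

    if [ "$primary_visible" -eq 1 ]; then
        # The primary is reachable, so no failover is needed.
        echo "no-failover"
    elif [ "$witness_visible" -eq 1 ]; then
        # The witness is reachable but the primary is not:
        # the primary itself is unavailable.
        echo "promote"
    else
        # Neither is reachable: probably a network-level interruption,
        # so the standby should not promote itself.
        echo "do-not-promote"
    fi
}

witness_decision 0 1
witness_decision 0 0
```

The two calls cover the cases described in the text: witness visible without the primary, and neither node visible.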
Never install a witness server on the same physical host
as another node in the replication cluster managed by &repmgr; - it's essential
the witness is not affected in any way by failure of another node.
For more complex replication scenarios, e.g. with multiple datacentres, it may
be preferable to use location-based failover, which ensures that only nodes
in the same location as the primary will ever be promotion candidates;
see "Handling network splits with repmgrd" for more details.
A witness server will only be useful if repmgrd
is in use.
Creating a witness server
To create a witness server, set up a normal PostgreSQL instance on a server
in the same physical location as the cluster's primary server.
This instance should not be on the same physical host as the primary server,
as otherwise if the primary server fails due to hardware issues, the witness
server will be lost too.
&repmgr; 3.3 and earlier provided a repmgr create witness
command, which would automatically create a PostgreSQL instance. However
this often resulted in an unsatisfactory, hard-to-customise instance.
The witness server should be configured in the same way as a normal
&repmgr; node; see section .
Register the witness server with repmgr witness register.
This will create the &repmgr; extension on the witness server, and make
a copy of the &repmgr; metadata.
As the witness server is not part of the replication cluster, further
changes to the &repmgr; metadata will be synchronised by
repmgrd.
Once the witness server has been configured, repmgrd
should be started.
To unregister a witness server, use repmgr witness unregister.
Handling network splits with repmgrd
A common pattern for replication cluster setups is to spread servers over
more than one datacentre. This can provide benefits such as
geographically distributed read replicas and DR (disaster recovery) capability. However
this also means there is a risk of disconnection at network level between
datacentre locations, which would result in a split-brain scenario if
servers in a secondary data centre were no longer able to see the primary
in the main data centre and promoted a standby among themselves.
&repmgr; enables provision of "witness servers" to
artificially create a quorum of servers in a particular location, ensuring
that nodes in another location will not elect a new primary if they
are unable to see the majority of nodes. However this approach does not
scale well, particularly with more complex replication setups, e.g.
where the majority of nodes are located outside of the primary datacentre.
It also means the witness node needs to be managed as an
extra PostgreSQL instance outside of the main replication cluster, which
adds administrative and programming complexity.
repmgr 4 introduces the concept of location:
each node is associated with an arbitrary location string (the default is
"default"); this is set in repmgr.conf, e.g.:
node_id=1
node_name=node1
conninfo='host=node1 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/postgresql/data'
location='dc1'
In a failover situation, repmgrd will check if any servers in the
same location as the current primary node are visible. If not, repmgrd
will assume a network interruption and not promote any node in any
other location (it will however enter degraded monitoring
mode until a primary becomes visible).
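The location check described above can be illustrated with a short shell sketch. It is not repmgrd's actual code; the function name `location_decision` and the way locations are passed as arguments are assumptions made for the example.

```shell
#!/bin/sh
# Illustrative sketch only; repmgrd performs this check internally.
# $1: location of the failed primary (hypothetical input)
# remaining arguments: locations of the nodes that are still visible
location_decision() {
    primary_location=$1
    shift
    for visible_location in "$@"; do
        if [ "$visible_location" = "$primary_location" ]; then
            # A node in the primary's location is visible, so this is not
            # a network split between locations.
            echo "proceed-with-election"
            return 0
        fi
    done
    # No node in the primary's location is visible: assume a network
    # interruption, promote nothing, and keep monitoring.
    echo "degraded-monitoring"
}

location_decision dc1 dc1 dc2
location_decision dc1 dc2 dc2
```

In the second call only nodes in "dc2" are visible while the primary was in "dc1", which corresponds to the case where no promotion takes place.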
Standby disconnection on failover
If standby_disconnect_on_failover is set to true in
repmgr.conf, in a failover situation repmgrd will forcibly disconnect
the local node's WAL receiver before making a failover decision.
standby_disconnect_on_failover is available from PostgreSQL 9.5 and later.
Additionally this requires that the repmgr database user is a superuser.
By doing this, it's possible to ensure that, at the point the failover decision is made, no nodes
are receiving data from the primary and their LSN location will be static.
standby_disconnect_on_failover must be set to the same value on
all nodes.
Note that when using standby_disconnect_on_failover there will be a delay of 5 seconds
plus however many seconds it takes to confirm the WAL receiver is disconnected before
repmgrd proceeds with the failover decision.
Following the failover operation, no matter what the outcome, each node will reconnect its WAL receiver.
Failover validation
From repmgr 4.3, &repmgr; makes it possible to provide a script
to repmgrd which, in a failover situation,
will be executed by the promotion candidate (the node which has been selected
to be the new primary) to confirm whether the node should actually be promoted.
To use this, set failover_validation_command in repmgr.conf
to a script executable by the postgres system user, e.g.:
failover_validation_command=/path/to/script.sh %n %a
The %n parameter will be replaced with the node ID, and the
%a parameter will be replaced by the node name when the script is executed.
This script must return an exit code of 0 to indicate the node should promote itself.
Any other value will result in the promotion being aborted and the election rerun.
There is a pause of election_rerun_interval seconds before the election is rerun.
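As a minimal sketch of what such a script might look like, assuming it is configured with the %n and %a parameters as in the example above: the policy used here (a "block file" that an operator creates to veto promotion) and the names `BLOCK_FILE` and `validate_promotion` are invented for illustration and are not part of repmgr.

```shell
#!/bin/sh
# Hypothetical failover validation script (a sketch, not shipped with repmgr).
# Assumed to be configured as:
#   failover_validation_command=/path/to/script.sh %n %a
# so that $1 is the node ID and $2 is the node name.

# Invented policy: an operator creates this file to veto automatic promotion.
BLOCK_FILE="${BLOCK_FILE:-/tmp/repmgr-no-promote}"

validate_promotion() {
    node_id=$1
    node_name=$2
    echo "Node ID: $node_id, node name: $node_name"
    if [ -e "$BLOCK_FILE" ]; then
        echo "promotion vetoed: $BLOCK_FILE exists"
        return 1   # non-zero: promotion is aborted and the election rerun
    fi
    return 0       # zero: the node should promote itself
}

# Run the check only when called with arguments, as repmgrd would call it.
if [ "$#" -ge 2 ]; then
    validate_promotion "$1" "$2"
    exit $?
fi
```

Only the exit code decides the outcome; anything the script prints is informational.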
Sample repmgrd log file output during which the failover validation
script rejects the proposed promotion candidate:
[2019-03-13 21:01:30] [INFO] visible nodes: 2; total nodes: 2; no nodes have seen the primary within the last 4 seconds
[2019-03-13 21:01:30] [NOTICE] promotion candidate is "node2" (ID: 2)
[2019-03-13 21:01:30] [NOTICE] executing "failover_validation_command"
[2019-03-13 21:01:30] [DETAIL] /usr/local/bin/failover-validation.sh 2
[2019-03-13 21:01:30] [INFO] output returned by failover validation command:
Node ID: 2
[2019-03-13 21:01:30] [NOTICE] failover validation command returned a non-zero value: "1"
[2019-03-13 21:01:30] [NOTICE] promotion candidate election will be rerun
[2019-03-13 21:01:30] [INFO] 1 followers to notify
[2019-03-13 21:01:30] [NOTICE] notifying node "node3" (node ID: 3) to rerun promotion candidate selection
INFO: node 3 received notification to rerun promotion candidate election
[2019-03-13 21:01:30] [NOTICE] rerunning election after 15 seconds ("election_rerun_interval")
repmgrd and cascading replication
Cascading replication - where a standby can connect to an upstream node and not
the primary server itself - was introduced in PostgreSQL 9.2. &repmgr; and
repmgrd support cascading replication by keeping track of the relationship
between standby servers - each node record is stored with the node id of its
upstream ("parent") server (except of course the primary server).
In a failover situation where the primary node fails and a top-level standby
is promoted, a standby connected to another standby will not be affected
and will continue working as normal (even if the upstream standby it's connected
to becomes the primary node). If however the node's direct upstream fails,
the "cascaded standby" will attempt to reconnect to that node's parent
(unless failover is set to manual in
repmgr.conf).
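To make the upstream relationship concrete, the following sketch models a three-node cascade as a plain text table and looks up where a cascaded standby would reconnect if its direct upstream failed. The table layout and the function names `upstream_of` and `new_upstream_for` are hypothetical; the real relationships are stored in the &repmgr; metadata and handled by repmgrd.

```shell
#!/bin/sh
# Hypothetical node table: "node_id upstream_node_id" ("-" marks the primary).
# Node 2 follows the primary (node 1); node 3 is cascaded from node 2.
NODES="1 -
2 1
3 2"

# Print the upstream node ID recorded for the given node ID.
upstream_of() {
    printf '%s\n' "$NODES" | awk -v id="$1" '$1 == id { print $2 }'
}

# If the direct upstream of the given standby fails, the standby should
# reconnect to that upstream's own parent.
new_upstream_for() {
    failed_upstream=$(upstream_of "$1")
    upstream_of "$failed_upstream"
}

new_upstream_for 3
```

Here node 3 normally follows node 2, so if node 2 fails the lookup returns node 2's parent, node 1.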