mirror of
https://github.com/EnterpriseDB/repmgr.git
synced 2026-03-22 22:56:29 +00:00
This functionality enables repmgrd (when running on the primary) to monitor connected child nodes. It will log connections and disconnections and generate events. Additionally, repmgrd can execute a custom script if the number of connected child nodes falls below a configurable threshold. This script can be used e.g. to "fence" the primary following a failover situation where a new primary has been promoted and all standbys are now child nodes of that primary.
407 lines
17 KiB
Plaintext
407 lines
17 KiB
Plaintext
<chapter id="repmgrd-automatic-failover" xreflabel="Automatic failover with repmgrd">
|
|
<indexterm>
|
|
<primary>repmgrd</primary>
|
|
<secondary>automatic failover</secondary>
|
|
</indexterm>
|
|
|
|
<title>Automatic failover with repmgrd</title>
|
|
|
|
<para>
|
|
<application>repmgrd</application> is a management and monitoring daemon which runs
|
|
on each node in a replication cluster. It can automate actions such as
|
|
failover and updating standbys to follow the new primary, as well as
|
|
providing monitoring information about the state of each standby.
|
|
</para>
|
|
|
|
<sect1 id="repmgrd-witness-server" xreflabel="Using a witness server with repmgrd">
|
|
<indexterm>
|
|
<primary>repmgrd</primary>
|
|
<secondary>witness server</secondary>
|
|
</indexterm>
|
|
|
|
<indexterm>
|
|
<primary>witness server</primary>
|
|
<secondary>repmgrd</secondary>
|
|
</indexterm>
|
|
<title>Using a witness server</title>
|
|
<para>
|
|
A <xref linkend="witness-server"> is a normal PostgreSQL instance which
|
|
is not part of the streaming replication cluster; its purpose is, if a
|
|
failover situation occurs, to provide proof that it is the primary server
|
|
itself which is unavailable, rather than e.g. a network split between
|
|
different physical locations.
|
|
</para>
|
|
|
|
<para>
|
|
A typical use case for a witness server is a two-node streaming replication
|
|
setup, where the primary and standby are in different locations (data centres).
|
|
By creating a witness server in the same location (data centre) as the primary,
|
|
if the primary becomes unavailable it's possible for the standby to decide whether
|
|
it can promote itself without risking a "split brain" scenario: if it can't see either the
|
|
witness or the primary server, it's likely there's a network-level interruption
|
|
and it should not promote itself. If it can see the witness but not the primary,
|
|
this proves there is no network interruption and the primary itself is unavailable,
|
|
and it can therefore promote itself (and ideally take action to fence the
|
|
former primary).
|
|
</para>
|
|
<note>
|
|
<para>
|
|
<emphasis>Never</emphasis> install a witness server on the same physical host
|
|
as another node in the replication cluster managed by &repmgr; - it's essential
|
|
the witness is not affected in any way by failure of another node.
|
|
</para>
|
|
</note>
|
|
<para>
|
|
For more complex replication scenarios,e.g. with multiple datacentres, it may
|
|
be preferable to use location-based failover, which ensures that only nodes
|
|
in the same location as the primary will ever be promotion candidates;
|
|
see <xref linkend="repmgrd-network-split"> for more details.
|
|
</para>
|
|
|
|
<note>
|
|
<simpara>
|
|
A witness server will only be useful if <application>repmgrd</application>
|
|
is in use.
|
|
</simpara>
|
|
</note>
|
|
|
|
<sect2 id="creating-witness-server">
|
|
<title>Creating a witness server</title>
|
|
<para>
|
|
To create a witness server, set up a normal PostgreSQL instance on a server
|
|
in the same physical location as the cluster's primary server.
|
|
</para>
|
|
<para>
|
|
This instance should <emphasis>not</emphasis> be on the same physical host as the primary server,
|
|
as otherwise if the primary server fails due to hardware issues, the witness
|
|
server will be lost too.
|
|
</para>
|
|
<note>
|
|
<simpara>
|
|
&repmgr; 3.3 and earlier provided a <command>repmgr create witness</command>
|
|
command, which would automatically create a PostgreSQL instance. However
|
|
this often resulted in an unsatisfactory, hard-to-customise instance.
|
|
</simpara>
|
|
</note>
|
|
<para>
|
|
The witness server should be configured in the same way as a normal
|
|
&repmgr; node; see section <xref linkend="configuration">.
|
|
</para>
|
|
<para>
|
|
Register the witness server with <xref linkend="repmgr-witness-register">.
|
|
This will create the &repmgr; extension on the witness server, and make
|
|
a copy of the &repmgr; metadata.
|
|
</para>
|
|
<note>
|
|
<simpara>
|
|
As the witness server is not part of the replication cluster, further
|
|
changes to the &repmgr; metadata will be synchronised by
|
|
<application>repmgrd</application>.
|
|
</simpara>
|
|
</note>
|
|
<para>
|
|
Once the witness server has been configured, <application>repmgrd</application>
|
|
should be started.
|
|
</para>
|
|
|
|
<para>
|
|
To unregister a witness server, use <xref linkend="repmgr-witness-unregister">.
|
|
</para>
|
|
|
|
</sect2>
|
|
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="repmgrd-network-split" xreflabel="Handling network splits with repmgrd">
|
|
<indexterm>
|
|
<primary>repmgrd</primary>
|
|
<secondary>network splits</secondary>
|
|
</indexterm>
|
|
|
|
<indexterm>
|
|
<primary>network splits</primary>
|
|
</indexterm>
|
|
|
|
<title>Handling network splits with repmgrd</title>
|
|
<para>
|
|
A common pattern for replication cluster setups is to spread servers over
|
|
more than one datacentre. This can provide benefits such as geographically-
|
|
distributed read replicas and DR (disaster recovery capability). However
|
|
this also means there is a risk of disconnection at network level between
|
|
datacentre locations, which would result in a split-brain scenario if
|
|
servers in a secondary data centre were no longer able to see the primary
|
|
in the main data centre and promoted a standby among themselves.
|
|
</para>
|
|
<para>
|
|
&repmgr; enables provision of "<xref linkend="witness-server">" to
|
|
artificially create a quorum of servers in a particular location, ensuring
|
|
that nodes in another location will not elect a new primary if they
|
|
are unable to see the majority of nodes. However this approach does not
|
|
scale well, particularly with more complex replication setups, e.g.
|
|
where the majority of nodes are located outside of the primary datacentre.
|
|
It also means the <literal>witness</literal> node needs to be managed as an
|
|
extra PostgreSQL instance outside of the main replication cluster, which
|
|
adds administrative and programming complexity.
|
|
</para>
|
|
<para>
|
|
<literal>repmgr4</literal> introduces the concept of <literal>location</literal>:
|
|
each node is associated with an arbitrary location string (default is
|
|
<literal>default</literal>); this is set in <filename>repmgr.conf</filename>, e.g.:
|
|
<programlisting>
|
|
node_id=1
|
|
node_name=node1
|
|
conninfo='host=node1 user=repmgr dbname=repmgr connect_timeout=2'
|
|
data_directory='/var/lib/postgresql/data'
|
|
location='dc1'</programlisting>
|
|
</para>
|
|
<para>
|
|
In a failover situation, <application>repmgrd</application> will check if any servers in the
|
|
same location as the current primary node are visible. If not, <application>repmgrd</application>
|
|
will assume a network interruption and not promote any node in any
|
|
other location (it will however enter <link linkend="repmgrd-degraded-monitoring">degraded monitoring</link>
|
|
mode until a primary becomes visible).
|
|
</para>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="repmgrd-primary-standby-disconnection" xreflabel="Monitoring standby disconnections on the primary">
|
|
<indexterm>
|
|
<primary>repmgrd</primary>
|
|
<secondary>standby disconnection</secondary>
|
|
</indexterm>
|
|
|
|
<indexterm>
|
|
<primary>repmgrd</primary>
|
|
<secondary>child node disconnection</secondary>
|
|
</indexterm>
|
|
|
|
<title>Monitoring standby disconnections on the primary node</title>
|
|
|
|
<note>
|
|
<para>
|
|
This functionality is available in <link linkend="release-4.4">&repmgr 4.4</link> and later.
|
|
</para>
|
|
</note>
|
|
<para>
|
|
When running on the primary node, <application>repmgrd</application> can
|
|
monitor connections and in particular disconnections by its attached
|
|
child nodes (standbys), and optionally execute a custom command
|
|
if certain criteria are met (such as the number of attached nodes falling to
|
|
zero following a failover to a new primary); this command can be used for
|
|
example to "fence" the node and ensure it is isolated from any
|
|
applications attempting to access the replication cluster.
|
|
</para>
|
|
<para>
|
|
<itemizedlist>
|
|
|
|
<listitem>
|
|
<para>
|
|
Every few seconds (defined by the configuration parameter <varname>child_nodes_check_interval</varname>
|
|
(a value of <literal>0</literal> disables this altogether), <application>repmgrd</application> queries
|
|
the <literal>pg_stat_replication</literal> system view and compares
|
|
the nodes present there against the list of nodes registered with &repmgr; which
|
|
should be attached to the primary.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
If a child node (standby) is no longer present in <literal>pg_stat_replication</literal>,
|
|
<application>repmgrd</application> notes the time it detected the node's absence, and additionally generates a
|
|
<literal>child_node_disconnect</literal> event.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
If a chile node (standby) which was absent from <literal>pg_stat_replication</literal> reappears,
|
|
<application>repmgrd</application> clears the time it detected the node's absence, and additionally generates a
|
|
<literal>child_node_reconnect</literal> event.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
If an entirely new child node (standby) is detected, <application>repmgrd</application> adds it to its internal list
|
|
and additionally generates a <literal>child_node_new_connect</literal> event.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
If the <varname>child_nodes_disconnect_command</varname> parameter is set in
|
|
<filename>repmgr.conf</filename>, <application>repmgrd</application> will then loop through all child nodes.
|
|
If it determines that insufficient child nodes are connected, and a
|
|
minimum of <varname>child_nodes_disconnect_timeout</varname> seconds (default: <literal>30</literal>
|
|
has elapsed since the last node became disconnected, <application>repmgrd</application> will then execute the
|
|
<varname>child_nodes_disconnect_command</varname> script.
|
|
</para>
|
|
<para>
|
|
By default, the <varname>child_nodes_disconnect_command</varname> will only be executed
|
|
if all child nodes are disconnected. If <varname>child_nodes_connected_min_count</varname>
|
|
is set, the <varname>child_nodes_disconnect_command</varname> script will be triggered
|
|
if the number of connected child nodes falls below the specified value (e.g.
|
|
if set to <literal>2</literal>, the script will be triggered if only one child node
|
|
is connected). Alternatively, if <varname>child_nodes_disconnect_min_count</varname>
|
|
and more than that number of child nodes disconnects, the script will be triggered.
|
|
</para>
|
|
<para>
|
|
The <varname>child_nodes_disconnect_command</varname> script will only be executed once
|
|
while the criteria for its execution are met. If the criteria for its execution are no longer
|
|
met (i.e. some child nodes have reconnected), it will be executed again if
|
|
the criteria for its execution are met again.
|
|
</para>
|
|
<para>
|
|
The <varname>child_nodes_disconnect_command</varname> script will not be executed if <application>repmgrd</application> is paused.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
Note that child nodes which are not attached when <application>repmgrd</application>
|
|
starts will <emphasis>not</emphasis> be considered as missing, as <application>repmgrd</application>
|
|
cannot know why they are not attached.
|
|
</para>
|
|
</listitem>
|
|
|
|
</itemizedlist>
|
|
</para>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="repmgrd-standby-disconnection-on-failover" xreflabel="Standby disconnection on failover">
|
|
<indexterm>
|
|
<primary>repmgrd</primary>
|
|
<secondary>standby disconnection on failover</secondary>
|
|
</indexterm>
|
|
|
|
<indexterm>
|
|
<primary>standby disconnection on failover</primary>
|
|
</indexterm>
|
|
|
|
<title>Standby disconnection on failover</title>
|
|
<para>
|
|
If <option>standby_disconnect_on_failover</option> is set to <literal>true</literal> in
|
|
<filename>repmgr.conf</filename>, in a failover situation <application>repmgrd</application> will forcibly disconnect
|
|
the local node's WAL receiver before making a failover decision.
|
|
</para>
|
|
<note>
|
|
<para>
|
|
<option>standby_disconnect_on_failover</option> is available from PostgreSQL 9.5 and later.
|
|
Additionally this requires that the <literal>repmgr</literal> database user is a superuser.
|
|
</para>
|
|
</note>
|
|
<para>
|
|
By doing this, it's possible to ensure that, at the point the failover decision is made, no nodes
|
|
are receiving data from the primary and their LSN location will be static.
|
|
</para>
|
|
<important>
|
|
<para>
|
|
<option>standby_disconnect_on_failover</option> <emphasis>must</emphasis> be set to the same value on
|
|
all nodes.
|
|
</para>
|
|
</important>
|
|
<para>
|
|
Note that when using <option>standby_disconnect_on_failover</option> there will be a delay of 5 seconds
|
|
plus however many seconds it takes to confirm the WAL receiver is disconnected before
|
|
<application>repmgrd</application> proceeds with the failover decision.
|
|
</para>
|
|
<para>
|
|
Following the failover operation, no matter what the outcome, each node will reconnect its WAL receiver.
|
|
</para>
|
|
<para>
|
|
If using <option>standby_disconnect_on_failover</option>, we recommend that the
|
|
<option>primary_visibility_consensus</option> option is also used.
|
|
</para>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="repmgrd-failover-validation" xreflabel="Failover validation">
|
|
<indexterm>
|
|
<primary>repmgrd</primary>
|
|
<secondary>failover validation</secondary>
|
|
</indexterm>
|
|
|
|
<indexterm>
|
|
<primary>failover validation</primary>
|
|
</indexterm>
|
|
|
|
<title>Failover validation</title>
|
|
<para>
|
|
From <link linkend="release-4.3">repmgr 4.3</link>, &repmgr; makes it possible to provide a script
|
|
to <application>repmgrd</application> which, in a failover situation,
|
|
will be executed by the promotion candidate (the node which has been selected
|
|
to be the new primary) to confirm whether the node should actually be promoted.
|
|
</para>
|
|
<para>
|
|
To use this, <option>failover_validation_command</option> in <filename>repmgr.conf</filename>
|
|
to a script executable by the <literal>postgres</literal> system user, e.g.:
|
|
<programlisting>
|
|
failover_validation_command=/path/to/script.sh %n %a</programlisting>
|
|
</para>
|
|
<para>
|
|
The <literal>%n</literal> parameter will be replaced with the node ID, and the
|
|
<literal>%a</literal> parameter will be replaced by the node name when the script is executed.
|
|
</para>
|
|
<para>
|
|
This script must return an exit code of <literal>0</literal> to indicate the node should promote itself.
|
|
Any other value will result in the promotion being aborted and the election rerun.
|
|
There is a pause of <option>election_rerun_interval</option> seconds before the election is rerun.
|
|
</para>
|
|
<para>
|
|
Sample <application>repmgrd</application> log file output during which the failover validation
|
|
script rejects the proposed promotion candidate:
|
|
<programlisting>
|
|
[2019-03-13 21:01:30] [INFO] visible nodes: 2; total nodes: 2; no nodes have seen the primary within the last 4 seconds
|
|
[2019-03-13 21:01:30] [NOTICE] promotion candidate is "node2" (ID: 2)
|
|
[2019-03-13 21:01:30] [NOTICE] executing "failover_validation_command"
|
|
[2019-03-13 21:01:30] [DETAIL] /usr/local/bin/failover-validation.sh 2
|
|
[2019-03-13 21:01:30] [INFO] output returned by failover validation command:
|
|
Node ID: 2
|
|
|
|
[2019-03-13 21:01:30] [NOTICE] failover validation command returned a non-zero value: "1"
|
|
[2019-03-13 21:01:30] [NOTICE] promotion candidate election will be rerun
|
|
[2019-03-13 21:01:30] [INFO] 1 followers to notify
|
|
[2019-03-13 21:01:30] [NOTICE] notifying node "node3" (node ID: 3) to rerun promotion candidate selection
|
|
INFO: node 3 received notification to rerun promotion candidate election
|
|
[2019-03-13 21:01:30] [NOTICE] rerunning election after 15 seconds ("election_rerun_interval")</programlisting>
|
|
</para>
|
|
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="cascading-replication" xreflabel="Cascading replication">
|
|
<indexterm>
|
|
<primary>repmgrd</primary>
|
|
<secondary>cascading replication</secondary>
|
|
</indexterm>
|
|
|
|
<indexterm>
|
|
<primary>cascading replication</primary>
|
|
<secondary>repmgrd</secondary>
|
|
</indexterm>
|
|
|
|
<title>repmgrd and cascading replication</title>
|
|
<para>
|
|
Cascading replication - where a standby can connect to an upstream node and not
|
|
the primary server itself - was introduced in PostgreSQL 9.2. &repmgr; and
|
|
<application>repmgrd</application> support cascading replication by keeping track of the relationship
|
|
between standby servers - each node record is stored with the node id of its
|
|
upstream ("parent") server (except of course the primary server).
|
|
</para>
|
|
<para>
|
|
In a failover situation where the primary node fails and a top-level standby
|
|
is promoted, a standby connected to another standby will not be affected
|
|
and continue working as normal (even if the upstream standby it's connected
|
|
to becomes the primary node). If however the node's direct upstream fails,
|
|
the "cascaded standby" will attempt to reconnect to that node's parent
|
|
(unless <varname>failover</varname> is set to <literal>manual</literal> in
|
|
<filename>repmgr.conf</filename>).
|
|
</para>
|
|
|
|
</sect1>
|
|
|
|
|
|
</chapter>
|