repmgr/doc/repmgrd-automatic-failover.sgml

<chapter id="repmgrd-automatic-failover" xreflabel="Automatic failover with repmgrd">
 <indexterm>
   <primary>repmgrd</primary>
   <secondary>automatic failover</secondary>
 </indexterm>

 <title>Automatic failover with repmgrd</title>

 <para>
  <application>repmgrd</application> is a management and monitoring daemon which runs
  on each node in a replication cluster. It can automate actions such as
  failover and updating standbys to follow the new primary, as well as
  providing monitoring information about the state of each standby.
 </para>

<sect1 id="repmgrd-witness-server" xreflabel="Using a witness server with repmgrd">
 <indexterm>
   <primary>repmgrd</primary>
   <secondary>witness server</secondary>
 </indexterm>

 <indexterm>
   <primary>witness server</primary>
   <secondary>repmgrd</secondary>
 </indexterm>

 <title>Using a witness server with repmgrd</title>
 <para>
   In a situation caused e.g. by a network interruption between two
   data centres, it's important to avoid a &quot;split-brain&quot; situation where
   both sides of the network assume they are the active segment and the
   side without an active primary unilaterally promotes one of its standbys.
 </para>
 <para>
   To prevent this situation happening, it's essential to ensure that one
   network segment has a &quot;voting majority&quot;, so other segments will know
   they're in the minority and not attempt to promote a new primary. Where
   an odd number of servers exists, this is not an issue. However, if each
   network has an even number of nodes, it's necessary to provide some way
   of ensuring a majority, which is where the witness server becomes useful.
 </para>
 <para>
   This is not a fully-fledged standby node and is not integrated into
   replication, but it effectively represents the &quot;casting vote&quot; when
   deciding which network segment has a majority. A witness server can
   be set up using <link linkend="repmgr-witness-register"><command>repmgr witness register</command></link>;
   see also section <link linkend="using-witness-server">Using a witness server</link>.
 </para>
 <note>
   <para>
     It only
     makes sense to create a witness server in conjunction with running
     <application>repmgrd</application>; the witness server will require its own
     <application>repmgrd</application> instance.
   </para>
 </note>

</sect1>


<sect1 id="repmgrd-network-split" xreflabel="Handling network splits with repmgrd">
 <indexterm>
   <primary>repmgrd</primary>
   <secondary>network splits</secondary>
 </indexterm>

 <indexterm>
   <primary>network splits</primary>
 </indexterm>

 <title>Handling network splits with repmgrd</title>
 <para>
  A common pattern for replication cluster setups is to spread servers over
  more than one datacentre. This can provide benefits such as geographically-
  distributed read replicas and DR (disaster recovery capability). However
  this also means there is a risk of disconnection at network level between
  datacentre locations, which would result in a split-brain scenario if
  servers in a secondary data centre were no longer able to see the primary
  in the main data centre and promoted a standby among themselves.
 </para>
 <para>
  &repmgr; enables provision of &quot;<xref linkend="witness-server">&quot; to
  artificially create a quorum of servers in a particular location, ensuring
  that nodes in another location will not elect a new primary if they
  are unable to see the majority of nodes. However this approach does not
  scale well, particularly with more complex replication setups, e.g.
  where the majority of nodes are located outside of the primary datacentre.
  It also means the <literal>witness</literal> node needs to be managed as an
  extra PostgreSQL instance outside of the main replication cluster, which
  adds administrative and programming complexity.
 </para>
 <para>
  <literal>repmgr4</literal> introduces the concept of <literal>location</literal>:
  each node is associated with an arbitrary location string (default is
  <literal>default</literal>); this is set in <filename>repmgr.conf</filename>, e.g.:
  <programlisting>
    node_id=1
    node_name=node1
    conninfo='host=node1 user=repmgr dbname=repmgr connect_timeout=2'
    data_directory='/var/lib/postgresql/data'
    location='dc1'</programlisting>
 </para>
 <para>
  In a failover situation, <application>repmgrd</application> will check if any servers in the
  same location as the current primary node are visible.  If not, <application>repmgrd</application>
  will assume a network interruption and not promote any node in any
  other location (it will however enter <link linkend="repmgrd-degraded-monitoring">degraded monitoring</link>
  mode until a primary becomes visible).
 </para>

</sect1>

<sect1 id="repmgrd-failover-validation" xreflabel="Failover validation">
  <indexterm>
   <primary>repmgrd</primary>
   <secondary>failover validation</secondary>
 </indexterm>

  <indexterm>
    <primary>failover validation</primary>
  </indexterm>

  <title>Failover validation</title>
  <para>
    From <link linkend="release-4.3">repmgr 4.3</link>, &repmgr; makes it possible to provide a script
    to <application>repmgrd</application> which, in a failover situation,
    will be executed by the promotion candidate (the node which has been selected
    to be the new primary) to confirm whether the node should actually be promoted.
  </para>
  <para>
    To use this, <option>failover_validation_command</option> in <filename>repmgr.conf</filename>
    to a script executable by the <literal>postgres</literal> system user, e.g.:
    <programlisting>
      failover_validation_command=/path/to/script.sh %n %a</programlisting>
  </para>
  <para>
    The <literal>%n</literal> parameter will be replaced with the node ID, and the
    <literal>%a</literal> parameter will be replaced by the node name when the script is executed.
  </para>
  <para>
    This script must return an exit code of <literal>0</literal> to indicate the node should promote itself.
    Any other value will result in the promotion being aborted and the election rerun.
    There is a pause of <option>election_rerun_interval</option> seconds before the election is rerun.
  </para>
  <para>
    Sample <application>repmgrd</application> log file output during which the failover validation
    script rejects the proposed promotion candidate:
    <programlisting>
[2019-03-13 21:01:30] [INFO] visible nodes: 2; total nodes: 2; no nodes have seen the primary within the last 4 seconds
[2019-03-13 21:01:30] [NOTICE] promotion candidate is "node2" (ID: 2)
[2019-03-13 21:01:30] [NOTICE] executing "failover_validation_command"
[2019-03-13 21:01:30] [DETAIL] /usr/local/bin/failover-validation.sh 2
[2019-03-13 21:01:30] [INFO] output returned by failover validation command:
Node ID: 2

[2019-03-13 21:01:30] [NOTICE] failover validation command returned a non-zero value: "1"
[2019-03-13 21:01:30] [NOTICE] promotion candidate election will be rerun
[2019-03-13 21:01:30] [INFO] 1 followers to notify
[2019-03-13 21:01:30] [NOTICE] notifying node "node3" (node ID: 3) to rerun promotion candidate selection
INFO:  node 3 received notification to rerun promotion candidate election
[2019-03-13 21:01:30] [NOTICE] rerunning election after 15 seconds ("election_rerun_interval")</programlisting>
  </para>


</sect1>

  <sect1 id="cascading-replication" xreflabel="Cascading replication">
 <indexterm>
   <primary>repmgrd</primary>
   <secondary>cascading replication</secondary>
 </indexterm>

 <indexterm>
   <primary>cascading replication</primary>
   <secondary>repmgrd</secondary>
 </indexterm>

 <title>repmgrd and cascading replication</title>
 <para>
  Cascading replication - where a standby can connect to an upstream node and not
  the primary server itself - was introduced in PostgreSQL 9.2. &repmgr; and
  <application>repmgrd</application> support cascading replication by keeping track of the relationship
  between standby servers - each node record is stored with the node id of its
  upstream ("parent") server (except of course the primary server).
 </para>
 <para>
  In a failover situation where the primary node fails and a top-level standby
  is promoted, a standby connected to another standby will not be affected
  and continue working as normal (even if the upstream standby it's connected
  to becomes the primary node). If however the node's direct upstream fails,
  the &quot;cascaded standby&quot; will attempt to reconnect to that node's parent
  (unless <varname>failover</varname> is set to <literal>manual</literal> in
  <filename>repmgr.conf</filename>).
 </para>

  </sect1>


</chapter>