repmgr/doc/repmgrd-automatic-failover.sgml

<chapter id="repmgrd-automatic-failover" xreflabel="Automatic failover with repmgrd">
 <indexterm>
   <primary>repmgrd</primary>
   <secondary>automatic failover</secondary>
 </indexterm>

 <title>Automatic failover with repmgrd</title>

 <para>
  <application>repmgrd</application> is a management and monitoring daemon which runs
  on each node in a replication cluster. It can automate actions such as
  failover and updating standbys to follow the new primary, as well as
  providing monitoring information about the state of each standby.
 </para>

<sect1 id="repmgrd-witness-server" xreflabel="Using a witness server with repmgrd">
 <indexterm>
   <primary>repmgrd</primary>
   <secondary>witness server</secondary>
 </indexterm>

 <indexterm>
   <primary>witness server</primary>
   <secondary>repmgrd</secondary>
 </indexterm>
 <title>Using a witness server</title>
 <para>
   A <xref linkend="witness-server"> is a normal PostgreSQL instance which
   is not part of the streaming replication cluster; its purpose is, if a
   failover situation occurs, to provide proof that it is the primary server
   itself which is unavailable, rather than e.g. a network split between
   different physical locations.
 </para>

 <para>
   A typical use case for a witness server is a two-node streaming replication
   setup, where the primary and standby are in different locations (data centres).
   By creating a witness server in the same location (data centre) as the primary,
   if the primary becomes unavailable it's possible for the standby to decide whether
   it can promote itself without risking a "split brain" scenario: if it can't see either the
   witness or the primary server, it's likely there's a network-level interruption
   and it should not promote itself. If it can see the witness but not the primary,
   this proves there is no network interruption and the primary itself is unavailable,
   and it can therefore promote itself (and ideally take action to fence the
   former primary).
 </para>
 <note>
   <para>
     <emphasis>Never</emphasis> install a witness server on the same physical host
     as another node in the replication cluster managed by &repmgr;. It is essential
     that the witness is not affected in any way by the failure of another node.
   </para>
 </note>
 <para>
   For more complex replication scenarios, e.g. with multiple data centres, it may
   be preferable to use location-based failover, which ensures that only nodes
   in the same location as the primary will ever be promotion candidates;
   see <xref linkend="repmgrd-network-split"> for more details.
 </para>

 <note>
   <simpara>
     A witness server will only be useful if <application>repmgrd</application>
     is in use.
   </simpara>
 </note>

 <sect2 id="creating-witness-server">
   <title>Creating a witness server</title>
 <para>
   To create a witness server, set up a normal PostgreSQL instance on a server
   in the same physical location as the cluster's primary server.
 </para>
 <para>
   This instance should <emphasis>not</emphasis> be on the same physical host as the primary server,
   as otherwise if the primary server fails due to hardware issues, the witness
   server will be lost too.
 </para>
 <note>
   <simpara>
     &repmgr; 3.3 and earlier provided a <command>repmgr create witness</command>
     command, which would automatically create a PostgreSQL instance. However
     this often resulted in an unsatisfactory, hard-to-customise instance.
   </simpara>
 </note>
 <para>
   The witness server should be configured in the same way as a normal
   &repmgr; node; see section <xref linkend="configuration">.
 </para>
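 <para>
   As an illustration only, the <filename>repmgr.conf</filename> file for a witness
   server might resemble the following sketch. The node ID, node name, host name and
   data directory shown here are assumed values and must be adapted to the actual
   environment:
   <programlisting>
    # hypothetical values for a witness node
    node_id=3
    node_name=witness1
    conninfo='host=witness1 user=repmgr dbname=repmgr connect_timeout=2'
    data_directory='/var/lib/postgresql/data'</programlisting>
 </para>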
 <para>
   Register the witness server with <xref linkend="repmgr-witness-register">.
   This will create the &repmgr; extension on the witness server, and make
   a copy of the &repmgr; metadata.
 </para>
 <note>
   <simpara>
    As the witness server is not part of the replication cluster, further
    changes to the &repmgr; metadata will be synchronised by
    <application>repmgrd</application>.
   </simpara>
 </note>
 <para>
   Once the witness server has been configured, <application>repmgrd</application>
   should be started.
 </para>

 <para>
  To unregister a witness server, use <xref linkend="repmgr-witness-unregister">.
 </para>

 </sect2>

</sect1>


<sect1 id="repmgrd-network-split" xreflabel="Handling network splits with repmgrd">
 <indexterm>
   <primary>repmgrd</primary>
   <secondary>network splits</secondary>
 </indexterm>

 <indexterm>
   <primary>network splits</primary>
 </indexterm>

 <title>Handling network splits with repmgrd</title>
 <para>
  A common pattern for replication cluster setups is to spread servers over
  more than one data centre. This can provide benefits such as geographically
  distributed read replicas and disaster recovery (DR) capability. However,
  this also means there is a risk of disconnection at network level between
  data centre locations, which would result in a split-brain scenario if
  servers in a secondary data centre were no longer able to see the primary
  in the main data centre and promoted a standby among themselves.
 </para>
 <para>
  &repmgr; enables provision of &quot;<xref linkend="witness-server">&quot; to
  artificially create a quorum of servers in a particular location, ensuring
  that nodes in another location will not elect a new primary if they
  are unable to see the majority of nodes. However this approach does not
  scale well, particularly with more complex replication setups, e.g.
  where the majority of nodes are located outside of the primary data centre.
  It also means the <literal>witness</literal> node needs to be managed as an
  extra PostgreSQL instance outside of the main replication cluster, which
  adds administrative and programming complexity.
 </para>
 <para>
  &repmgr; 4 introduces the concept of <literal>location</literal>:
  each node is associated with an arbitrary location string (default is
  <literal>default</literal>); this is set in <filename>repmgr.conf</filename>, e.g.:
  <programlisting>
    node_id=1
    node_name=node1
    conninfo='host=node1 user=repmgr dbname=repmgr connect_timeout=2'
    data_directory='/var/lib/postgresql/data'
    location='dc1'</programlisting>
 </para>
 <para>
  In a failover situation, <application>repmgrd</application> will check if any servers in the
  same location as the current primary node are visible.  If not, <application>repmgrd</application>
  will assume a network interruption and not promote any node in any
  other location (it will however enter <link linkend="repmgrd-degraded-monitoring">degraded monitoring</link>
  mode until a primary becomes visible).
 </para>
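 <para>
   By way of example, and assuming a hypothetical second data centre labelled
   <literal>dc2</literal>, a standby located there might be configured as in the
   sketch below (the node ID, node name and host name are illustrative). If the
   primary is in <literal>dc1</literal> and this node cannot see any node in
   <literal>dc1</literal>, it would not be promoted:
   <programlisting>
    # hypothetical standby in a second location
    node_id=3
    node_name=node3
    conninfo='host=node3 user=repmgr dbname=repmgr connect_timeout=2'
    data_directory='/var/lib/postgresql/data'
    location='dc2'</programlisting>
 </para>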

</sect1>

<sect1 id="repmgrd-primary-standby-disconnection" xreflabel="Monitoring standby disconnections on the primary">
  <indexterm>
   <primary>repmgrd</primary>
   <secondary>standby disconnection</secondary>
  </indexterm>

  <indexterm>
   <primary>repmgrd</primary>
   <secondary>child node disconnection</secondary>
  </indexterm>

  <title>Monitoring standby disconnections on the primary node</title>

  <note>
    <para>
      This functionality is available in <link linkend="release-4.4">&repmgr; 4.4</link> and later.
    </para>
  </note>
  <para>
    When running on the primary node, <application>repmgrd</application> can
    monitor connections and in particular disconnections by its attached
    child nodes (standbys), and optionally execute a custom command
    if certain criteria are met (such as the number of attached nodes falling to
    zero following a failover to a new primary); this command can be used for
    example to &quot;fence&quot; the node and ensure it is isolated from any
    applications attempting to access the replication cluster.
  </para>
  <para>
    <itemizedlist>

      <listitem>
        <para>
          Every few seconds (defined by the configuration parameter <varname>child_nodes_check_interval</varname>;
          a value of <literal>0</literal> disables this altogether), <application>repmgrd</application> queries
          the <literal>pg_stat_replication</literal> system view and compares
          the nodes present there against the list of nodes registered with &repmgr; which
          should be attached to the primary.
        </para>
      </listitem>

      <listitem>
        <para>
          If a child node (standby) is no longer present in <literal>pg_stat_replication</literal>,
          <application>repmgrd</application> notes the time it detected the node's absence, and additionally generates a
          <literal>child_node_disconnect</literal> event.
        </para>
      </listitem>

      <listitem>
        <para>
          If a child node (standby) which was absent from <literal>pg_stat_replication</literal> reappears,
          <application>repmgrd</application> clears the time it detected the node's absence, and additionally generates a
          <literal>child_node_reconnect</literal> event.
        </para>
      </listitem>

      <listitem>
        <para>
          If an entirely new child node (standby) is detected, <application>repmgrd</application> adds it to its internal list
          and additionally generates a <literal>child_node_new_connect</literal> event.
        </para>
      </listitem>

      <listitem>
        <para>
          If the <varname>child_nodes_disconnect_command</varname> parameter is set in
          <filename>repmgr.conf</filename>, <application>repmgrd</application> will then loop through all child nodes.
          If it determines that insufficient child nodes are connected, and a
          minimum of <varname>child_nodes_disconnect_timeout</varname> seconds (default: <literal>30</literal>)
          has elapsed since the last node became disconnected, <application>repmgrd</application> will then execute the
          <varname>child_nodes_disconnect_command</varname> script.
        </para>
        <para>
          By default, the <varname>child_nodes_disconnect_command</varname> will only be executed
          if all child nodes are disconnected. If <varname>child_nodes_connected_min_count</varname>
          is set, the <varname>child_nodes_disconnect_command</varname> script will be triggered
          if the number of connected child nodes falls below the specified value (e.g.
          if set to <literal>2</literal>, the script will be triggered if only one child node
          is connected). Alternatively, if <varname>child_nodes_disconnect_min_count</varname>
          is set and more than that number of child nodes disconnect, the script will be triggered.
        </para>
        <para>
          The <varname>child_nodes_disconnect_command</varname> script will be executed only once
          for as long as the criteria for its execution continue to be met. If the criteria cease to be
          met (i.e. some child nodes have reconnected) and are subsequently met again,
          the script will be executed again.
        </para>
        <para>
          The <varname>child_nodes_disconnect_command</varname> script will not be executed if <application>repmgrd</application> is paused.
        </para>
      </listitem>

      <listitem>
        <para>
          Note that child nodes which are not attached when <application>repmgrd</application>
          starts will <emphasis>not</emphasis> be considered as missing, as <application>repmgrd</application>
          cannot know why they are not attached.
        </para>
      </listitem>

    </itemizedlist>
  </para>
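  <para>
    The parameters described above might be combined in <filename>repmgr.conf</filename>
    on the primary as in the following sketch. The script path is hypothetical and the
    values are illustrative only. With these settings, the script would be triggered once
    fewer than two child nodes are connected and at least 30 seconds have elapsed since
    the last disconnection:
    <programlisting>
      # illustrative values; the script path is an assumption
      child_nodes_check_interval=5
      child_nodes_disconnect_timeout=30
      child_nodes_connected_min_count=2
      child_nodes_disconnect_command='/usr/local/bin/fence-primary.sh'</programlisting>
  </para>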

</sect1>

<sect1 id="repmgrd-standby-disconnection-on-failover" xreflabel="Standby disconnection on failover">
  <indexterm>
   <primary>repmgrd</primary>
   <secondary>standby disconnection on failover</secondary>
 </indexterm>

  <indexterm>
    <primary>standby disconnection on failover</primary>
  </indexterm>

  <title>Standby disconnection on failover</title>
  <para>
    If <option>standby_disconnect_on_failover</option> is set to <literal>true</literal> in
    <filename>repmgr.conf</filename>, in a failover situation <application>repmgrd</application> will forcibly disconnect
    the local node's WAL receiver before making a failover decision.
  </para>
  <note>
    <para>
      <option>standby_disconnect_on_failover</option> requires PostgreSQL 9.5 or later.
      Additionally, the <literal>repmgr</literal> database user must be a superuser.
    </para>
  </note>
  <para>
    By doing this, it's possible to ensure that, at the point the failover decision is made, no nodes
    are receiving data from the primary and their LSN location will be static.
  </para>
  <important>
    <para>
      <option>standby_disconnect_on_failover</option> <emphasis>must</emphasis> be set to the same value on
      all nodes.
    </para>
  </important>
  <para>
    Note that when using <option>standby_disconnect_on_failover</option> there will be a delay of 5 seconds
    plus however many seconds it takes to confirm the WAL receiver is disconnected before
    <application>repmgrd</application> proceeds with the failover decision.
  </para>
  <para>
    Following the failover operation, no matter what the outcome, each node will reconnect its WAL receiver.
  </para>
  <para>
    If using <option>standby_disconnect_on_failover</option>, we recommend that the
    <option>primary_visibility_consensus</option> option is also used.
  </para>
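  <para>
    A minimal sketch of the corresponding <filename>repmgr.conf</filename> entries,
    assuming both options are to be enabled together as recommended above:
    <programlisting>
      # set standby_disconnect_on_failover to the same value on all nodes
      standby_disconnect_on_failover=true
      primary_visibility_consensus=true</programlisting>
  </para>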

</sect1>

<sect1 id="repmgrd-failover-validation" xreflabel="Failover validation">
  <indexterm>
   <primary>repmgrd</primary>
   <secondary>failover validation</secondary>
 </indexterm>

  <indexterm>
    <primary>failover validation</primary>
  </indexterm>

  <title>Failover validation</title>
  <para>
    From <link linkend="release-4.3">repmgr 4.3</link>, &repmgr; makes it possible to provide a script
    to <application>repmgrd</application> which, in a failover situation,
    will be executed by the promotion candidate (the node which has been selected
    to be the new primary) to confirm whether the node should actually be promoted.
  </para>
  <para>
    To use this, set <option>failover_validation_command</option> in <filename>repmgr.conf</filename>
    to a script executable by the <literal>postgres</literal> system user, e.g.:
    <programlisting>
      failover_validation_command='/path/to/script.sh %n %a'</programlisting>
  </para>
  <para>
    The <literal>%n</literal> parameter will be replaced with the node ID, and the
    <literal>%a</literal> parameter will be replaced by the node name when the script is executed.
  </para>
  <para>
    This script must return an exit code of <literal>0</literal> to indicate the node should promote itself.
    Any other value will result in the promotion being aborted and the election rerun.
    There is a pause of <option>election_rerun_interval</option> seconds before the election is rerun.
  </para>
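  <para>
    As a sketch, the two settings might appear together in <filename>repmgr.conf</filename>
    as follows. The script path and the 15-second interval are illustrative values chosen to
    match the sample log output in this section, and only the node ID is passed to the script:
    <programlisting>
      # illustrative values; adapt the script path and interval as required
      failover_validation_command='/usr/local/bin/failover-validation.sh %n'
      election_rerun_interval=15</programlisting>
  </para>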
  <para>
    Sample <application>repmgrd</application> log file output from a failover in which the
    failover validation script rejects the proposed promotion candidate:
    <programlisting>
[2019-03-13 21:01:30] [INFO] visible nodes: 2; total nodes: 2; no nodes have seen the primary within the last 4 seconds
[2019-03-13 21:01:30] [NOTICE] promotion candidate is "node2" (ID: 2)
[2019-03-13 21:01:30] [NOTICE] executing "failover_validation_command"
[2019-03-13 21:01:30] [DETAIL] /usr/local/bin/failover-validation.sh 2
[2019-03-13 21:01:30] [INFO] output returned by failover validation command:
Node ID: 2

[2019-03-13 21:01:30] [NOTICE] failover validation command returned a non-zero value: "1"
[2019-03-13 21:01:30] [NOTICE] promotion candidate election will be rerun
[2019-03-13 21:01:30] [INFO] 1 followers to notify
[2019-03-13 21:01:30] [NOTICE] notifying node "node3" (node ID: 3) to rerun promotion candidate selection
INFO:  node 3 received notification to rerun promotion candidate election
[2019-03-13 21:01:30] [NOTICE] rerunning election after 15 seconds ("election_rerun_interval")</programlisting>
  </para>


</sect1>

  <sect1 id="cascading-replication" xreflabel="Cascading replication">
 <indexterm>
   <primary>repmgrd</primary>
   <secondary>cascading replication</secondary>
 </indexterm>

 <indexterm>
   <primary>cascading replication</primary>
   <secondary>repmgrd</secondary>
 </indexterm>

 <title>repmgrd and cascading replication</title>
 <para>
  Cascading replication, where a standby can connect to an upstream node and not
  the primary server itself, was introduced in PostgreSQL 9.2. &repmgr; and
  <application>repmgrd</application> support cascading replication by keeping track of the relationship
  between standby servers: each node record is stored with the node ID of its
  upstream ("parent") server (except of course the primary server).
 </para>
 <para>
  In a failover situation where the primary node fails and a top-level standby
  is promoted, a standby connected to another standby will not be affected
  and will continue working as normal (even if the upstream standby it's connected
  to becomes the primary node). If however the node's direct upstream fails,
  the &quot;cascaded standby&quot; will attempt to reconnect to that node's parent
  (unless <varname>failover</varname> is set to <literal>manual</literal> in
  <filename>repmgr.conf</filename>).
 </para>
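 <para>
   For example, a cascaded standby which should not attempt this reconnection
   automatically could, as a sketch, carry the following setting in its
   <filename>repmgr.conf</filename>:
   <programlisting>
    # disables automatic failover handling on this node
    failover=manual</programlisting>
 </para>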

  </sect1>


</chapter>