
<chapter id="repmgrd-bdr">
  <indexterm>
    <primary>repmgrd</primary>
    <secondary>BDR</secondary>
  </indexterm>

  <indexterm>
    <primary>BDR</primary>
  </indexterm>

  <title>BDR failover with repmgrd</title>
  <para>
    &repmgr; 4.x provides support for monitoring BDR nodes and taking action in
    case one of the nodes fails.
  </para>
  <note>
    <simpara>
      Due to the nature of BDR, it's only safe to use this solution for
      a two-node scenario. Introducing additional nodes will create an inherent
      risk of node desynchronisation if a node goes down without being cleanly
      removed from the cluster.
    </simpara>
  </note>
  <para>
    In contrast to streaming replication, there's no concept of "promoting" a new
    primary node with BDR. Instead, "failover" involves monitoring both nodes
    with <application>repmgrd</application> and redirecting queries from the failed node to the remaining
    active node. This can be done by using an
    <link linkend="event-notifications">event notification</link> script
    which is called by <application>repmgrd</application> to dynamically
    reconfigure a proxy server/connection pooler such as <application>PgBouncer</application>.
  </para>

  <sect1 id="bdr-prerequisites" xreflabel="BDR prerequisites">
    <title>Prerequisites</title>
    <para>
      &repmgr; 4 requires PostgreSQL 9.4 or 9.6 with the BDR 2 extension
      enabled and configured for a two-node BDR network. &repmgr; 4 packages
      must be installed on each node before attempting to configure
      <application>repmgr</application>.
    </para>
    <note>
      <simpara>
        &repmgr; 4 will refuse to install if it detects more than two BDR nodes.
      </simpara>
    </note>
    <para>
      Application database connections <emphasis>must</emphasis> be passed through a proxy server/connection
      pooler such as <application>PgBouncer</application>, and it must be possible to dynamically
      reconfigure that from <application>repmgrd</application>. The example demonstrated in this document
      uses <application>PgBouncer</application>.
    </para>
    <para>
      The proxy server/connection pooler must <emphasis>not</emphasis>
      be installed on the database servers.
    </para>
    <para>
      For this example, it's assumed password-less SSH connections are available
      from the PostgreSQL servers to the servers where <application>PgBouncer</application>
      runs, and that the user on those servers has permission to alter the
      <application>PgBouncer</application> configuration files.
    </para>
    <para>
      PostgreSQL connections must be possible between each node, and each node
      must be able to connect to each PgBouncer instance.
    </para>
  </sect1>

  <sect1 id="bdr-configuration" xreflabel="BDR configuration">
    <title>Configuration</title>
    <para>
      A sample configuration for <filename>repmgr.conf</filename> on each
      BDR node would look like this:
      <programlisting>
        # Node information
        node_id=1
        node_name='node1'
        conninfo='host=node1 dbname=bdrtest user=repmgr connect_timeout=2'
        data_directory='/var/lib/postgresql/data'
        replication_type='bdr'

        # Event notification configuration
        event_notifications=bdr_failover
        event_notification_command='/path/to/bdr-pgbouncer.sh %n %e %s "%c" "%a" >> /tmp/bdr-failover.log 2>&1'

        # repmgrd options
        monitor_interval_secs=5
        reconnect_attempts=6
        reconnect_interval=5</programlisting>
    </para>
    <para>
      Adjust settings as appropriate; copy and adjust for the second node (particularly
      the values <varname>node_id</varname>, <varname>node_name</varname>
      and <varname>conninfo</varname>).
    </para>
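    <para>
      As an illustration, the corresponding file on the second node might look
      like the following sketch. It assumes the second node is named
      <literal>node2</literal>, is reachable under that host name, and uses the
      same data directory and script path as the first node; adjust these
      values to match your environment.
      <programlisting>
        # Node information (values assumed for the second node)
        node_id=2
        node_name='node2'
        conninfo='host=node2 dbname=bdrtest user=repmgr connect_timeout=2'
        data_directory='/var/lib/postgresql/data'
        replication_type='bdr'

        # Event notification configuration (same as on node1)
        event_notifications=bdr_failover
        event_notification_command='/path/to/bdr-pgbouncer.sh %n %e %s "%c" "%a" >> /tmp/bdr-failover.log 2>&1'

        # repmgrd options (same as on node1)
        monitor_interval_secs=5
        reconnect_attempts=6
        reconnect_interval=5</programlisting>
    </para>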
    <para>
      Note that the values provided for the <varname>conninfo</varname> string
      must be valid for connections from <emphasis>both</emphasis> nodes in the
      replication cluster. The database must be the BDR-enabled database.
    </para>
    <para>
      If defined, the <varname>event_notifications</varname> parameter
      will restrict execution of <varname>event_notification_command</varname>
      to the specified event(s).
    </para>
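    <para>
      For example, assuming the script should also be invoked when a failed
      node reappears, a comma-separated list of events could be given, as in
      this sketch (the <varname>bdr_recovery</varname> event is described in
      <xref linkend="bdr-node-recovery">). The script would then need to
      handle both event types itself.
      <programlisting>
        # Only run event_notification_command for these two event types
        event_notifications=bdr_failover,bdr_recovery</programlisting>
    </para>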
    <note>
      <simpara>
        <varname>event_notification_command</varname> is the script which does the actual "heavy lifting"
        of reconfiguring the proxy server/connection pooler. It is fully
        user-definable; a reference implementation is documented below.
      </simpara>
    </note>

  </sect1>

  <sect1 id="bdr-repmgr-setup" xreflabel="repmgr setup with BDR">
    <title>repmgr setup</title>
    <para>
      Register both nodes; example on <literal>node1</literal>:
      <programlisting>
        $ repmgr -f /etc/repmgr.conf bdr register
        NOTICE: attempting to install extension "repmgr"
        NOTICE: "repmgr" extension successfully installed
        NOTICE: node record created for node 'node1' (ID: 1)
        NOTICE: BDR node 1 registered (conninfo: host=node1 dbname=bdrtest user=repmgr)</programlisting>
    </para>
    <para>
      and on <literal>node2</literal>:
      <programlisting>
        $ repmgr -f /etc/repmgr.conf bdr register
        NOTICE: node record created for node 'node2' (ID: 2)
        NOTICE: BDR node 2 registered (conninfo: host=node2 dbname=bdrtest user=repmgr)</programlisting>
    </para>
    <para>
      The <literal>repmgr</literal> extension will be automatically created
      when the first node is registered, and will be propagated to the second
      node.
    </para>
    <important>
      <simpara>
        Ensure the &repmgr; package is available on both nodes before
        attempting to register the first node.
      </simpara>
    </important>
    <para>
      At this point the metadata for both nodes has been created; executing
      <xref linkend="repmgr-cluster-show"> (on either node) should produce output like this:
      <programlisting>
        $ repmgr -f /etc/repmgr.conf cluster show
        ID | Name  | Role | Status    | Upstream | Location | Connection string
       ----+-------+------+-----------+----------+----------+--------------------------------------------------------
        1  | node1 | bdr  | * running |          | default  | host=node1 dbname=bdrtest user=repmgr connect_timeout=2
        2  | node2 | bdr  | * running |          | default  | host=node2 dbname=bdrtest user=repmgr connect_timeout=2</programlisting>
    </para>
    <para>
      Additionally, it's possible to display a log of significant events; executing
      <xref linkend="repmgr-cluster-event"> (on either node) should produce output like this:
      <programlisting>
        $ repmgr -f /etc/repmgr.conf cluster event
        Node ID | Event        | OK | Timestamp           | Details
       ---------+--------------+----+---------------------+----------------------------------------------
        2       | bdr_register | t  | 2017-07-27 17:51:48 | node record created for node 'node2' (ID: 2)
        1       | bdr_register | t  | 2017-07-27 17:51:00 | node record created for node 'node1' (ID: 1)
      </programlisting>
    </para>
    <para>
      At this point there will only be records for the two node registrations (displayed here
      in reverse chronological order).
    </para>
  </sect1>

  <sect1 id="bdr-event-notification-command" xreflabel="BDR failover event notification command">
    <title>Defining the "event_notification_command"</title>
    <para>
      Key to "failover" execution is the <literal>event_notification_command</literal>,
      which is a user-definable script specified in <filename>repmgr.conf</filename>
      and which should reconfigure the proxy server/connection pooler to point
      to the other, still-active node.
    </para>
    <para>
      Each time &repmgr; (or <application>repmgrd</application>) records an event,
      it can optionally execute the script defined in
      <literal>event_notification_command</literal> to take further action;
      details of the event will be passed as parameters.
    </para>
    <para>
      The following placeholders are available to the script:
    </para>

    <variablelist>
      <varlistentry>
        <term><option>%n</option></term>
        <listitem>
          <para>
            node ID
          </para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term><option>%e</option></term>
        <listitem>
          <para>
            event type
          </para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term><option>%s</option></term>
        <listitem>
          <para>
            success (1 or 0)
          </para>
        </listitem>
      </varlistentry>
      <varlistentry>
        <term><option>%t</option></term>
        <listitem>
          <para>
            timestamp
          </para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term><option>%d</option></term>
        <listitem>
          <para>
            details
          </para>
        </listitem>
      </varlistentry>
    </variablelist>

    <para>
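      Note that <literal>%c</literal> (the <varname>conninfo</varname> string of the next available node)
      and <literal>%a</literal> (the name of the next available node) will only be provided during
      <varname>bdr_failover</varname> events, which are the events of interest here.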
    </para>
    <para>
      The provided sample script (<filename>scripts/bdr-pgbouncer.sh</filename>) is configured like
      this:
      <programlisting>
        event_notification_command='/path/to/bdr-pgbouncer.sh %n %e %s "%c" "%a"'</programlisting>
    </para>
    <para>
      and parses the configured parameters like this:
      <programlisting>
        NODE_ID=$1
        EVENT_TYPE=$2
        SUCCESS=$3
        NEXT_CONNINFO=$4
        NEXT_NODE_NAME=$5</programlisting>
    </para>
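    <para>
      Once the parameters have been parsed, a script typically has to decide
      whether the event is one it should act on. The following is a minimal
      sketch of such a check, not an excerpt from the sample script;
      <literal>decide_action</literal> is a hypothetical helper, and treating
      a success value other than <literal>1</literal> as "do nothing" is an
      assumption made for illustration.
      <programlisting>
        #!/bin/sh
        # Sketch: decide what to do with an event passed in by repmgrd.
        # "decide_action" is a hypothetical helper, not part of repmgr.
        decide_action() {
            event_type=$1
            success=$2

            if [ "$event_type" != "bdr_failover" ]; then
                echo "ignore"          # not an event this script handles
            elif [ "$success" != "1" ]; then
                echo "ignore"          # event was not recorded as successful
            else
                echo "reconfigure"     # point PgBouncer at the next available node
            fi
        }</programlisting>
    </para>
    <para>
      With the variables shown above, it could be called as
      <literal>decide_action "$EVENT_TYPE" "$SUCCESS"</literal>.
    </para>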
    <para>
      The script also contains some hard-coded values about the <application>PgBouncer</application>
      configuration for both nodes; these will need to be adjusted for your local environment
      (ideally the scripts would be maintained as templates and generated by some
      kind of provisioning system).
    </para>

    <para>
      The script performs the following steps:
      <itemizedlist spacing="compact" mark="bullet">
        <listitem>
          <simpara>pauses <application>PgBouncer</application> on all nodes</simpara>
        </listitem>
        <listitem>
          <simpara>recreates the <application>PgBouncer</application> configuration file on each
            node using the information provided by <application>repmgrd</application>
            (primarily the <varname>conninfo</varname> string) to configure
            <application>PgBouncer</application></simpara>
        </listitem>
        <listitem>
          <simpara>reloads the <application>PgBouncer</application> configuration</simpara>
        </listitem>
        <listitem>
          <simpara>executes the <command>RESUME</command> command (in <application>PgBouncer</application>)</simpara>
        </listitem>
      </itemizedlist>
    </para>
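    <para>
      The steps above can be sketched as a dry run in which the functions only
      print what would be done. This is an illustration under stated
      assumptions rather than the sample script itself: all function names are
      hypothetical, the <varname>conninfo</varname> string is assumed to be a
      simple space-separated list without quoted values, and the
      <command>psql</command> invocation in the comment assumes the
      <application>PgBouncer</application> admin console is reachable on port 6432
      by a user named <literal>pgbouncer</literal>.
      <programlisting>
        #!/bin/sh
        # Extract the "host" value from a simple space-separated conninfo string
        # (quoted values containing spaces are not handled).
        conninfo_host() {
            for kv in $1; do
                case "$kv" in
                    host=*) echo "${kv#host=}" ;;
                esac
            done
        }

        # Print a PgBouncer [databases] section routing "dbname" to the given host.
        render_databases_ini() {
            dbname=$1
            host=$2
            printf '[databases]\n%s = host=%s dbname=%s\n' "$dbname" "$host" "$dbname"
        }

        # Print the PgBouncer admin console commands in the order the steps use them.
        # A real script might send each one with something like:
        #   psql -h PGBOUNCER_HOST -p 6432 -U pgbouncer pgbouncer -c "PAUSE"
        admin_command_order() {
            printf '%s\n' PAUSE RELOAD RESUME
        }</programlisting>
    </para>
    <para>
      In a real script the rendered file would be written to each
      <application>PgBouncer</application> server between the
      <command>PAUSE</command> and <command>RELOAD</command> commands.
    </para>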
    <para>
      Following successful script execution, any connections to PgBouncer on the failed BDR node
      will be redirected to the active node.
    </para>
  </sect1>

  <sect1 id="bdr-monitoring-failover" xreflabel="Node monitoring and failover">
    <title>Node monitoring and failover</title>
    <para>
      At the intervals specified by <varname>monitor_interval_secs</varname>
      in <filename>repmgr.conf</filename>, <application>repmgrd</application>
      will ping each node to check if it's available. If a node isn't available,
      <application>repmgrd</application> will check again up to <varname>reconnect_attempts</varname>
      times at intervals of <varname>reconnect_interval</varname> seconds to confirm the node is definitely unreachable.
      This buffer period is necessary to avoid false positives caused by transient
      network outages.
    </para>
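    <para>
      As a rough worked example, the length of this buffer period can be
      estimated from the sample settings shown earlier. The calculation below
      is only an approximation: it assumes each check is followed by a sleep of
      <varname>reconnect_interval</varname> seconds and ignores the time taken
      by the connection attempts themselves.
      <programlisting>
        # Values taken from the sample repmgr.conf
        reconnect_attempts=6
        reconnect_interval=5

        # Approximate time spent re-checking before the node is declared failed
        confirm_secs=$((reconnect_attempts * reconnect_interval))
        echo "approximate confirmation period: ${confirm_secs}s"</programlisting>
    </para>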
    <para>
      If the node is still unavailable, <application>repmgrd</application> will enter failover mode and execute
      the script defined in <varname>event_notification_command</varname>; an entry will be logged
      in the <literal>repmgr.events</literal> table and <application>repmgrd</application> will
      (unless otherwise configured) resume monitoring of the node in "degraded" mode until it reappears.
    </para>
    <para>
      <application>repmgrd</application> logfile output during a failover event will look something like this
      on one node (usually the node which has failed, here <literal>node2</literal>):
      <programlisting>
            ...
    [2017-07-27 21:08:39] [INFO] starting continuous BDR node monitoring
    [2017-07-27 21:08:39] [INFO] monitoring BDR replication status on node "node2" (ID: 2)
    [2017-07-27 21:08:55] [INFO] monitoring BDR replication status on node "node2" (ID: 2)
    [2017-07-27 21:09:11] [INFO] monitoring BDR replication status on node "node2" (ID: 2)
    [2017-07-27 21:09:23] [WARNING] unable to connect to node node2 (ID 2)
    [2017-07-27 21:09:23] [INFO] checking state of node 2, 0 of 5 attempts
    [2017-07-27 21:09:23] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-07-27 21:09:24] [INFO] checking state of node 2, 1 of 5 attempts
    [2017-07-27 21:09:24] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-07-27 21:09:25] [INFO] checking state of node 2, 2 of 5 attempts
    [2017-07-27 21:09:25] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-07-27 21:09:26] [INFO] checking state of node 2, 3 of 5 attempts
    [2017-07-27 21:09:26] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-07-27 21:09:27] [INFO] checking state of node 2, 4 of 5 attempts
    [2017-07-27 21:09:27] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-07-27 21:09:28] [WARNING] unable to reconnect to node 2 after 5 attempts
    [2017-07-27 21:09:28] [NOTICE] setting node record for node 2 to inactive
    [2017-07-27 21:09:28] [INFO] executing notification command for event "bdr_failover"
    [2017-07-27 21:09:28] [DETAIL] command is:
      /path/to/bdr-pgbouncer.sh 2 bdr_failover 1 "host=node1 dbname=bdrtest user=repmgr connect_timeout=2" "node1"
    [2017-07-27 21:09:28] [INFO] node 'node2' (ID: 2) detected as failed; next available node is 'node1' (ID: 1)
    [2017-07-27 21:09:28] [INFO] monitoring BDR replication status on node "node2" (ID: 2)
    [2017-07-27 21:09:28] [DETAIL] monitoring node "node2" (ID: 2) in degraded mode
    ...</programlisting>
    </para>
    <para>
      Output on the other node (<literal>node1</literal>) during the same event will look like this:
      <programlisting>
    ...
    [2017-07-27 21:08:35] [INFO] starting continuous BDR node monitoring
    [2017-07-27 21:08:35] [INFO] monitoring BDR replication status on node "node1" (ID: 1)
    [2017-07-27 21:08:51] [INFO] monitoring BDR replication status on node "node1" (ID: 1)
    [2017-07-27 21:09:07] [INFO] monitoring BDR replication status on node "node1" (ID: 1)
    [2017-07-27 21:09:23] [WARNING] unable to connect to node node2 (ID 2)
    [2017-07-27 21:09:23] [INFO] checking state of node 2, 0 of 5 attempts
    [2017-07-27 21:09:23] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-07-27 21:09:24] [INFO] checking state of node 2, 1 of 5 attempts
    [2017-07-27 21:09:24] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-07-27 21:09:25] [INFO] checking state of node 2, 2 of 5 attempts
    [2017-07-27 21:09:25] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-07-27 21:09:26] [INFO] checking state of node 2, 3 of 5 attempts
    [2017-07-27 21:09:26] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-07-27 21:09:27] [INFO] checking state of node 2, 4 of 5 attempts
    [2017-07-27 21:09:27] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-07-27 21:09:28] [WARNING] unable to reconnect to node 2 after 5 attempts
    [2017-07-27 21:09:28] [NOTICE] other node's repmgrd is handling failover
    [2017-07-27 21:09:28] [INFO] monitoring BDR replication status on node "node1" (ID: 1)
    [2017-07-27 21:09:28] [DETAIL] monitoring node "node2" (ID: 2) in degraded mode
    ...</programlisting>
    </para>
    <para>
      This assumes only the PostgreSQL instance on <literal>node2</literal> has failed. In this case the
      <application>repmgrd</application> instance running on <literal>node2</literal> has performed the failover. However, if
      the entire server becomes unavailable, <application>repmgrd</application> on <literal>node1</literal> will perform
      the failover.
    </para>
  </sect1>
  <sect1 id="bdr-node-recovery" xreflabel="Node recovery">
    <title>Node recovery</title>
    <para>
      Following failure of a BDR node, if the node subsequently becomes available again,
      a <varname>bdr_recovery</varname> event will be generated. This could potentially be used to
      reconfigure PgBouncer automatically to bring the node back into the available pool;
      however, it would be prudent to manually verify the node's status before
      exposing it to the application.
    </para>
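    <para>
      One cautious way to act on this event is to record it for manual
      follow-up and leave <application>PgBouncer</application> untouched. The
      following sketch assumes the notification script is also invoked for
      <varname>bdr_recovery</varname> events; <literal>note_recovery</literal>
      and the log file argument are hypothetical.
      <programlisting>
        #!/bin/sh
        # Sketch: on bdr_recovery, only record the event so an operator can
        # check the node before it is exposed to the application again.
        note_recovery() {
            node_id=$1
            logfile=$2
            echo "node $node_id recovered; verify it before re-adding it to PgBouncer" >> "$logfile"
        }</programlisting>
    </para>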
    <para>
      If the failed node comes back up and connects correctly, output similar to this
      will be visible in the <application>repmgrd</application> log:
      <programlisting>
        [2017-07-27 21:25:30] [DETAIL] monitoring node "node2" (ID: 2) in degraded mode
        [2017-07-27 21:25:46] [INFO] monitoring BDR replication status on node "node2" (ID: 2)
        [2017-07-27 21:25:46] [DETAIL] monitoring node "node2" (ID: 2) in degraded mode
        [2017-07-27 21:25:55] [INFO] active replication slot for node "node1" found after 1 seconds
        [2017-07-27 21:25:55] [NOTICE] node "node2" (ID: 2) has recovered after 986 seconds</programlisting>
    </para>
  </sect1>

  <sect1 id="bdr-complete-shutdown" xreflabel="Shutdown of both nodes">
    <title>Shutdown of both nodes</title>
    <para>
      If both PostgreSQL instances are shut down, <application>repmgrd</application> will try to handle the
      situation as gracefully as possible, though with no failover candidates available
      there's not much it can do. Should this case ever occur, we recommend shutting
      down <application>repmgrd</application> on both nodes and restarting it once the PostgreSQL instances
      are running properly.
    </para>
  </sect1>
</chapter>