BDR failover with repmgrd
=========================

`repmgr 4` provides support for monitoring BDR nodes and taking action in case
one of the nodes fails.

*NOTE* Due to the nature of BDR, it's only safe to use this solution for
a two-node scenario. Introducing additional nodes will create an inherent
risk of node desynchronisation if a node goes down without being cleanly
removed from the cluster.

In contrast to streaming replication, there's no concept of "promoting" a new
primary node with BDR. Instead, "failover" involves monitoring both nodes
with `repmgrd` and redirecting queries from the failed node to the remaining
active node. This can be done by using the event notification facility provided
by `repmgrd` to dynamically reconfigure a proxy server/connection pooler such
as PgBouncer.

Prerequisites
-------------

`repmgr 4` requires PostgreSQL 9.6 with the BDR 2 extension enabled and
configured for a two-node BDR network. `repmgr 4` packages
must be installed on each node before attempting to configure repmgr.

*NOTE* `repmgr 4` will refuse to install if it detects more than two
BDR nodes.

Application database connections *must* be passed through a proxy server/
connection pooler such as PgBouncer, and it must be possible to dynamically
reconfigure that from `repmgrd`. The example demonstrated in this document
uses PgBouncer.

The proxy server/connection pooler must not be installed on the database
servers.

For this example, it's assumed password-less SSH connections are available
from the PostgreSQL servers to the servers where PgBouncer runs, and
that the user on those servers has permission to alter the PgBouncer
configuration files.

PostgreSQL connections must be possible between each node, and each node
must be able to connect to each PgBouncer instance.

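As an illustrative sketch of what this implies, a minimal PgBouncer configuration
routing the application database via "node1" might look like the following; the
database name `bdrtest` matches the examples below, while the port, authentication
settings and paths are assumptions to be adapted to your environment:

    ; /etc/pgbouncer/pgbouncer.ini (illustrative sketch)
    [databases]
    ; applications connect to "bdrtest" on the PgBouncer host; PgBouncer
    ; forwards the connection to the currently active BDR node
    bdrtest = host=node1 port=5432 dbname=bdrtest

    [pgbouncer]
    listen_addr = *
    listen_port = 6432
    auth_type = md5
    auth_file = /etc/pgbouncer/userlist.txt
    ; the failover script will connect as an admin user to pause/resume
    admin_users = postgres

It is this `[databases]` entry which the failover mechanism described below
rewrites so that it points at the remaining active node.
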
Configuration
-------------

Sample configuration for `repmgr.conf`:

    node_id=1
    node_name='node1'
    conninfo='host=node1 dbname=bdrtest user=repmgr connect_timeout=2'
    replication_type='bdr'

    event_notifications=bdr_failover
    event_notification_command='/path/to/bdr-pgbouncer.sh %n %e %s "%c" "%a" >> /tmp/bdr-failover.log 2>&1'

    # repmgrd options
    monitor_interval_secs=5
    reconnect_attempts=6
    reconnect_interval=5

Adjust settings as appropriate; copy and adjust for the second node (particularly
the values `node_id`, `node_name` and `conninfo`).

Note that the values provided for the `conninfo` string must be valid for
connections from *both* nodes in the cluster. The database must be the BDR
database.

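For illustration, the corresponding `repmgr.conf` on the second node would differ
only in those three values (assuming the host name `node2`, matching the
`cluster show` output later in this document):

    node_id=2
    node_name='node2'
    conninfo='host=node2 dbname=bdrtest user=repmgr connect_timeout=2'
    replication_type='bdr'

    # remaining settings identical to node1
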
If defined, `event_notifications` will restrict execution of `event_notification_command`
to the specified events.

`event_notification_command` is the script which does the actual "heavy lifting"
of reconfiguring the proxy server/connection pooler. It is fully user-definable;
a sample implementation is documented below.

repmgr user permissions
-----------------------

`repmgr` will create an extension in the BDR database containing objects
for administering `repmgr` metadata. The user defined in the `conninfo`
setting must be able to access all objects. Additionally, superuser permissions
are required to install the `repmgr` extension. The easiest way to do this
is to create the `repmgr` user as a superuser; however, if this is not
desirable, the `repmgr` user can be created as a normal user and a
superuser specified with `--superuser` when registering a BDR node.

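A minimal sketch of the two approaches (run the SQL as an existing superuser; the
superuser name `postgres` below is an assumption for illustration):

    -- either: create the repmgr user as a superuser
    CREATE USER repmgr SUPERUSER;

    -- or: create repmgr as a normal user
    CREATE USER repmgr;

With the second approach, name an existing superuser when registering the node:

    $ repmgr -f /etc/repmgr.conf bdr register --superuser=postgres
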
repmgr setup
------------

Register both nodes:

    $ repmgr -f /etc/repmgr.conf bdr register
    NOTICE: attempting to install extension "repmgr"
    NOTICE: "repmgr" extension successfully installed
    NOTICE: node record created for node 'node1' (ID: 1)
    NOTICE: BDR node 1 registered (conninfo: host=localhost dbname=bdrtest user=repmgr port=5501)

    $ repmgr -f /etc/repmgr.conf bdr register
    NOTICE: node record created for node 'node2' (ID: 2)
    NOTICE: BDR node 2 registered (conninfo: host=localhost dbname=bdrtest user=repmgr port=5502)

The `repmgr` extension will be automatically created when the first
node is registered, and will be propagated to the second node.

*IMPORTANT* Ensure the repmgr package is available on both nodes before
attempting to register the first node.

At this point the metadata for both nodes has been created; executing
`repmgr cluster show` (on either node) should produce output like this:

    $ repmgr -f /etc/repmgr.conf cluster show
     ID | Name  | Role | Status    | Upstream | Connection string
    ----+-------+------+-----------+----------+----------------------------------------------------------
     1  | node1 | bdr  | * running |          | host=node1 dbname=bdrtest user=repmgr connect_timeout=2
     2  | node2 | bdr  | * running |          | host=node2 dbname=bdrtest user=repmgr connect_timeout=2

Additionally it's possible to see a log of significant events; so far
this will only record the two node registrations (in reverse chronological order):

     Node ID | Event        | OK | Timestamp           | Details
    ---------+--------------+----+---------------------+----------------------------------------------
     2       | bdr_register | t  | 2017-07-27 17:51:48 | node record created for node 'node2' (ID: 2)
     1       | bdr_register | t  | 2017-07-27 17:51:00 | node record created for node 'node1' (ID: 1)

Defining the "event_notification_command"
-----------------------------------------

Key to "failover" execution is the `event_notification_command`, which is a
user-definable script which should reconfigure the proxy server/
connection pooler.

Each time `repmgr` (or `repmgrd`) records an event, it can optionally
execute the script defined in `event_notification_command` to
take further action; details of the event will be passed as parameters.
The following placeholders are available to the script:

    %n - node ID
    %e - event type
    %s - success (1 or 0)
    %t - timestamp
    %d - details
    %c - conninfo string of the next available node
    %a - name of the next available node

Note that `%c` and `%a` will only be provided during `bdr_failover`
events, which is what is of interest here.

The provided sample script (`scripts/bdr-pgbouncer.sh`) is configured like
this:

    event_notification_command='/path/to/bdr-pgbouncer.sh %n %e %s "%c" "%a"'

and parses the configured parameters like this:

    NODE_ID=$1
    EVENT_TYPE=$2
    SUCCESS=$3
    NEXT_CONNINFO=$4
    NEXT_NODE_NAME=$5

It also contains some hard-coded values for the PgBouncer configuration on
both nodes; these will of course need to be adjusted for your local environment
(ideally the scripts would be maintained as templates and generated by some
kind of provisioning system).

The script performs the following steps (a simplified sketch follows the list):

- pauses PgBouncer on all nodes
- recreates the PgBouncer configuration file on each node using the information
  provided by `repmgrd` (mainly the `conninfo` string) to configure PgBouncer
  to point to the remaining node
- reloads the PgBouncer configuration
- resumes PgBouncer

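Below is a simplified, illustrative sketch of such a script; it is not the shipped
`scripts/bdr-pgbouncer.sh`, and the PgBouncer host names, port, admin user and
file paths are assumptions for the example:

    #!/bin/bash
    # Illustrative sketch of an event notification script for "bdr_failover".
    # Assumptions (adjust for your environment): two PgBouncer hosts, PgBouncer
    # listening on port 6432, admin console access as user "postgres", and
    # passwordless SSH from this host to the PgBouncer hosts.

    NODE_ID=$1
    EVENT_TYPE=$2
    SUCCESS=$3
    NEXT_CONNINFO=$4
    NEXT_NODE_NAME=$5

    PGBOUNCER_HOSTS="pgbouncer1 pgbouncer2"
    PGBOUNCER_PORT=6432
    PGBOUNCER_DATABASE=bdrtest
    PGBOUNCER_INI=/etc/pgbouncer/pgbouncer.ini

    # 1. Pause PgBouncer on all hosts so no new connections are handed out
    for host in $PGBOUNCER_HOSTS; do
        psql -h "$host" -p $PGBOUNCER_PORT -U postgres pgbouncer -c "PAUSE"
    done

    # 2. Rewrite the configuration on each PgBouncer host so the pooled
    #    database points at the remaining active node ($NEXT_CONNINFO)
    for host in $PGBOUNCER_HOSTS; do
        ssh "$host" "cat > $PGBOUNCER_INI" <<EOF
    [databases]
    $PGBOUNCER_DATABASE = $NEXT_CONNINFO

    [pgbouncer]
    listen_addr = *
    listen_port = $PGBOUNCER_PORT
    auth_type = md5
    auth_file = /etc/pgbouncer/userlist.txt
    admin_users = postgres
    EOF
    done

    # 3. Reload the configuration and resume connection handling
    for host in $PGBOUNCER_HOSTS; do
        psql -h "$host" -p $PGBOUNCER_PORT -U postgres pgbouncer -c "RELOAD"
        psql -h "$host" -p $PGBOUNCER_PORT -U postgres pgbouncer -c "RESUME"
    done

A production version would also need error handling, and should verify that
`EVENT_TYPE` is `bdr_failover` if `event_notifications` is not restricted to
that event.
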
From that point, any connections to PgBouncer on the failed BDR node will be redirected
to the active node.


repmgrd
-------

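`repmgrd` must be running on both nodes for monitoring and failover to take place.
As a minimal sketch, assuming the configuration file location used in the preceding
examples, it would be started on each node with:

    $ repmgrd -f /etc/repmgr.conf

The exact daemonising and logging options depend on your installation and packaging;
consult `repmgrd --help` for the options available.
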
Node monitoring and failover
----------------------------

At the intervals specified by `monitor_interval_secs` in `repmgr.conf`, `repmgrd`
will ping each node to check if it's available. If a node isn't available,
`repmgrd` will enter failover mode and check `reconnect_attempts` times
at intervals of `reconnect_interval` to confirm the node is definitely unreachable.
This buffer period is necessary to avoid false positives caused by transient
network outages. With the sample settings above (`reconnect_attempts=6`,
`reconnect_interval=5`), roughly 30 seconds will elapse between the first
failed check and the decision that the node is down.

If the node is still unavailable, `repmgrd` will proceed with failover and execute
the script defined in `event_notification_command`; an entry will be logged
in the `repmgr.events` table and `repmgrd` will (unless otherwise configured)
resume monitoring of the node in "degraded" mode until it reappears.

`repmgrd` logfile output during a failover event will look something like this
on one node (usually the node which has failed, here "node2"):

    ...
    [2017-07-27 21:08:39] [INFO] starting continuous BDR node monitoring
    [2017-07-27 21:08:39] [INFO] monitoring BDR replication status on node "node2" (ID: 2)
    [2017-07-27 21:08:55] [INFO] monitoring BDR replication status on node "node2" (ID: 2)
    [2017-07-27 21:09:11] [INFO] monitoring BDR replication status on node "node2" (ID: 2)
    [2017-07-27 21:09:23] [WARNING] unable to connect to node node2 (ID 2)
    [2017-07-27 21:09:23] [INFO] checking state of node 2, 0 of 5 attempts
    [2017-07-27 21:09:23] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-07-27 21:09:24] [INFO] checking state of node 2, 1 of 5 attempts
    [2017-07-27 21:09:24] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-07-27 21:09:25] [INFO] checking state of node 2, 2 of 5 attempts
    [2017-07-27 21:09:25] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-07-27 21:09:26] [INFO] checking state of node 2, 3 of 5 attempts
    [2017-07-27 21:09:26] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-07-27 21:09:27] [INFO] checking state of node 2, 4 of 5 attempts
    [2017-07-27 21:09:27] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-07-27 21:09:28] [WARNING] unable to reconnect to node 2 after 5 attempts
    [2017-07-27 21:09:28] [NOTICE] setting node record for node 2 to inactive
    [2017-07-27 21:09:28] [INFO] executing notification command for event "bdr_failover"
    [2017-07-27 21:09:28] [DETAIL] command is:
      /path/to/bdr-pgbouncer.sh 2 bdr_failover 1 "host=host=node1 dbname=bdrtest user=repmgr connect_timeout=2" "node1"
    [2017-07-27 21:09:28] [INFO] node 'node2' (ID: 2) detected as failed; next available node is 'node1' (ID: 1)
    [2017-07-27 21:09:28] [INFO] monitoring BDR replication status on node "node2" (ID: 2)
    [2017-07-27 21:09:28] [DETAIL] monitoring node "node2" (ID: 2) in degraded mode
    ...

Output on the other node ("node1") during the same event will look like this:

    [2017-07-27 21:08:35] [INFO] starting continuous BDR node monitoring
    [2017-07-27 21:08:35] [INFO] monitoring BDR replication status on node "node1" (ID: 1)
    [2017-07-27 21:08:51] [INFO] monitoring BDR replication status on node "node1" (ID: 1)
    [2017-07-27 21:09:07] [INFO] monitoring BDR replication status on node "node1" (ID: 1)
    [2017-07-27 21:09:23] [WARNING] unable to connect to node node2 (ID 2)
    [2017-07-27 21:09:23] [INFO] checking state of node 2, 0 of 5 attempts
    [2017-07-27 21:09:23] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-07-27 21:09:24] [INFO] checking state of node 2, 1 of 5 attempts
    [2017-07-27 21:09:24] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-07-27 21:09:25] [INFO] checking state of node 2, 2 of 5 attempts
    [2017-07-27 21:09:25] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-07-27 21:09:26] [INFO] checking state of node 2, 3 of 5 attempts
    [2017-07-27 21:09:26] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-07-27 21:09:27] [INFO] checking state of node 2, 4 of 5 attempts
    [2017-07-27 21:09:27] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-07-27 21:09:28] [WARNING] unable to reconnect to node 2 after 5 attempts
    [2017-07-27 21:09:28] [NOTICE] other node's repmgrd is handling failover
    [2017-07-27 21:09:28] [INFO] monitoring BDR replication status on node "node1" (ID: 1)
    [2017-07-27 21:09:28] [DETAIL] monitoring node "node2" (ID: 2) in degraded mode

This assumes only the PostgreSQL instance on "node2" has failed. In this case the
`repmgrd` instance running on "node2" has performed the failover. However if
the entire server becomes unavailable, `repmgrd` on "node1" will perform
the failover.

Node recovery
-------------

Following failure of a BDR node, if the node subsequently becomes available again,
a `bdr_recovery` event will be generated. This could potentially be used to
reconfigure PgBouncer automatically to bring the node back into the available pool;
however, it would be prudent to manually verify the node's status before
exposing it to the application.

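To have the notification script invoked for recovery events as well, `bdr_recovery`
would need to be added to the `event_notifications` list in `repmgr.conf`, e.g.:

    event_notifications=bdr_failover,bdr_recovery

The script would then need to distinguish the two cases by inspecting the event
type passed via the `%e` placeholder.
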
If the failed node comes back up and connects correctly, output similar to this
will be visible in the `repmgrd` log:

    [2017-07-27 21:25:30] [DETAIL] monitoring node "node2" (ID: 2) in degraded mode
    [2017-07-27 21:25:46] [INFO] monitoring BDR replication status on node "node2" (ID: 2)
    [2017-07-27 21:25:46] [DETAIL] monitoring node "node2" (ID: 2) in degraded mode
    [2017-07-27 21:25:55] [INFO] active replication slot for node "node1" found after 1 seconds
    [2017-07-27 21:25:55] [NOTICE] node "node2" (ID: 2) has recovered after 986 seconds

Shutdown of both nodes
----------------------

If both PostgreSQL instances are shut down, `repmgrd` will try to handle the
situation as gracefully as possible, though with no failover candidates available
there's not much it can do. Should this case ever occur, we recommend shutting
down `repmgrd` on both nodes and restarting it once the PostgreSQL instances
are running properly.