BDR failover with repmgrdrepmgrdBDRBDR
&repmgr; 4.x provides support for monitoring a pair of BDR 2.x nodes and taking action in
case one of the nodes fails.
Due to the nature of BDR 1.x/2.x, it's only safe to use this solution for
a two-node scenario. Introducing additional nodes will create an inherent
risk of node desynchronisation if a node goes down without being cleanly
removed from the cluster.
In contrast to streaming replication, there's no concept of "promoting" a new
primary node with BDR. Instead, "failover" involves monitoring both nodes
with &repmgrd; and redirecting queries from the failed node to the remaining
active node. This can be done by using an
event notification script
which is called by &repmgrd; to dynamically
reconfigure a proxy server/connection pooler such as PgBouncer.
This &repmgr; functionality is for BDR 2.x only running on PostgreSQL 9.4/9.6.
It is not required for later BDR versions.
Prerequisites
This &repmgr; functionality is for BDR 2.x only running on PostgreSQL 9.4/9.6.
It is not required for later BDR versions.
&repmgr; 4 requires PostgreSQL 9.4 or 9.6 with the BDR 2 extension
enabled and configured for a two-node BDR network. &repmgr; 4 packages
must be installed on each node before attempting to configure
repmgr.
&repmgr; 4 will refuse to install if it detects more than two BDR nodes.
Application database connections *must* be passed through a proxy server/
connection pooler such as PgBouncer, and it must be possible to dynamically
reconfigure that from &repmgrd;. The example demonstrated in this document
will use PgBouncer
The proxy server / connection poolers must not
be installed on the database servers.
For this example, it's assumed password-less SSH connections are available
from the PostgreSQL servers to the servers where PgBouncer
runs, and that the user on those servers has permission to alter the
PgBouncer configuration files.
PostgreSQL connections must be possible between each node, and each node
must be able to connect to each PgBouncer instance.
Configuration
A sample configuration for repmgr.conf on each
BDR node would look like this:
# Node information
node_id=1
node_name='node1'
conninfo='host=node1 dbname=bdrtest user=repmgr connect_timeout=2'
data_directory='/var/lib/postgresql/data'
replication_type='bdr'
# Event notification configuration
event_notifications=bdr_failover
event_notification_command='/path/to/bdr-pgbouncer.sh %n %e %s "%c" "%a" >> /tmp/bdr-failover.log 2>&1'
# repmgrd options
monitor_interval_secs=5
reconnect_attempts=6
reconnect_interval=5
Adjust settings as appropriate; copy and adjust for the second node (particularly
the values node_id, node_name
and conninfo).
Note that the values provided for the conninfo string
must be valid for connections from both nodes in the
replication cluster. The database must be the BDR-enabled database.
If defined, the event_notifications parameter will restrict
execution of the script defined in event_notification_command
to the specified event(s).
event_notification_command is the script which does the actual "heavy lifting"
of reconfiguring the proxy server/ connection pooler. It is fully
user-definable; see section for a reference
implementation.
repmgr setup
Register both nodes; example on node1:
$ repmgr -f /etc/repmgr.conf bdr register
NOTICE: attempting to install extension "repmgr"
NOTICE: "repmgr" extension successfully installed
NOTICE: node record created for node 'node1' (ID: 1)
NOTICE: BDR node 1 registered (conninfo: host=node1 dbname=bdrtest user=repmgr)
and on node1:
$ repmgr -f /etc/repmgr.conf bdr register
NOTICE: node record created for node 'node2' (ID: 2)
NOTICE: BDR node 2 registered (conninfo: host=node2 dbname=bdrtest user=repmgr)
The repmgr extension will be automatically created
when the first node is registered, and will be propagated to the second
node.
Ensure the &repmgr; package is available on both nodes before
attempting to register the first node.
At this point the meta data for both nodes has been created; executing
(on either node) should produce output like this:
$ repmgr -f /etc/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Connection string
----+-------+------+-----------+----------+--------------------------------------------------------
1 | node1 | bdr | * running | | default | host=node1 dbname=bdrtest user=repmgr connect_timeout=2
2 | node2 | bdr | * running | | default | host=node2 dbname=bdrtest user=repmgr connect_timeout=2
Additionally it's possible to display log of significant events; executing
(on either node) should produce output like this:
$ repmgr -f /etc/repmgr.conf cluster event
Node ID | Event | OK | Timestamp | Details
---------+--------------+----+---------------------+----------------------------------------------
2 | bdr_register | t | 2017-07-27 17:51:48 | node record created for node 'node2' (ID: 2)
1 | bdr_register | t | 2017-07-27 17:51:00 | node record created for node 'node1' (ID: 1)
At this point there will only be records for the two node registrations (displayed here
in reverse chronological order).
Defining the BDR failover "event_notification_command"
Key to "failover" execution is the event_notification_command,
which is a user-definable script specified in repmpgr.conf
and which can use a &repmgr; event notification
to reconfigure the proxy server / connection pooler so it points to the other, still-active node.
Details of the event will be passed as parameters to the script.
Following parameter placeholders are available for the script definition in repmpgr.conf;
these will be replaced with the appropriate value when the script is executed:
node ID
event type
success (1 or 0)
timestamp
details
conninfo string of the next available node (bdr_failover and bdr_recovery)
name of the next available node (bdr_failover and bdr_recovery)
Note that %c and %a are only provided with
particular failover events, in this case bdr_failover.
The provided sample script
(scripts/bdr-pgbouncer.sh)
is configured as follows:
event_notification_command='/path/to/bdr-pgbouncer.sh %n %e %s "%c" "%a"'
and parses the placeholder parameters like this:
NODE_ID=$1
EVENT_TYPE=$2
SUCCESS=$3
NEXT_CONNINFO=$4
NEXT_NODE_NAME=$5
The sample script also contains some hard-coded values for the PgBouncer
configuration for both nodes; these will need to be adjusted for your local environment
(ideally the scripts would be maintained as templates and generated by some
kind of provisioning system).
The script performs following steps:
pauses PgBouncer on all nodesrecreates the PgBouncer configuration file on each
node using the information provided by &repmgrd;
(primarily the conninfo string) to configure
PgBouncerreloads the PgBouncer configurationexecutes the RESUME command (in PgBouncer)
Following successful script execution, any connections to PgBouncer on the failed BDR node
will be redirected to the active node.
Node monitoring and failover
At the intervals specified by monitor_interval_secs
in repmgr.conf, &repmgrd;
will ping each node to check if it's available. If a node isn't available,
&repmgrd; will enter failover mode and check reconnect_attempts
times at intervals of reconnect_interval to confirm the node is definitely unreachable.
This buffer period is necessary to avoid false positives caused by transient
network outages.
If the node is still unavailable, &repmgrd; will enter failover mode and execute
the script defined in event_notification_command; an entry will be logged
in the repmgr.events table and &repmgrd; will
(unless otherwise configured) resume monitoring of the node in "degraded" mode until it reappears.
&repmgrd; logfile output during a failover event will look something like this
on one node (usually the node which has failed, here node2):
...
[2017-07-27 21:08:39] [INFO] starting continuous BDR node monitoring
[2017-07-27 21:08:39] [INFO] monitoring BDR replication status on node "node2" (ID: 2)
[2017-07-27 21:08:55] [INFO] monitoring BDR replication status on node "node2" (ID: 2)
[2017-07-27 21:09:11] [INFO] monitoring BDR replication status on node "node2" (ID: 2)
[2017-07-27 21:09:23] [WARNING] unable to connect to node node2 (ID 2)
[2017-07-27 21:09:23] [INFO] checking state of node 2, 0 of 5 attempts
[2017-07-27 21:09:23] [INFO] sleeping 1 seconds until next reconnection attempt
[2017-07-27 21:09:24] [INFO] checking state of node 2, 1 of 5 attempts
[2017-07-27 21:09:24] [INFO] sleeping 1 seconds until next reconnection attempt
[2017-07-27 21:09:25] [INFO] checking state of node 2, 2 of 5 attempts
[2017-07-27 21:09:25] [INFO] sleeping 1 seconds until next reconnection attempt
[2017-07-27 21:09:26] [INFO] checking state of node 2, 3 of 5 attempts
[2017-07-27 21:09:26] [INFO] sleeping 1 seconds until next reconnection attempt
[2017-07-27 21:09:27] [INFO] checking state of node 2, 4 of 5 attempts
[2017-07-27 21:09:27] [INFO] sleeping 1 seconds until next reconnection attempt
[2017-07-27 21:09:28] [WARNING] unable to reconnect to node 2 after 5 attempts
[2017-07-27 21:09:28] [NOTICE] setting node record for node 2 to inactive
[2017-07-27 21:09:28] [INFO] executing notification command for event "bdr_failover"
[2017-07-27 21:09:28] [DETAIL] command is:
/path/to/bdr-pgbouncer.sh 2 bdr_failover 1 "host=host=node1 dbname=bdrtest user=repmgr connect_timeout=2" "node1"
[2017-07-27 21:09:28] [INFO] node 'node2' (ID: 2) detected as failed; next available node is 'node1' (ID: 1)
[2017-07-27 21:09:28] [INFO] monitoring BDR replication status on node "node2" (ID: 2)
[2017-07-27 21:09:28] [DETAIL] monitoring node "node2" (ID: 2) in degraded mode
...
Output on the other node (node1) during the same event will look like this:
...
[2017-07-27 21:08:35] [INFO] starting continuous BDR node monitoring
[2017-07-27 21:08:35] [INFO] monitoring BDR replication status on node "node1" (ID: 1)
[2017-07-27 21:08:51] [INFO] monitoring BDR replication status on node "node1" (ID: 1)
[2017-07-27 21:09:07] [INFO] monitoring BDR replication status on node "node1" (ID: 1)
[2017-07-27 21:09:23] [WARNING] unable to connect to node node2 (ID 2)
[2017-07-27 21:09:23] [INFO] checking state of node 2, 0 of 5 attempts
[2017-07-27 21:09:23] [INFO] sleeping 1 seconds until next reconnection attempt
[2017-07-27 21:09:24] [INFO] checking state of node 2, 1 of 5 attempts
[2017-07-27 21:09:24] [INFO] sleeping 1 seconds until next reconnection attempt
[2017-07-27 21:09:25] [INFO] checking state of node 2, 2 of 5 attempts
[2017-07-27 21:09:25] [INFO] sleeping 1 seconds until next reconnection attempt
[2017-07-27 21:09:26] [INFO] checking state of node 2, 3 of 5 attempts
[2017-07-27 21:09:26] [INFO] sleeping 1 seconds until next reconnection attempt
[2017-07-27 21:09:27] [INFO] checking state of node 2, 4 of 5 attempts
[2017-07-27 21:09:27] [INFO] sleeping 1 seconds until next reconnection attempt
[2017-07-27 21:09:28] [WARNING] unable to reconnect to node 2 after 5 attempts
[2017-07-27 21:09:28] [NOTICE] other node's repmgrd is handling failover
[2017-07-27 21:09:28] [INFO] monitoring BDR replication status on node "node1" (ID: 1)
[2017-07-27 21:09:28] [DETAIL] monitoring node "node2" (ID: 2) in degraded mode
...
This assumes only the PostgreSQL instance on node2 has failed. In this case the
&repmgrd; instance running on node2 has performed the failover. However if
the entire server becomes unavailable, &repmgrd; on node1 will perform
the failover.
Node recovery
Following failure of a BDR node, if the node subsequently becomes available again,
a bdr_recovery event will be generated. This could potentially be used to
reconfigure PgBouncer automatically to bring the node back into the available pool,
however it would be prudent to manually verify the node's status before
exposing it to the application.
If the failed node comes back up and connects correctly, output similar to this
will be visible in the &repmgrd; log:
[2017-07-27 21:25:30] [DETAIL] monitoring node "node2" (ID: 2) in degraded mode
[2017-07-27 21:25:46] [INFO] monitoring BDR replication status on node "node2" (ID: 2)
[2017-07-27 21:25:46] [DETAIL] monitoring node "node2" (ID: 2) in degraded mode
[2017-07-27 21:25:55] [INFO] active replication slot for node "node1" found after 1 seconds
[2017-07-27 21:25:55] [NOTICE] node "node2" (ID: 2) has recovered after 986 secondsShutdown of both nodes
If both PostgreSQL instances are shut down, &repmgrd; will try and handle the
situation as gracefully as possible, though with no failover candidates available
there's not much it can do. Should this case ever occur, we recommend shutting
down &repmgrd; on both nodes and restarting it once the PostgreSQL instances
are running properly.