From dd73039d02341ebd207966b9129cb2675c05e586 Mon Sep 17 00:00:00 2001
From: Ian Barwick
Date: Thu, 27 Jul 2017 21:44:10 +0900
Subject: [PATCH] Update BDR documentation

---
 doc/bdr-failover.md | 138 +++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 129 insertions(+), 9 deletions(-)

diff --git a/doc/bdr-failover.md b/doc/bdr-failover.md
index cf4e5a1a..f8c11bc5 100644
--- a/doc/bdr-failover.md
+++ b/doc/bdr-failover.md
@@ -35,11 +35,18 @@
 will use PgBouncer. The proxy server / connection poolers must not be
 installed on the database servers.
 
+For this example, it's assumed password-less SSH connections are available
+from the PostgreSQL servers to the servers where PgBouncer runs, and
+that the user on those servers has permission to alter the PgBouncer
+configuration files.
+
+PostgreSQL connections must be possible between each node, and each node
+must be able to connect to each PgBouncer instance.
+
 Configuration
 -------------
 
-
 Sample configuration for `repmgr.conf`:
 
     node_id=1
@@ -51,14 +58,16 @@ Sample configuration for `repmgr.conf`:
     event_notification_command='/path/to/bdr-pgbouncer.sh %n %e %s "%c" "%a" >> /tmp/bdr-failover.log 2>&1'
 
     # repmgrd options
-    reconnect_attempts=5
-    reconnect_interval=6
+    monitor_interval_secs=5
+    reconnect_attempts=6
+    reconnect_interval=5
 
 Adjust settings as appropriate; copy and adjust for the second node
 (particularly the values `node_id`, `node_name` and `conninfo`).
 
 Note that the values provided for the `conninfo` string must be valid for
-connections from *both* nodes in the cluster.
+connections from *both* nodes in the cluster. The database must be the BDR
+database.
 
 If defined, `event_notifications` will restrict execution of
 `event_notification_command` to the specified events.
@@ -68,6 +77,17 @@
 of reconfiguring the proxy server / connection pooler. It is fully
 user-definable; a sample implementation is documented below.
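When an event fires, `repmgrd` substitutes the placeholders before executing the command: `%n` is the affected node's ID, `%e` the event type, `%s` the success flag, `%c` a conninfo string and `%a` a node name. A minimal sketch of that expansion (the helper function is hypothetical; the values match the failover log shown later in this document):

```shell
# Hypothetical helper illustrating how repmgrd's placeholder substitution
# produces the command line that appears in the failover log. The script
# path and argument values are illustrative only.
expand_notification_command() {
  local node_id=$1 event=$2 success=$3 conninfo=$4 node_name=$5
  printf '/path/to/bdr-pgbouncer.sh %s %s %s "%s" "%s"\n' \
    "$node_id" "$event" "$success" "$conninfo" "$node_name"
}
```

Calling `expand_notification_command 2 bdr_failover 1 "host=node1 dbname=bdrtest user=repmgr connect_timeout=2" node1` produces the invocation recorded by `repmgrd` during a `bdr_failover` event.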
+repmgr user permissions
+-----------------------
+
+`repmgr` will create an extension in the BDR database containing objects
+for administering `repmgr` metadata. The user defined in the `conninfo`
+setting must be able to access all of these objects. Additionally, superuser
+permissions are required to install the `repmgr` extension. The easiest way
+to do this is to create the `repmgr` user as a superuser; however, if this
+is not desirable, the `repmgr` user can be created as a normal user and a
+superuser specified with `--superuser` when registering a BDR node.
+
 repmgr setup
 ------------
 
@@ -95,9 +115,9 @@
 At this point the metadata for both nodes has been created; executing
 
     $ repmgr -f /etc/repmgr.conf cluster show
 
     ID | Name  | Role | Status    | Upstream | Connection string
-    ----+-------+------+-----------+----------+-----------------------------------------------------
-     1 | node1 | bdr  | * running |          | host=node1 dbname=bdrtest user=repmgr
-     2 | node2 | bdr  | * running |          | host=node2 dbname=bdrtest user=repmgr
+    ----+-------+------+-----------+----------+--------------------------------------------------------
+     1 | node1 | bdr  | * running |          | host=node1 dbname=bdrtest user=repmgr connect_timeout=2
+     2 | node2 | bdr  | * running |          | host=node2 dbname=bdrtest user=repmgr connect_timeout=2
 
 Additionally it's possible to see a log of significant events; so far this
 will only record the two node registrations (in reverse chronological order):
@@ -149,13 +169,94 @@
 both nodes; these will need to be adjusted for your local environment
 of course (ideally the scripts would be maintained as templates and
 generated by some kind of provisioning system).
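The scripts shipped in the repository are canonical; as a rough sketch, the configuration-generation step of such a script might look like the following (file paths, port, auth settings and the admin user name are assumptions, not taken from the repository):

```shell
#!/bin/sh
# Hypothetical sketch of the config-generation step of a bdr-pgbouncer.sh-style
# script. Emits a minimal pgbouncer.ini pointing the BDR database at the
# surviving node; all names and paths are illustrative.

generate_pgbouncer_config() {
  dbname=$1      # e.g. bdrtest
  conninfo=$2    # conninfo string passed by repmgrd as "%c"
  cat <<EOF
[databases]
${dbname} = ${conninfo}

[pgbouncer]
listen_addr = *
listen_port = 6432
auth_type = trust
admin_users = pgbouncer_admin
EOF
}

# The surrounding script would then, for each PgBouncer host (sketch only):
#   psql -h "$host" -p 6432 -U pgbouncer_admin pgbouncer -c "PAUSE"
#   generate_pgbouncer_config bdrtest "$conninfo" | \
#       ssh "$host" 'cat > /etc/pgbouncer/pgbouncer.ini'
#   psql -h "$host" -p 6432 -U pgbouncer_admin pgbouncer -c "RELOAD"
#   psql -h "$host" -p 6432 -U pgbouncer_admin pgbouncer -c "RESUME"
```

`PAUSE`, `RELOAD` and `RESUME` are standard PgBouncer admin-console commands; issuing them over `psql` against the `pgbouncer` pseudo-database avoids dropping client connections while the configuration is rewritten.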
+The script performs the following steps:
+
+ - pauses PgBouncer on all nodes
+ - recreates the PgBouncer configuration file on each node, using the information
+   provided by `repmgrd` (mainly the `conninfo` string) to configure PgBouncer
+   to point to the remaining node
+ - reloads the PgBouncer configuration
+ - resumes PgBouncer
+
+From that point, any connections to PgBouncer on the failed BDR node will be
+redirected to the active node.
 
 repmgrd
 -------
 
-Node failover
--------------
+
+
+Node monitoring and failover
+----------------------------
+
+At the intervals specified by `monitor_interval_secs` in `repmgr.conf`, `repmgrd`
+will ping each node to check if it's available. If a node isn't available,
+`repmgrd` will recheck it `reconnect_attempts` times at intervals of
+`reconnect_interval` seconds to confirm the node is definitely unreachable.
+This buffer period is necessary to avoid false positives caused by transient
+network outages.
+
+If the node is still unavailable, `repmgrd` will enter failover mode and execute
+the script defined in `event_notification_command`; an entry will be logged
+in the `repmgr.events` table and `repmgrd` will (unless otherwise configured)
+resume monitoring of the node in "degraded" mode until it reappears.
+
+`repmgrd` logfile output during a failover event will look something like this
+on one node (usually the node which has failed, here "node2"):
+
+    ...
+    [2017-07-27 21:08:39] [INFO] starting continuous BDR node monitoring
+    [2017-07-27 21:08:39] [INFO] monitoring BDR replication status on node "node2" (ID: 2)
+    [2017-07-27 21:08:55] [INFO] monitoring BDR replication status on node "node2" (ID: 2)
+    [2017-07-27 21:09:11] [INFO] monitoring BDR replication status on node "node2" (ID: 2)
+    [2017-07-27 21:09:23] [WARNING] unable to connect to node node2 (ID 2)
+    [2017-07-27 21:09:23] [INFO] checking state of node 2, 0 of 5 attempts
+    [2017-07-27 21:09:23] [INFO] sleeping 1 seconds until next reconnection attempt
+    [2017-07-27 21:09:24] [INFO] checking state of node 2, 1 of 5 attempts
+    [2017-07-27 21:09:24] [INFO] sleeping 1 seconds until next reconnection attempt
+    [2017-07-27 21:09:25] [INFO] checking state of node 2, 2 of 5 attempts
+    [2017-07-27 21:09:25] [INFO] sleeping 1 seconds until next reconnection attempt
+    [2017-07-27 21:09:26] [INFO] checking state of node 2, 3 of 5 attempts
+    [2017-07-27 21:09:26] [INFO] sleeping 1 seconds until next reconnection attempt
+    [2017-07-27 21:09:27] [INFO] checking state of node 2, 4 of 5 attempts
+    [2017-07-27 21:09:27] [INFO] sleeping 1 seconds until next reconnection attempt
+    [2017-07-27 21:09:28] [WARNING] unable to reconnect to node 2 after 5 attempts
+    [2017-07-27 21:09:28] [NOTICE] setting node record for node 2 to inactive
+    [2017-07-27 21:09:28] [INFO] executing notification command for event "bdr_failover"
+    [2017-07-27 21:09:28] [DETAIL] command is:
+      /path/to/bdr-pgbouncer.sh 2 bdr_failover 1 "host=node1 dbname=bdrtest user=repmgr connect_timeout=2" "node1"
+    [2017-07-27 21:09:28] [INFO] node 'node2' (ID: 2) detected as failed; next available node is 'node1' (ID: 1)
+    [2017-07-27 21:09:28] [INFO] monitoring BDR replication status on node "node2" (ID: 2)
+    [2017-07-27 21:09:28] [DETAIL] monitoring node "node2" (ID: 2) in degraded mode
+    ...
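With the sample `repmgr.conf` shown earlier (`monitor_interval_secs=5`, `reconnect_attempts=6`, `reconnect_interval=5`), the worst-case time to declare a node failed works out to roughly 5 + 6 × 5 = 35 seconds; the log above was apparently captured with shorter settings (5 attempts at 1-second intervals). A small, illustrative sketch of the arithmetic:

```shell
# Rough worst-case failure-detection time: up to one monitoring interval to
# notice the outage, plus the full set of reconnection attempts.
# The function name is hypothetical; only the arithmetic is meaningful.
detection_time() {
  monitor_interval_secs=$1
  reconnect_attempts=$2
  reconnect_interval=$3
  echo $(( monitor_interval_secs + reconnect_attempts * reconnect_interval ))
}

detection_time 5 6 5   # values from the sample repmgr.conf
```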
+
+Output on the other node ("node1") during the same event will look like this:
+
+    [2017-07-27 21:08:35] [INFO] starting continuous BDR node monitoring
+    [2017-07-27 21:08:35] [INFO] monitoring BDR replication status on node "node1" (ID: 1)
+    [2017-07-27 21:08:51] [INFO] monitoring BDR replication status on node "node1" (ID: 1)
+    [2017-07-27 21:09:07] [INFO] monitoring BDR replication status on node "node1" (ID: 1)
+    [2017-07-27 21:09:23] [WARNING] unable to connect to node node2 (ID 2)
+    [2017-07-27 21:09:23] [INFO] checking state of node 2, 0 of 5 attempts
+    [2017-07-27 21:09:23] [INFO] sleeping 1 seconds until next reconnection attempt
+    [2017-07-27 21:09:24] [INFO] checking state of node 2, 1 of 5 attempts
+    [2017-07-27 21:09:24] [INFO] sleeping 1 seconds until next reconnection attempt
+    [2017-07-27 21:09:25] [INFO] checking state of node 2, 2 of 5 attempts
+    [2017-07-27 21:09:25] [INFO] sleeping 1 seconds until next reconnection attempt
+    [2017-07-27 21:09:26] [INFO] checking state of node 2, 3 of 5 attempts
+    [2017-07-27 21:09:26] [INFO] sleeping 1 seconds until next reconnection attempt
+    [2017-07-27 21:09:27] [INFO] checking state of node 2, 4 of 5 attempts
+    [2017-07-27 21:09:27] [INFO] sleeping 1 seconds until next reconnection attempt
+    [2017-07-27 21:09:28] [WARNING] unable to reconnect to node 2 after 5 attempts
+    [2017-07-27 21:09:28] [NOTICE] other node's repmgrd is handling failover
+    [2017-07-27 21:09:28] [INFO] monitoring BDR replication status on node "node1" (ID: 1)
+    [2017-07-27 21:09:28] [DETAIL] monitoring node "node2" (ID: 2) in degraded mode
+
+This assumes only the PostgreSQL instance on "node2" has failed. In this case the
+`repmgrd` instance running on "node2" has performed the failover. However, if
+the entire server becomes unavailable, `repmgrd` on "node1" will perform
+the failover.
 
 
 Node recovery
 -------------
@@ -166,3 +267,22 @@
 a `bdr_recovery` event will be generated.
 This could potentially be used to reconfigure PgBouncer automatically to bring
 the node back into the available pool; however, it would be prudent to manually
 verify the node's status before exposing it to the application.
+
+If the failed node comes back up and connects correctly, output similar to this
+will be visible in the `repmgrd` log:
+
+    [2017-07-27 21:25:30] [DETAIL] monitoring node "node2" (ID: 2) in degraded mode
+    [2017-07-27 21:25:46] [INFO] monitoring BDR replication status on node "node2" (ID: 2)
+    [2017-07-27 21:25:46] [DETAIL] monitoring node "node2" (ID: 2) in degraded mode
+    [2017-07-27 21:25:55] [INFO] active replication slot for node "node1" found after 1 seconds
+    [2017-07-27 21:25:55] [NOTICE] node "node2" (ID: 2) has recovered after 986 seconds
+
+
+Shutdown of both nodes
+----------------------
+
+If both PostgreSQL instances are shut down, `repmgrd` will try to handle the
+situation as gracefully as possible, though with no failover candidates available
+there's not much it can do. Should this case ever occur, we recommend shutting
+down `repmgrd` on both nodes and restarting it once the PostgreSQL instances
+are running properly.