From 5945accd8401c88e506c73bd404f44c69524c7b2 Mon Sep 17 00:00:00 2001
From: Ian Barwick
Date: Wed, 5 Oct 2016 11:25:36 +0900
Subject: [PATCH] Add documentation for repmgrd failover process and failed
 node fencing

Addresses GitHub #200.
---
 README.md                          | 21 +++-
 docs/repmgrd-failover-mechanism.md | 75 +++++++++++++++
 docs/repmgrd-node-fencing.md       | 150 +++++++++++++++++++++++++++++
 3 files changed, 243 insertions(+), 3 deletions(-)
 create mode 100644 docs/repmgrd-failover-mechanism.md
 create mode 100644 docs/repmgrd-node-fencing.md

diff --git a/README.md b/README.md
index 31225880..8c236c8a 100644
--- a/README.md
+++ b/README.md
@@ -580,13 +580,13 @@ base backups and WAL files.
 
 Barman support provides the following advantages:
 
-- the primary node does not need to perform a new backup every time a
+- the master node does not need to perform a new backup every time a
   new standby is cloned;
 
 - a standby node can be disconnected for longer periods without losing
   the ability to catch up, and without causing accumulation of WAL
-  files on the primary node;
+  files on the master node;
 
 - therefore, `repmgr` does not need to use replication slots, and the
-  primary node does not need to set `wal_keep_segments`.
+  master node does not need to set `wal_keep_segments`.
 
 > *NOTE*: In view of the above, Barman support is incompatible with
 > the `use_replication_slots` setting in `repmgr.conf`.
@@ -1743,6 +1743,21 @@ which contains connection details for the local database.
 the current working directory; no additional arguments are required.
+### Further documentation
+
+As well as this README, the `repmgr` source contains the following additional
+documentation files:
+
+* FAQ.md - frequently asked questions
+* CONTRIBUTING.md - how to contribute to `repmgr`
+* PACKAGES.md - details on building packages
+* SSH-RSYNC.md - how to set up passwordless SSH between nodes
+* docs/repmgrd-failover-mechanism.md - how repmgrd picks which node to promote
+* docs/repmgrd-node-fencing.md - how to "fence" a failed master node
+
+
 ### Error codes
 
 `repmgr` or `repmgrd` will return one of the following error codes on program

diff --git a/docs/repmgrd-failover-mechanism.md b/docs/repmgrd-failover-mechanism.md
new file mode 100644
index 00000000..b6d013ae
--- /dev/null
+++ b/docs/repmgrd-failover-mechanism.md
@@ -0,0 +1,75 @@

repmgrd's failover algorithm
============================

When implementing automatic failover, there are two factors which are critical
in ensuring the desired result is achieved:

 - has the master node genuinely failed?
 - which is the best node to promote as the new master?

This document outlines repmgrd's decision-making process during automatic failover
for standbys directly connected to the master node.


Master node failure detection
-----------------------------

If a `repmgrd` instance running on a PostgreSQL standby node is unable to connect to
the master node, this doesn't necessarily mean that the master is down and a
failover is required. Factors such as network connectivity issues could mean that
even though the standby node is isolated, the replication cluster as a whole
is functioning correctly, and promoting the standby without further verification
could result in a "split-brain" situation.
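The verification repmgrd performs can be pictured as a simple retry loop. The following is an illustrative sketch only, not repmgrd's actual implementation: `check_master` is a hypothetical stand-in for a real connection attempt (e.g. `pg_isready -q -h <master-host>`), and the two variables mirror the `reconnect_attempts` and `reconnect_interval` settings in `repmgr.conf` (set here to small values for demonstration purposes):

```shell
#!/bin/sh
# Illustrative sketch only -- not repmgrd's real code.
# These variables mirror the corresponding repmgr.conf settings;
# small values are used here purely for demonstration.
RECONNECT_ATTEMPTS=3
RECONNECT_INTERVAL=1

# Hypothetical stand-in for a real connection check, e.g.:
#   pg_isready -q -h "$MASTER_HOST"
check_master() {
    return 1    # simulate an unreachable master
}

attempt=1
while [ "$attempt" -le "$RECONNECT_ATTEMPTS" ]; do
    if check_master; then
        # master came back: no failover required
        echo "master reachable again; resuming monitoring"
        exit 0
    fi
    echo "connection attempt $attempt/$RECONNECT_ATTEMPTS failed"
    sleep "$RECONNECT_INTERVAL"
    attempt=$((attempt + 1))
done

# all attempts exhausted: proceed to promotion candidate selection
echo "master considered definitively unreachable"
```

The trade-off discussed below amounts to tuning these two values for your environment.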
In the event that `repmgrd` is unable to connect to the master node, it will attempt
to reconnect to the master server several times (as defined by the `reconnect_attempts`
parameter in `repmgr.conf`), with reconnection attempts occurring at the interval
specified by `reconnect_interval`. This is done to verify that the master is definitively
not accessible (e.g. that the connection was not lost due to a brief network glitch).

Appropriate values for these settings will depend very much on the replication
cluster environment. There will necessarily be a trade-off between the time it
takes to conclude that the master is not reachable and the reliability of that
conclusion. A standby in a different physical location to the master will probably
need a longer check interval to rule out possible network issues, whereas one
located in the same rack with a direct connection between servers could perform
the check very quickly.

Note that it's possible the master comes back online after this point is reached,
but before a new master has been selected; in this case this will be noticed
during the selection of a new master and no actual failover will take place.

Promotion candidate selection
-----------------------------

Once `repmgrd` has decided the master is definitively unreachable, the following
checks will be carried out:

* `repmgrd` attempts to connect to all other nodes in the cluster (including the
  witness node, if defined) to establish the state of the cluster, including each
  node's current LSN

* if fewer than half of the nodes are visible (from the viewpoint of this node),
  `repmgrd` will not take any further action. This is to ensure that, for example,
  if a replication cluster is spread over multiple data centres, a split-brain
  situation does not occur in the event of a network failure between data centres.
  Note that if nodes are split evenly between data centres, a witness server can
  be used to establish the "majority" data centre.
* `repmgrd` polls all visible servers and waits for each node to return a valid
  LSN; it updates the LSN previously stored for each node if it has increased
  since the initial check

* once all LSNs have been retrieved, `repmgrd` will check for the highest LSN; if
  its own node has the highest LSN, it will attempt to promote itself (using the
  command defined in `promote_command` in `repmgr.conf`). Note that if
  `repmgr standby promote` is used as the promotion command and the original
  master becomes available before the promotion takes effect, `repmgr` will
  return an error, no promotion will take place, and `repmgrd` will resume
  monitoring as usual.

* if the node is not the promotion candidate, `repmgrd` will execute the
  `follow_command` defined in `repmgr.conf`. If `repmgr standby follow` is used
  here, `repmgr` will attempt to detect the new master node and attach itself
  to it.


diff --git a/docs/repmgrd-node-fencing.md b/docs/repmgrd-node-fencing.md
new file mode 100644
index 00000000..ecfd13c4
--- /dev/null
+++ b/docs/repmgrd-node-fencing.md
@@ -0,0 +1,150 @@

Fencing a failed master node with repmgrd and pgbouncer
=======================================================

With automatic failover, it's essential to ensure that a failed master
remains inaccessible to your application, even if it comes back online
again, in order to avoid a split-brain situation.

By using `pgbouncer` together with `repmgrd`, it's possible to combine
automatic failover with a process to isolate the failed master from
your application and ensure that all connections which should go to
the master are directed there smoothly, without having to reconfigure
your application. (Note that as a connection pooler, `pgbouncer` can
benefit your application in other ways, but those are beyond the scope
of this document.)

* * *

> *WARNING*: automatic failover is tricky to get right.
> This document
> demonstrates one possible implementation method; however, you should
> carefully configure and test any setup to suit the needs of your own
> replication cluster/application.

* * *

In a failover situation, `repmgrd` promotes a standby to master by
executing the command defined in `promote_command`. Normally this
would be something like:

    repmgr standby promote -f /etc/repmgr.conf

By wrapping this in a custom script which adjusts the `pgbouncer`
configuration on all nodes, it's possible to fence the failed master
and redirect write connections to the new master.

The script consists of three sections:

* commands to pause `pgbouncer` on all nodes
* the promotion command itself
* commands to reconfigure and restart `pgbouncer` on all nodes

Note that the script requires passwordless SSH access between all nodes
in order to update the `pgbouncer` configuration files.

For the purposes of this demonstration, we'll assume there are 3 nodes
(a master and two standbys), with `pgbouncer` listening on port 6432 and
handling connections to a database called `appdb`. The `postgres`
system user must have write access to the `pgbouncer` configuration
file on all nodes, assumed to be at `/etc/pgbouncer.ini`.
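Before installing the script, it's worth confirming that these prerequisites actually hold. The following sketch (using the hypothetical demonstration hosts `node1`-`node3` and the configuration file path assumed above) checks passwordless SSH access and write permission on the `pgbouncer` configuration file from the node where it is run:

```shell
#!/bin/sh
# Sketch: verify passwordless SSH and pgbouncer config write access on
# each node before installing the promotion script. Host names are the
# assumed demonstration hosts from above.
PGBOUNCER_HOSTS="node1 node2 node3"
PGBOUNCER_CONFIG="/etc/pgbouncer.ini"

failures=0
for HOST in $PGBOUNCER_HOSTS
do
    # BatchMode=yes makes ssh fail immediately rather than prompt
    # for a password, which is exactly what we want to detect
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$HOST" \
        "test -w $PGBOUNCER_CONFIG" 2>/dev/null
    then
        echo "$HOST: OK"
    else
        echo "$HOST: FAILED (check SSH keys and file permissions)"
        failures=$((failures + 1))
    fi
done

echo "$failures node(s) failed the prerequisite check"
```

Any node reported as FAILED would cause the promotion script to abort partway through, leaving `pgbouncer` paused on some nodes, so this is worth running whenever keys or permissions change.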
The script also requires a template file containing global `pgbouncer`
configuration, which should look something like this (adjust the
settings appropriately for your environment):

`/var/lib/postgres/repmgr/pgbouncer.ini.template`

    [pgbouncer]

    logfile = /var/log/pgbouncer/pgbouncer.log
    pidfile = /var/run/pgbouncer/pgbouncer.pid

    listen_addr = *
    listen_port = 6432
    unix_socket_dir = /tmp

    auth_type = trust
    auth_file = /etc/pgbouncer.auth

    admin_users = postgres
    stats_users = postgres

    pool_mode = transaction

    max_client_conn = 100
    default_pool_size = 20
    min_pool_size = 5
    reserve_pool_size = 5
    reserve_pool_timeout = 3

    log_connections = 1
    log_disconnections = 1
    log_pooler_errors = 1

The actual script is as follows; adjust the configurable items as appropriate:

`/var/lib/postgres/repmgr/promote.sh`

    #!/usr/bin/env bash
    set -u
    set -e

    # Configurable items
    PGBOUNCER_HOSTS="node1 node2 node3"
    PGBOUNCER_PORT=6432
    REPMGR_DB="repmgr"
    REPMGR_USER="repmgr"
    REPMGR_SCHEMA="repmgr_test"
    PGBOUNCER_CONFIG="/etc/pgbouncer.ini"
    PGBOUNCER_INI_TEMPLATE="/var/lib/postgres/repmgr/pgbouncer.ini.template"
    PGBOUNCER_DATABASE="appdb"

    # 1. Pause running pgbouncer instances
    for HOST in $PGBOUNCER_HOSTS
    do
        psql -t -c "pause" -h $HOST -p $PGBOUNCER_PORT -U postgres pgbouncer
    done

    # 2. Promote this node from standby to master
    repmgr standby promote -f /etc/repmgr.conf

    # 3. Reconfigure pgbouncer instances
    PGBOUNCER_INI_NEW="/tmp/pgbouncer.ini.new"

    for HOST in $PGBOUNCER_HOSTS
    do
        # Recreate the pgbouncer config file
        echo -e "[databases]\n" > $PGBOUNCER_INI_NEW

        psql -d $REPMGR_DB -U $REPMGR_USER -t -A \
          -c "SELECT '$PGBOUNCER_DATABASE= ' || conninfo || ' application_name=pgbouncer_$HOST' \
              FROM $REPMGR_SCHEMA.repl_nodes \
              WHERE active = TRUE AND type='master'" >> $PGBOUNCER_INI_NEW

        cat $PGBOUNCER_INI_TEMPLATE >> $PGBOUNCER_INI_NEW

        rsync $PGBOUNCER_INI_NEW $HOST:$PGBOUNCER_CONFIG

        psql -t -c "reload" -h $HOST -p $PGBOUNCER_PORT -U postgres pgbouncer
        psql -t -c "resume" -h $HOST -p $PGBOUNCER_PORT -U postgres pgbouncer
    done

    # Clean up the generated file
    rm $PGBOUNCER_INI_NEW

    echo "Reconfiguration of pgbouncer complete"

The script and template file should be installed on each node where
`repmgrd` is running.

Finally, set `promote_command` in `repmgr.conf` on each node to
point to the custom promotion script:

    promote_command=/var/lib/postgres/repmgr/promote.sh

and reload/restart any running `repmgrd` instances for the changes to take
effect.
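For reference, the following sketch shows roughly what step 3 of the script writes before the template is appended. The `CONNINFO` value is a hypothetical example standing in for the `conninfo` column returned from the `repl_nodes` table:

```shell
#!/bin/sh
# Sketch of the "[databases]" section written by step 3 of promote.sh.
# CONNINFO is a hypothetical example value; in the real script it comes
# from the repl_nodes table on the newly promoted master.
PGBOUNCER_DATABASE="appdb"
HOST="node2"
CONNINFO="host=node1 dbname=repmgr user=repmgr"

# This mirrors the string the script's SELECT statement assembles
DB_ENTRY="$PGBOUNCER_DATABASE= $CONNINFO application_name=pgbouncer_$HOST"

printf '[databases]\n\n%s\n' "$DB_ENTRY"
```

Because the `[databases]` entry always points at the row marked `active = TRUE AND type='master'`, the application keeps connecting to `appdb` via `pgbouncer` on port 6432 and is transparently routed to whichever node is currently master.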