From 5945accd8401c88e506c73bd404f44c69524c7b2 Mon Sep 17 00:00:00 2001
From: Ian Barwick
Date: Wed, 5 Oct 2016 11:25:36 +0900
Subject: [PATCH] Add documentation for repmgrd failover process and failed
 node fencing

Addresses GitHub #200.
---
 README.md                          | 21 +++-
 docs/repmgrd-failover-mechanism.md | 75 +++++++++++++++
 docs/repmgrd-node-fencing.md       | 150 +++++++++++++++++++++++++++++
 3 files changed, 243 insertions(+), 3 deletions(-)
 create mode 100644 docs/repmgrd-failover-mechanism.md
 create mode 100644 docs/repmgrd-node-fencing.md

diff --git a/README.md b/README.md
index 31225880..8c236c8a 100644
--- a/README.md
+++ b/README.md
@@ -580,13 +580,13 @@ base backups and WAL files.
 
 Barman support provides the following advantages:
 
-- the primary node does not need to perform a new backup every time a
+- the master node does not need to perform a new backup every time a
   new standby is cloned;
 
 - a standby node can be disconnected for longer periods without losing
   the ability to catch up, and without causing accumulation of WAL
-  files on the primary node;
+  files on the master node;
 
 - therefore, `repmgr` does not need to use replication slots, and the
-  primary node does not need to set `wal_keep_segments`.
+  master node does not need to set `wal_keep_segments`.
 
 > *NOTE*: In view of the above, Barman support is incompatible with
 > the `use_replication_slots` setting in `repmgr.conf`.
@@ -1743,6 +1743,21 @@ which contains connection details for the local database.
 the current working directory; no additional arguments are required.
+### Further documentation
+
+As well as this README, the `repmgr` source contains the following additional
+documentation files:
+
+* FAQ.md - frequently asked questions
+* CONTRIBUTING.md - how to contribute to `repmgr`
+* PACKAGES.md - details on building packages
+* SSH-RSYNC.md - how to set up passwordless SSH between nodes
+* docs/repmgrd-failover-mechanism.md - how repmgrd picks which node to promote
+* docs/repmgrd-node-fencing.md - how to "fence" a failed master node
+
+
 ### Error codes
 
 `repmgr` or `repmgrd` will return one of the following error codes on program

diff --git a/docs/repmgrd-failover-mechanism.md b/docs/repmgrd-failover-mechanism.md
new file mode 100644
index 00000000..b6d013ae
--- /dev/null
+++ b/docs/repmgrd-failover-mechanism.md
@@ -0,0 +1,75 @@

repmgrd's failover algorithm
============================

When implementing automatic failover, there are two factors which are critical
in ensuring the desired result is achieved:

 - has the master node genuinely failed?
 - which is the best node to promote as the new master?

This document outlines repmgrd's decision-making process during automatic failover
for standbys directly connected to the master node.


Master node failure detection
-----------------------------

If a `repmgrd` instance running on a PostgreSQL standby node is unable to connect to
the master node, this doesn't necessarily mean that the master is down and a
failover is required. Factors such as network connectivity issues could mean that
even though the standby node is isolated, the replication cluster as a whole
is functioning correctly, and promoting the standby without further verification
could result in a "split-brain" situation.
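The verification repmgrd performs can be pictured as a simple retry loop. The following is an illustrative sketch only, not repmgrd's actual implementation: `check_master` is a hypothetical stand-in for a real connection attempt (e.g. `pg_isready -q -h <master-host>`), and the two variables mirror the `reconnect_attempts` and `reconnect_interval` settings in `repmgr.conf` (set here to small values for demonstration purposes):

```shell
#!/bin/sh
# Illustrative sketch only -- not repmgrd's real code.
# These variables mirror the corresponding repmgr.conf settings;
# small values are used here purely for demonstration.
RECONNECT_ATTEMPTS=3
RECONNECT_INTERVAL=1

# Hypothetical stand-in for a real connection check, e.g.:
#   pg_isready -q -h "$MASTER_HOST"
check_master() {
    return 1    # simulate an unreachable master
}

attempt=1
while [ "$attempt" -le "$RECONNECT_ATTEMPTS" ]; do
    if check_master; then
        # master came back: no failover required
        echo "master reachable again; resuming monitoring"
        exit 0
    fi
    echo "connection attempt $attempt/$RECONNECT_ATTEMPTS failed"
    sleep "$RECONNECT_INTERVAL"
    attempt=$((attempt + 1))
done

# all attempts exhausted: proceed to promotion candidate selection
echo "master considered definitively unreachable"
```

The trade-off discussed below amounts to tuning these two values for your environment.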
In the event that `repmgrd` is unable to connect to the master node, it will attempt
to reconnect to the master server several times (as defined by the `reconnect_attempts`
parameter in `repmgr.conf`), with reconnection attempts occurring at the interval
specified by `reconnect_interval`. This is done to verify that the master is definitively
not accessible (e.g. that the connection was not lost due to a brief network glitch).

Appropriate values for these settings will depend very much on the replication
cluster environment. There will necessarily be a trade-off between the time it
takes to conclude that the master is not reachable and the reliability of that
conclusion. A standby in a different physical location to the master will probably
need a longer check interval to rule out possible network issues, whereas one
located in the same rack with a direct connection between servers could perform
the check very quickly.

Note that it's possible the master comes back online after this point is reached,
but before a new master has been selected; in this case this will be noticed
during the selection of a new master and no actual failover will take place.

Promotion candidate selection
-----------------------------

Once `repmgrd` has decided the master is definitively unreachable, the following
checks will be carried out:

* `repmgrd` attempts to connect to all other nodes in the cluster (including the
  witness node, if defined) to establish the state of the cluster, including each
  node's current LSN

* if fewer than half of the nodes are visible (from the viewpoint of this node),
  `repmgrd` will not take any further action. This is to ensure that, for example,
  if a replication cluster is spread over multiple data centres, a split-brain
  situation does not occur in the event of a network failure between data centres.
  Note that if nodes are split evenly between data centres, a witness server can
  be used to establish the "majority" data centre.
* `repmgrd` polls all visible servers and waits for each node to return a valid
  LSN; it updates the LSN previously stored for each node if it has increased
  since the initial check

* once all LSNs have been retrieved, `repmgrd` will check for the highest LSN; if
  its own node has the highest LSN, it will attempt to promote itself (using the
  command defined in `promote_command` in `repmgr.conf`). Note that if
  `repmgr standby promote` is used as the promotion command and the original
  master becomes available before the promotion takes effect, `repmgr` will
  return an error, no promotion will take place, and `repmgrd` will resume
  monitoring as usual.

* if the node is not the promotion candidate, `repmgrd` will execute the
  `follow_command` defined in `repmgr.conf`. If `repmgr standby follow` is used
  here, `repmgr` will attempt to detect the new master node and attach itself
  to it.


diff --git a/docs/repmgrd-node-fencing.md b/docs/repmgrd-node-fencing.md
new file mode 100644
index 00000000..ecfd13c4
--- /dev/null
+++ b/docs/repmgrd-node-fencing.md
@@ -0,0 +1,150 @@

Fencing a failed master node with repmgrd and pgbouncer
=======================================================

With automatic failover, it's essential to ensure that a failed master
remains inaccessible to your application, even if it comes back online
again, in order to avoid a split-brain situation.

By using `pgbouncer` together with `repmgrd`, it's possible to combine
automatic failover with a process to isolate the failed master from
your application and ensure that all connections which should go to
the master are directed there smoothly, without having to reconfigure
your application. (Note that as a connection pooler, `pgbouncer` can
benefit your application in other ways, but those are beyond the scope
of this document.)

* * *

> *WARNING*: automatic failover is tricky to get right.
> This document
> demonstrates one possible implementation method; however, you should
> carefully configure and test any setup to suit the needs of your own
> replication cluster/application.

* * *

In a failover situation, `repmgrd` promotes a standby to master by
executing the command defined in `promote_command`. Normally this
would be something like:

    repmgr standby promote -f /etc/repmgr.conf

By wrapping this in a custom script which adjusts the `pgbouncer`
configuration on all nodes, it's possible to fence the failed master
and redirect write connections to the new master.

The script consists of three sections:

* commands to pause `pgbouncer` on all nodes
* the promotion command itself
* commands to reconfigure and restart `pgbouncer` on all nodes

Note that the script requires passwordless SSH access between all nodes
in order to update the `pgbouncer` configuration files.

For the purposes of this demonstration, we'll assume there are 3 nodes
(a master and two standbys), with `pgbouncer` listening on port 6432 and
handling connections to a database called `appdb`. The `postgres`
system user must have write access to the `pgbouncer` configuration
file on all nodes, assumed to be at `/etc/pgbouncer.ini`.
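Before installing the script, it's worth confirming that these prerequisites actually hold. The following sketch (using the hypothetical demonstration hosts `node1`-`node3` and the configuration file path assumed above) checks passwordless SSH access and write permission on the `pgbouncer` configuration file from the node where it is run:

```shell
#!/bin/sh
# Sketch: verify passwordless SSH and pgbouncer config write access on
# each node before installing the promotion script. Host names are the
# assumed demonstration hosts from above.
PGBOUNCER_HOSTS="node1 node2 node3"
PGBOUNCER_CONFIG="/etc/pgbouncer.ini"

failures=0
for HOST in $PGBOUNCER_HOSTS
do
    # BatchMode=yes makes ssh fail immediately rather than prompt
    # for a password, which is exactly what we want to detect
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$HOST" \
        "test -w $PGBOUNCER_CONFIG" 2>/dev/null
    then
        echo "$HOST: OK"
    else
        echo "$HOST: FAILED (check SSH keys and file permissions)"
        failures=$((failures + 1))
    fi
done

echo "$failures node(s) failed the prerequisite check"
```

Any node reported as FAILED would cause the promotion script to abort partway through, leaving `pgbouncer` paused on some nodes, so this is worth running whenever keys or permissions change.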
The script also requires a template file containing global `pgbouncer`
configuration, which should look something like this (adjust the
settings appropriately for your environment):

`/var/lib/postgres/repmgr/pgbouncer.ini.template`

    [pgbouncer]

    logfile = /var/log/pgbouncer/pgbouncer.log
    pidfile = /var/run/pgbouncer/pgbouncer.pid

    listen_addr = *
    listen_port = 6432
    unix_socket_dir = /tmp

    auth_type = trust
    auth_file = /etc/pgbouncer.auth

    admin_users = postgres
    stats_users = postgres

    pool_mode = transaction

    max_client_conn = 100
    default_pool_size = 20
    min_pool_size = 5
    reserve_pool_size = 5
    reserve_pool_timeout = 3

    log_connections = 1
    log_disconnections = 1
    log_pooler_errors = 1

The actual script is as follows; adjust the configurable items as appropriate:

`/var/lib/postgres/repmgr/promote.sh`

    #!/usr/bin/env bash
    set -u
    set -e

    # Configurable items
    PGBOUNCER_HOSTS="node1 node2 node3"
    PGBOUNCER_PORT=6432
    REPMGR_DB="repmgr"
    REPMGR_USER="repmgr"
    REPMGR_SCHEMA="repmgr_test"
    PGBOUNCER_CONFIG="/etc/pgbouncer.ini"
    PGBOUNCER_INI_TEMPLATE="/var/lib/postgres/repmgr/pgbouncer.ini.template"
    PGBOUNCER_DATABASE="appdb"

    # 1. Pause running pgbouncer instances
    for HOST in $PGBOUNCER_HOSTS
    do
        psql -t -c "pause" -h $HOST -p $PGBOUNCER_PORT -U postgres pgbouncer
    done

    # 2. Promote this node from standby to master
    repmgr standby promote -f /etc/repmgr.conf

    # 3. Reconfigure pgbouncer instances
    PGBOUNCER_INI_NEW="/tmp/pgbouncer.ini.new"

    for HOST in $PGBOUNCER_HOSTS
    do
        # Recreate the pgbouncer config file
        echo -e "[databases]\n" > $PGBOUNCER_INI_NEW

        psql -d $REPMGR_DB -U $REPMGR_USER -t -A \
          -c "SELECT '$PGBOUNCER_DATABASE= ' || conninfo || ' application_name=pgbouncer_$HOST' \
              FROM $REPMGR_SCHEMA.repl_nodes \
              WHERE active = TRUE AND type='master'" >> $PGBOUNCER_INI_NEW

        cat $PGBOUNCER_INI_TEMPLATE >> $PGBOUNCER_INI_NEW

        rsync $PGBOUNCER_INI_NEW $HOST:$PGBOUNCER_CONFIG

        psql -t -c "reload" -h $HOST -p $PGBOUNCER_PORT -U postgres pgbouncer
        psql -t -c "resume" -h $HOST -p $PGBOUNCER_PORT -U postgres pgbouncer
    done

    # Clean up the generated file
    rm $PGBOUNCER_INI_NEW

    echo "Reconfiguration of pgbouncer complete"

The script and template file should be installed on each node where
`repmgrd` is running.

Finally, set `promote_command` in `repmgr.conf` on each node to
point to the custom promotion script:

    promote_command=/var/lib/postgres/repmgr/promote.sh

and reload/restart any running `repmgrd` instances for the changes to take
effect.
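For reference, the following sketch shows roughly what step 3 of the script writes before the template is appended. The `CONNINFO` value is a hypothetical example standing in for the `conninfo` column returned from the `repl_nodes` table:

```shell
#!/bin/sh
# Sketch of the "[databases]" section written by step 3 of promote.sh.
# CONNINFO is a hypothetical example value; in the real script it comes
# from the repl_nodes table on the newly promoted master.
PGBOUNCER_DATABASE="appdb"
HOST="node2"
CONNINFO="host=node1 dbname=repmgr user=repmgr"

# This mirrors the string the script's SELECT statement assembles
DB_ENTRY="$PGBOUNCER_DATABASE= $CONNINFO application_name=pgbouncer_$HOST"

printf '[databases]\n\n%s\n' "$DB_ENTRY"
```

Because the `[databases]` entry always points at the row marked `active = TRUE AND type='master'`, the application keeps connecting to `appdb` via `pgbouncer` on port 6432 and is transparently routed to whichever node is currently master.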