Add documentation for repmgrd failover process and failed node fencing

Addresses GitHub #200.
2026-07-16 14:29:05 +00:00 · 2016-10-05 11:25:36 +09:00
parent eb90f864c9
commit 2fae788bc4
3 changed files with 243 additions and 3 deletions
@@ -580,13 +580,13 @@ base backups and WAL files.
 Barman support provides the following advantages:
- the primary node does not need to perform a new backup every time a
+- the master node does not need to perform a new backup every time a
  new standby is cloned;
 - a standby node can be disconnected for longer periods without losing
  the ability to catch up, and without causing accumulation of WAL
-  files on the primary node;
+  files on the master node;
 - therefore, `repmgr` does not need to use replication slots, and the
-  primary node does not need to set `wal_keep_segments`.
+  master node does not need to set `wal_keep_segments`.
 > *NOTE*: In view of the above, Barman support is incompatible with
 > the `use_replication_slots` setting in `repmgr.conf`.
@@ -1743,6 +1743,21 @@ which contains connection details for the local database.
    the current working directory; no additional arguments are required.
 ### Further documentation
 As well as this README, the `repmgr` source contains the following additional
 documentation files:
 * FAQ.md - frequently asked questions
 * CONTRIBUTING.md - how to contribute to `repmgr`
 * PACKAGES.md - details on building packages
 * SSH-RSYNC.md - how to set up passwordless SSH between nodes
 * docs/repmgrd-failover-mechanism.md - how repmgrd picks which node to promote
 * docs/repmgrd-node-fencing.md - how to "fence" a failed master node
 ### Error codes
 `repmgr` or `repmgrd` will return one of the following error codes on program
@@ -0,0 +1,75 @@
 repmgrd's failover algorithm
 ============================
 When implementing automatic failover, there are two factors which are critical in
 ensuring the desired result is achieved:
  - has the master node genuinely failed?
  - which is the best node to promote to the new master?
 This document outlines repmgrd's decision-making process during automatic failover
 for standbys directly connected to the master node.
 Master node failure detection
 -----------------------------
 If a `repmgrd` instance running on a PostgreSQL standby node is unable to connect to
 the master node, this doesn't necessarily mean that the master is down and a
 failover is required. Factors such as network connectivity issues could mean that
 even though the standby node is isolated, the replication cluster as a whole
 is functioning correctly, and promoting the standby without further verification
 could result in a "split-brain" situation.
 In the event that `repmgrd` is unable to connect to the master node, it will attempt
 to reconnect to the master server several times (as defined by the `reconnect_attempts`
 parameter in `repmgr.conf`), with reconnection attempts occurring at the interval
 specified by `reconnect_interval`. This happens to verify that the master is definitively
 not accessible (e.g. that connection was not lost due to a brief network glitch).
 Appropriate values for these settings will depend very much on the replication
 cluster environment. There will necessarily be a trade-off between the time it
 takes to assume the master is not reachable, and the reliability of that conclusion.
 A standby in a different physical location to the master will probably need a longer
 check interval to rule out possible network issues, whereas one located in the same
 rack with a direct connection between servers could perform the check very quickly.
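To make the trade-off concrete, the following minimal sketch estimates how long a standby waits before concluding the master is unreachable. The helper name `detection_window` and the sample values are illustrative assumptions, and the estimate ignores the time spent inside each individual connection attempt.

```shell
# Hypothetical helper: approximate number of seconds before repmgrd gives up
# on the master. Assumes one full reconnect_interval wait per failed attempt
# and ignores the duration of each connection attempt itself.
detection_window() {
    attempts=$1   # value of reconnect_attempts
    interval=$2   # value of reconnect_interval, in seconds
    echo $(( attempts * interval ))
}

# Example values only; choose settings appropriate to your environment.
detection_window 6 10     # prints 60
detection_window 10 30    # prints 300
```

A larger product makes a false positive less likely, but delays the start of a genuine failover by roughly that many seconds.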
 Note that it's possible the master comes back online after this point is reached,
 but before a new master has been selected; in this case it will be noticed
 during the selection of a new master and no actual failover will take place.
 Promotion candidate selection
 -----------------------------
 Once `repmgrd` has decided the master is definitively unreachable, the following checks
 will be carried out:
 * `repmgrd` attempts to connect to all other nodes in the cluster (including the witness
  node, if defined) to establish the state of the cluster, including their
  current LSN
 * If fewer than half of the nodes are visible (from the viewpoint
  of this node), `repmgrd` will not take any further action. This is to ensure that
  e.g. if a replication cluster is spread over multiple data centres, a split-brain
  situation does not occur if there is a network failure between data centres. Note
  that if nodes are split evenly between data centres, a witness server can be
  used to establish the "majority" data centre.
 * `repmgrd` polls all visible servers and waits for each node to return a valid LSN;
  it updates the LSN previously stored for this node if it has increased since
  the initial check
 * once all LSNs have been retrieved, `repmgrd` will check for the highest LSN; if
  its own node has the highest LSN, it will attempt to promote itself (using the
  command defined in `promote_command` in `repmgr.conf`). Note that if using
  `repmgr standby promote` as the promotion command, and the original master becomes available
  before the promotion takes effect, `repmgr` will return an error and no promotion
  will take place, and `repmgrd` will resume monitoring as usual.
 * if the node is not the promotion candidate, `repmgrd` will execute the
  `follow_command` defined in `repmgr.conf`. If using `repmgr standby follow` here,
  `repmgr` will attempt to detect the new master node and attach to that.
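The visibility check and the highest-LSN comparison described in the list above can be sketched in shell as follows. This is an illustration of the logic only, not `repmgrd`'s actual code: the function names are hypothetical, the node counts are assumed to include the local node, and the handling of ties is an arbitrary choice that this document does not specify.

```shell
# Convert a PostgreSQL LSN such as "0/3000060" (two hexadecimal halves
# separated by a slash) into a single integer for comparison.
lsn_to_int() {
    hi=${1%/*}
    lo=${1#*/}
    echo $(( (0x$hi << 32) + 0x$lo ))
}

# Succeeds unless fewer than half of the nodes are visible.
# Assumption: both counts include the local node.
enough_nodes_visible() {
    visible=$1
    total=$2
    [ $(( visible * 2 )) -ge "$total" ]
}

# Read "node_name lsn" pairs on stdin and print the node with the highest LSN.
# On a tie the first node listed wins (an arbitrary choice in this sketch).
best_candidate() {
    best_node=""
    best_lsn=-1
    while read -r node lsn; do
        value=$(lsn_to_int "$lsn")
        if [ "$value" -gt "$best_lsn" ]; then
            best_node=$node
            best_lsn=$value
        fi
    done
    echo "$best_node"
}

# Example: two of three nodes are visible, node3 is further ahead.
if enough_nodes_visible 2 3; then
    printf '%s\n' 'node2 0/3000060' 'node3 0/3000128' | best_candidate   # prints node3
fi
```

In this example node3 would run its `promote_command`, while node2 would run its `follow_command`.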
@@ -0,0 +1,150 @@
 Fencing a failed master node with repmgrd and pgbouncer
 =======================================================
 With automatic failover, it's essential to ensure that a failed master
 remains inaccessible to your application, even if it comes back online
 again, to avoid a split-brain situation.
 By using `pgbouncer` together with `repmgrd`, it's possible to combine
 automatic failover with a process to isolate the failed master from
 your application and ensure that all connections which should go to
 the master are directed there smoothly without having to reconfigure
 your application. (Note that as a connection pooler, `pgbouncer` can
 benefit your application in other ways, but those are beyond the scope
 of this document).
 * * *
 > *WARNING*: automatic failover is tricky to get right. This document
 > demonstrates one possible implementation method; however, you should
 > carefully configure and test any setup to suit the needs of your own
 > replication cluster/application.
 * * *
 In a failover situation, `repmgrd` promotes a standby to master by
 executing the command defined in `promote_command`. Normally this
 would be something like:
    repmgr standby promote -f /etc/repmgr.conf
 By wrapping this in a custom script which adjusts the `pgbouncer`
 configuration on all nodes, it's possible to fence the failed master
 and redirect write connections to the new master.
 The script consists of three sections:
 * commands to pause `pgbouncer` on all nodes
 * the promotion command itself
 * commands to reconfigure and restart `pgbouncer` on all nodes
 Note that it requires password-less SSH access between all nodes to be
 able to update the `pgbouncer` configuration files.
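Because the script depends on non-interactive SSH, it may be worth verifying this before relying on it. The following is a minimal sketch of such a pre-flight check; the function name and the `SSH_BIN` override are assumptions introduced here for illustration, while `BatchMode` and `ConnectTimeout` are standard OpenSSH client options.

```shell
# Hypothetical pre-flight check: succeeds only if every listed host accepts
# a non-interactive SSH login. SSH_BIN can be overridden, e.g. for testing.
check_passwordless_ssh() {
    status=0
    for host in "$@"; do
        # BatchMode=yes makes ssh fail instead of prompting for a password.
        if ! "${SSH_BIN:-ssh}" -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
            echo "password-less SSH to $host failed" >&2
            status=1
        fi
    done
    return "$status"
}
```

Run as the `postgres` system user on each node, for example `check_passwordless_ssh node1 node2 node3`; a non-zero exit status indicates that at least one host could not be reached without a password.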
 For the purposes of this demonstration, we'll assume there are 3 nodes
 (master and two standbys), with `pgbouncer` listening on port 6432
 handling connections to a database called `appdb`. The `postgres`
 system user must have write access to the `pgbouncer` configuration
 file on all nodes, assumed to be at `/etc/pgbouncer.ini`.
 The script also requires a template file containing global `pgbouncer`
 configuration, which should look something like this (adjust
 settings appropriately for your environment):
 `/var/lib/postgres/repmgr/pgbouncer.ini.template`
    [pgbouncer]
    logfile = /var/log/pgbouncer/pgbouncer.log
    pidfile = /var/run/pgbouncer/pgbouncer.pid
    listen_addr = *
    listen_port = 6432
    unix_socket_dir = /tmp
    auth_type = trust
    auth_file = /etc/pgbouncer.auth
    admin_users = postgres
    stats_users = postgres
    pool_mode = transaction
    max_client_conn = 100
    default_pool_size = 20
    min_pool_size = 5
    reserve_pool_size = 5
    reserve_pool_timeout = 3
    log_connections = 1
    log_disconnections = 1
    log_pooler_errors = 1
 The actual script is as follows; adjust the configurable items as appropriate:
 `/var/lib/postgres/repmgr/promote.sh`
    #!/usr/bin/env bash
    set -u
    set -e
    # Configurable items
    PGBOUNCER_HOSTS="node1 node2 node3"
    REPMGR_DB="repmgr"
    REPMGR_USER="repmgr"
    REPMGR_SCHEMA="repmgr_test"
    PGBOUNCER_CONFIG="/etc/pgbouncer.ini"
    PGBOUNCER_INI_TEMPLATE="/var/lib/postgres/repmgr/pgbouncer.ini.template"
    PGBOUNCER_DATABASE="appdb"
    PGBOUNCER_PORT=6432
    # 1. Pause running pgbouncer instances
    for HOST in $PGBOUNCER_HOSTS
    do
        psql -t -c "pause" -h $HOST -p $PGBOUNCER_PORT -U postgres pgbouncer
    done
    # 2. Promote this node from standby to master
    repmgr standby promote -f /etc/repmgr.conf
    # 3. Reconfigure pgbouncer instances
    PGBOUNCER_INI_NEW="/tmp/pgbouncer.ini.new"
    for HOST in $PGBOUNCER_HOSTS
    do
        # Recreate the pgbouncer config file
        echo -e "[databases]\n" > $PGBOUNCER_INI_NEW
        psql -d $REPMGR_DB -U $REPMGR_USER -t -A \
          -c "SELECT '$PGBOUNCER_DATABASE= ' || conninfo || ' application_name=pgbouncer_$HOST' \
              FROM $REPMGR_SCHEMA.repl_nodes \
              WHERE active = TRUE AND type='master'" >> $PGBOUNCER_INI_NEW
        cat $PGBOUNCER_INI_TEMPLATE >> $PGBOUNCER_INI_NEW
        rsync $PGBOUNCER_INI_NEW $HOST:$PGBOUNCER_CONFIG
        # Reload the new configuration and resume client connections
        psql -tc "reload" -h $HOST -p $PGBOUNCER_PORT -U postgres pgbouncer
        psql -tc "resume" -h $HOST -p $PGBOUNCER_PORT -U postgres pgbouncer
    done
    # Clean up generated file
    rm $PGBOUNCER_INI_NEW
    echo "Reconfiguration of pgbouncer complete"
 The script and template file should be installed on each node where
 `repmgrd` is running.
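To make step 3 of the promote script more concrete, the sketch below assembles the same file layout from a literal conninfo string instead of querying the `repmgr` metadata. The function name and the sample conninfo values are hypothetical; in the real script the conninfo comes from the `repl_nodes` table.

```shell
# Assemble a pgbouncer configuration on stdout, mirroring step 3 of the
# promote script but without querying the database.
build_pgbouncer_ini() {
    ini_database=$1    # pgbouncer database name, e.g. appdb
    ini_conninfo=$2    # conninfo of the new master (hypothetical value here)
    ini_host=$3        # host the generated file is destined for
    ini_template=$4    # path to the pgbouncer.ini template
    printf '[databases]\n\n'
    printf '%s= %s application_name=pgbouncer_%s\n' \
        "$ini_database" "$ini_conninfo" "$ini_host"
    cat "$ini_template"
}

# Demonstration with a throwaway two-line template.
demo_template=$(mktemp)
printf '[pgbouncer]\nlisten_port = 6432\n' > "$demo_template"
build_pgbouncer_ini appdb 'host=node2 dbname=repmgr user=repmgr' node1 "$demo_template"
rm -f "$demo_template"
```

The generated file starts with a `[databases]` section that points `appdb` at the new master, followed by the unchanged global settings from the template.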
 Finally, set `promote_command` in `repmgr.conf` on each node to
 point to the custom promote script:
    promote_command=/var/lib/postgres/repmgr/promote.sh
 and reload/restart any running `repmgrd` instances for the changes to take
 effect.
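As a quick sanity check, the configured value can be read back from each node's configuration file. The helper below is a minimal sketch with a hypothetical name; it assumes a simple `key=value` line, optionally wrapped in single quotes, and does not attempt to parse the file the way `repmgr` itself does.

```shell
# Print the value of promote_command from a repmgr.conf-style file.
# Assumes a plain "key=value" line; surrounding single quotes are stripped.
# If the key appears more than once, the last occurrence is printed.
configured_promote_command() {
    sed -n "s/^[[:space:]]*promote_command[[:space:]]*=[[:space:]]*//p" "$1" |
        sed "s/^'\(.*\)'\$/\1/" |
        tail -n 1
}
```

For example, `configured_promote_command /etc/repmgr.conf` should print the path of the custom promote script on every node where `repmgrd` runs.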