diff --git a/doc/switchover.sgml b/doc/switchover.sgml new file mode 100644 index 00000000..4a07a4a1 --- /dev/null +++ b/doc/switchover.sgml @@ -0,0 +1,204 @@ + + Performing a switchover with repmgr + + A typical use-case for replication is a combination of primary and standby + server, with the standby serving as a backup which can easily be activated + in case of a problem with the primary. Such an unplanned failover would + normally be handled by promoting the standby, after which an appropriate + action must be taken to restore the old primary. + + + In some cases, however, it's desirable to promote the standby in a planned + way, e.g. so maintenance can be performed on the primary; this kind of switchover + is supported by the repmgr standby switchover command. + + + repmgr standby switchover differs from other &repmgr; + actions in that it also performs actions on another server (the demotion + candidate), which means passwordless SSH access is required to that server + from the one where repmgr standby switchover is executed. + + + + repmgr standby switchover performs a relatively complex + series of operations on two servers, and should therefore be performed after + careful preparation and with adequate attention. In particular you should + be confident that your network environment is stable and reliable. + + + Additionally you should be sure that the current primary can be shut down + quickly and cleanly. In particular, access from applications should be + minimized or preferably blocked completely. Also be aware that if there + is a backlog of files waiting to be archived, PostgreSQL will not shut + down until archiving completes. + + + We recommend running repmgr standby switchover at the + most verbose logging level (--log-level=DEBUG --verbose) + and capturing all output to assist in troubleshooting any problems. + + + Please also read carefully the sections Preparing for switchover and + `Caveats` below. 
+ + + + + + switchover + preparation + + Preparing for switchover + + As mentioned above, success of the switchover operation depends on &repmgr; + being able to shut down the current primary server quickly and cleanly. + + + Double-check which commands will be used to stop/start/restart the current + primary; on the primary execute: + + repmgr -f /etc/repmgr.conf node service --list --action=stop + repmgr -f /etc/repmgr.conf node service --list --action=start + repmgr -f /etc/repmgr.conf node service --list --action=restart + + + + + On systemd systems we strongly recommend using the appropriate + systemctl commands (typically run via sudo) to ensure + systemd is informed about the status of the PostgreSQL service. + + + + Check that access from applications is minimized or preferably blocked + completely, so applications are not unexpectedly interrupted. + + + Check there is no significant replication lag on standbys attached to the + current primary. + + + If WAL file archiving is set up, check that there is no backlog of files waiting + to be archived, as PostgreSQL will not finally shut down until all these have been + archived. If there is a backlog exceeding archive_ready_warning WAL files, + `repmgr` will emit a warning before attempting to perform a switchover; you can also check + manually with repmgr node check --archive-ready. + + + Ensure that repmgrd is *not* running anywhere to prevent it unintentionally + promoting a node. + + + Finally, consider executing repmgr standby switchover with the + --dry-run option; this will perform any necessary checks and inform you about + success/failure, and stop before the first actual command is run (which would be the shutdown of the + current primary). 
Example output: + + $ repmgr standby switchover -f /etc/repmgr.conf --siblings-follow --dry-run + NOTICE: checking switchover on node "node2" (ID: 2) in --dry-run mode + INFO: SSH connection to host "localhost" succeeded + INFO: archive mode is "off" + INFO: replication lag on this standby is 0 seconds + INFO: all sibling nodes are reachable via SSH + NOTICE: local node "node2" (ID: 2) will be promoted to primary; current primary "node1" (ID: 1) will be demoted to standby + INFO: following shutdown command would be run on node "node1": + "pg_ctl -l /var/log/postgresql/startup.log -D '/var/lib/postgresql/data' -m fast -W stop" + + + + + + + switchover + execution + + Executing the switchover command + + To demonstrate switchover, we will assume a replication cluster with a + primary (node1) and one standby (node2); + after the switchover node2 should become the primary with + node1 following it. + + + The switchover command must be run from the standby which is to be promoted, + and in its simplest form looks like this: + + $ repmgr -f /etc/repmgr.conf standby switchover + NOTICE: executing switchover on node "node2" (ID: 2) + INFO: searching for primary node + INFO: checking if node 1 is primary + INFO: current primary node is 1 + INFO: SSH connection to host "localhost" succeeded + INFO: archive mode is "off" + INFO: replication lag on this standby is 0 seconds + NOTICE: local node "node2" (ID: 2) will be promoted to primary; current primary "node1" (ID: 1) will be demoted to standby + NOTICE: stopping current primary node "node1" (ID: 1) + NOTICE: issuing CHECKPOINT + DETAIL: executing server command "pg_ctl -l /var/log/postgres/startup.log -D '/var/lib/pgsql/data' -m fast -W stop" + INFO: checking primary status; 1 of 6 attempts + NOTICE: current primary has been cleanly shut down at location 0/3001460 + NOTICE: promoting standby to primary + DETAIL: promoting server "node2" (ID: 2) using "pg_ctl -l /var/log/postgres/startup.log -w -D '/var/lib/pgsql/data' 
promote" + server promoting + NOTICE: STANDBY PROMOTE successful + DETAIL: server "node2" (ID: 2) was successfully promoted to primary + INFO: setting node 1's primary to node 2 + NOTICE: starting server using "pg_ctl -l /var/log/postgres/startup.log -w -D '/var/lib/pgsql/data' restart" + NOTICE: NODE REJOIN successful + DETAIL: node 1 is now attached to node 2 + NOTICE: switchover was successful + DETAIL: node "node2" is now primary + NOTICE: STANDBY SWITCHOVER is complete + + + + The old primary is now replicating as a standby from the new primary, and the + cluster status will now look like this: + + $ repmgr -f /etc/repmgr.conf cluster show + ID | Name | Role | Status | Upstream | Location | Connection string + ----+-------+---------+-----------+----------+----------+-------------------------------------- + 1 | node1 | standby | running | node2 | default | host=node1 dbname=repmgr user=repmgr + 2 | node2 | primary | * running | | default | host=node2 dbname=repmgr user=repmgr + + + + + + switchover + caveats + + Caveats + + + + + If using PostgreSQL 9.3 or 9.4, you should ensure that the shutdown command + is configured to use PostgreSQL's fast shutdown mode (the default in 9.5 + and later). If relying on pg_ctl to perform database server operations, + you should include -m fast in pg_ctl_options + in repmgr.conf. + + + + + pg_rewind *requires* that either wal_log_hints is enabled, or that + data checksums were enabled when the cluster was initialized. See the + pg_rewind documentation + for details. + + + + + repmgrd should not be running with the setting failover=automatic + in repmgr.conf when a switchover is carried out; otherwise the + repmgrd daemon may try to promote a standby by itself. + + + + + + We hope to remove some of these restrictions in future versions of `repmgr`. + + + 
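The recommendation in the preparation section to run a dry run first, at the most verbose logging level, with all output captured, can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function name `switchover_dry_run`, the variables `REPMGR`, `REPMGR_CONF` and `LOG_FILE`, and the default log file name are hypothetical and not part of repmgr. It also assumes that repmgr exits with a non-zero status when the dry run reports a problem.

```shell
#!/bin/sh
# Hypothetical helper, not part of repmgr: run the switchover in --dry-run
# mode on the standby that is to be promoted, keeping a full DEBUG log.

# All three variables are illustrative assumptions; adjust to your setup.
REPMGR="${REPMGR:-repmgr}"
REPMGR_CONF="${REPMGR_CONF:-/etc/repmgr.conf}"
LOG_FILE="${LOG_FILE:-switchover-dry-run.log}"

switchover_dry_run() {
    # Capture stdout and stderr in the log file, remembering repmgr's
    # exit status (assumed to be non-zero if a check fails).
    if "$REPMGR" -f "$REPMGR_CONF" standby switchover \
        --dry-run --log-level=DEBUG --verbose >"$LOG_FILE" 2>&1
    then
        status=0
    else
        status=$?
    fi

    # Show the captured output, then hand repmgr's status back to the caller.
    cat "$LOG_FILE"
    return "$status"
}
```

A possible use is `switchover_dry_run && echo "dry run passed"`, proceeding to the real switchover only once the captured log has been reviewed.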