From 11e5993bf5e4b62cb7033576ff80ad820e8361e2 Mon Sep 17 00:00:00 2001 From: Ian Barwick Date: Wed, 13 Mar 2019 15:28:16 +0900 Subject: [PATCH] doc: merge repmgrd notes into operation documentation --- doc/filelist.sgml | 1 - doc/repmgr.sgml | 1 - doc/repmgrd-notes.sgml | 38 ------- doc/repmgrd-operation.sgml | 216 +++++++++++++++++++++++++++++++++++++ 4 files changed, 216 insertions(+), 40 deletions(-) delete mode 100644 doc/repmgrd-notes.sgml create mode 100644 doc/repmgrd-operation.sgml diff --git a/doc/filelist.sgml b/doc/filelist.sgml index 1e240de6..d3f8b5a0 100644 --- a/doc/filelist.sgml +++ b/doc/filelist.sgml @@ -58,7 +58,6 @@ - diff --git a/doc/repmgr.sgml b/doc/repmgr.sgml index 0f2b4888..39292d98 100644 --- a/doc/repmgr.sgml +++ b/doc/repmgr.sgml @@ -88,7 +88,6 @@ &repmgrd-witness-server; &repmgrd-degraded-monitoring; &repmgrd-monitoring; - &repmgrd-notes; &repmgrd-bdr; diff --git a/doc/repmgrd-notes.sgml b/doc/repmgrd-notes.sgml deleted file mode 100644 index 31910758..00000000 --- a/doc/repmgrd-notes.sgml +++ /dev/null @@ -1,38 +0,0 @@ - - - - repmgrd - notes - - repmgrd notes - - - - repmgrd - paused WAL replay - - - repmgrd and paused WAL replay - - If WAL replay has been paused (using pg_wal_replay_pause(), - on PostgreSQL 9.6 and earlier pg_xlog_replay_pause()), - in a failover situation repmgrd will - automatically resume WAL replay. - - - This is because if WAL replay is paused, but WAL is pending replay, - PostgreSQL cannot be promoted until WAL replay is resumed. - - - - repmgr standby promote - will refuse to promote a node in this state, as the PostgreSQL - promote command will not be acted on until - WAL replay is resumed, leaving the cluster in a potentially - unstable state. In this case it is up to the user to - decide whether to resume WAL replay. 
- - - - - diff --git a/doc/repmgrd-operation.sgml b/doc/repmgrd-operation.sgml new file mode 100644 index 00000000..29a029b6 --- /dev/null +++ b/doc/repmgrd-operation.sgml @@ -0,0 +1,216 @@ + + + repmgrd + operation + + + repmgrd operation + + + + + + repmgrd + pausing + + + + pausing repmgrd + + + Pausing repmgrd + + + In normal operation, repmgrd monitors the state of the + PostgreSQL node it is running on, and will take appropriate action if problems + are detected, e.g. (if so configured) promote the node to primary, if the existing + primary has been determined as failed. + + + + However, repmgrd is unable to distinguish between + planned outages (such as performing a switchover + or installing PostgreSQL maintenance releases), and an actual server outage. In versions prior to + &repmgr; 4.2 it was necessary to stop repmgrd on all nodes (or at least + on all nodes where repmgrd is + configured for automatic failover) + to prevent repmgrd from making unintentional changes to the + replication cluster. + + + + From &repmgr; 4.2, repmgrd + can now be "paused", i.e. instructed not to take any action such as performing a failover. + This can be done from any node in the cluster, removing the need to stop/restart + each repmgrd individually. + + + + + For major PostgreSQL upgrades, e.g. from PostgreSQL 10 to PostgreSQL 11, + repmgrd should be shut down completely and only started up + once the &repmgr; packages for the new PostgreSQL major version have been installed. + + + + + Prerequisites for pausing <application>repmgrd</application> + + In order to be able to pause/unpause repmgrd, the following + prerequisites must be met: + + + + &repmgr; 4.2 or later must be installed on all nodes. + + + + The same major &repmgr; version (e.g. 4.2) must be installed on all nodes (and preferably the same minor version). 
+ + + + + PostgreSQL on all nodes must be accessible from the node where the + pause/unpause operation is executed, using the + conninfo string shown by repmgr cluster show. + + + + + + + These conditions are required for normal &repmgr; operation in any case. + + + + + + + Pausing/unpausing <application>repmgrd</application> + + To pause repmgrd, execute repmgr daemon pause, e.g.: + +$ repmgr -f /etc/repmgr.conf daemon pause +NOTICE: node 1 (node1) paused +NOTICE: node 2 (node2) paused +NOTICE: node 3 (node3) paused + + + The state of repmgrd on each node can be checked with + repmgr daemon status, e.g.: + $ repmgr -f /etc/repmgr.conf daemon status + ID | Name | Role | Status | repmgrd | PID | Paused? +----+-------+---------+---------+---------+------+--------- + 1 | node1 | primary | running | running | 7851 | yes + 2 | node2 | standby | running | running | 7889 | yes + 3 | node3 | standby | running | running | 7918 | yes + + + + + If executing a switchover with repmgr standby switchover, + &repmgr; will automatically pause/unpause repmgrd as part of the switchover process. + + + + + If the primary (in this example, node1) is stopped, repmgrd + running on one of the standbys (here: node2) will react like this: + +[2018-09-20 12:22:21] [WARNING] unable to connect to upstream node "node1" (node ID: 1) +[2018-09-20 12:22:21] [INFO] checking state of node 1, 1 of 5 attempts +[2018-09-20 12:22:21] [INFO] sleeping 1 seconds until next reconnection attempt +... 
+[2018-09-20 12:22:24] [INFO] sleeping 1 seconds until next reconnection attempt +[2018-09-20 12:22:25] [INFO] checking state of node 1, 5 of 5 attempts +[2018-09-20 12:22:25] [WARNING] unable to reconnect to node 1 after 5 attempts +[2018-09-20 12:22:25] [NOTICE] node is paused +[2018-09-20 12:22:33] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in degraded state +[2018-09-20 12:22:33] [DETAIL] repmgrd paused by administrator +[2018-09-20 12:22:33] [HINT] execute "repmgr daemon unpause" to resume normal failover mode + + + If the primary becomes available again (e.g. following a software upgrade), repmgrd + will automatically reconnect, e.g.: + +[2018-09-20 13:12:41] [NOTICE] reconnected to upstream node 1 after 8 seconds, resuming monitoring + + + + To unpause repmgrd, execute repmgr daemon unpause, e.g.: + +$ repmgr -f /etc/repmgr.conf daemon unpause +NOTICE: node 1 (node1) unpaused +NOTICE: node 2 (node2) unpaused +NOTICE: node 3 (node3) unpaused + + + + + If the previous primary is no longer accessible when repmgrd + is unpaused, no failover action will be taken. Instead, a new primary must be manually promoted using + repmgr standby promote, + and any standbys attached to the new primary with + repmgr standby follow. + + + This is to prevent repmgr daemon unpause + from resulting in the automatic promotion of a new primary, which may be a problem particularly + in larger clusters, where repmgrd could select a different promotion + candidate to the one intended by the administrator. + + + + + Details on the <application>repmgrd</application> pausing mechanism + + + The pause state of each node will be preserved across a PostgreSQL restart. + + + + repmgr daemon pause and + repmgr daemon unpause can be + executed even if repmgrd is not running; in this case, + repmgrd will start up in whichever pause state has been set. + + + + repmgr daemon pause and + repmgr daemon unpause + do not stop/start repmgrd. 
+ + + + + + + + repmgrd + paused WAL replay + + + repmgrd and paused WAL replay + + If WAL replay has been paused (using pg_wal_replay_pause(), + or pg_xlog_replay_pause() on PostgreSQL 9.6 and earlier), + in a failover situation repmgrd will + automatically resume WAL replay. + + + This is because if WAL replay is paused, but WAL is pending replay, + PostgreSQL cannot be promoted until WAL replay is resumed. + + + + repmgr standby promote + will refuse to promote a node in this state, as the PostgreSQL + promote command will not be acted on until + WAL replay is resumed, leaving the cluster in a potentially + unstable state. In this case it is up to the user to + decide whether to resume WAL replay. + + + + +