From 2e67bc1341b33c3577680b7b1f7a53b155c01cf3 Mon Sep 17 00:00:00 2001
From: Ian Barwick
Date: Wed, 13 Mar 2019 14:53:56 +0900
Subject: [PATCH] doc: merge repmgrd pause documentation into overview

---
 doc/filelist.sgml         |   3 +-
 doc/repmgr.sgml           |   3 +-
 doc/repmgrd-overview.sgml | 122 +++++++++++++++++++++++---
 doc/repmgrd-pausing.sgml  | 178 --------------------------------------
 4 files changed, 112 insertions(+), 194 deletions(-)
 delete mode 100644 doc/repmgrd-pausing.sgml

diff --git a/doc/filelist.sgml b/doc/filelist.sgml
index 2d08b9b5..1e240de6 100644
--- a/doc/filelist.sgml
+++ b/doc/filelist.sgml
@@ -53,12 +53,11 @@
-
+
-
diff --git a/doc/repmgr.sgml b/doc/repmgr.sgml
index beebefbd..0f2b4888 100644
--- a/doc/repmgr.sgml
+++ b/doc/repmgr.sgml
@@ -83,10 +83,9 @@
  &repmgrd-overview;
  &repmgrd-automatic-failover;
  &repmgrd-configuration;
- &repmgrd-demonstration;
+ &repmgrd-operation;
  &repmgrd-network-split;
  &repmgrd-witness-server;
- &repmgrd-pausing;
  &repmgrd-degraded-monitoring;
  &repmgrd-monitoring;
  &repmgrd-notes;
diff --git a/doc/repmgrd-overview.sgml b/doc/repmgrd-overview.sgml
index 1d2d7fae..5ec26447 100644
--- a/doc/repmgrd-overview.sgml
+++ b/doc/repmgrd-overview.sgml
@@ -1,17 +1,115 @@
+
+
+ repmgrd
+ overview
+
-
-
- repmgrd
- overview
-
+ repmgrd overview
- repmgrd overview
+
+ repmgrd ("replication manager daemon")
+ is a management and monitoring daemon which runs
+ on each node in a replication cluster. It can automate actions such as
+ failover and updating standbys to follow the new primary, as well as
+ providing monitoring information about the state of each standby.
+
-
- repmgrd ("replication manager daemon")
- is a management and monitoring daemon which runs
- on each node in a replication cluster. It can automate actions such as
- failover and updating standbys to follow the new primary, as well as
- providing monitoring information about the state of each standby.
+
+
+ repmgrd demonstration
+
+ To demonstrate automatic failover, set up a 3-node replication cluster (one primary
+ and two standbys streaming directly from the primary) so that the cluster looks
+ something like this:
+
+ $ repmgr -f /etc/repmgr.conf cluster show
+  ID | Name  | Role    | Status    | Upstream | Location | Connection string
+ ----+-------+---------+-----------+----------+----------+--------------------------------------
+   1 | node1 | primary | * running |          | default  | host=node1 dbname=repmgr user=repmgr
+   2 | node2 | standby |   running | node1    | default  | host=node2 dbname=repmgr user=repmgr
+   3 | node3 | standby |   running | node1    | default  | host=node3 dbname=repmgr user=repmgr
+
+ Start repmgrd on each standby and verify that it's running by examining the
+ log output, which at log level INFO will look like this:
+
+ [2017-08-24 17:31:00] [NOTICE] using configuration file "/etc/repmgr.conf"
+ [2017-08-24 17:31:00] [INFO] connecting to database "host=node2 dbname=repmgr user=repmgr"
+ [2017-08-24 17:31:00] [NOTICE] starting monitoring of node node2 (ID: 2)
+ [2017-08-24 17:31:00] [INFO] monitoring connection to upstream node "node1" (node ID: 1)
+
+
+ Each repmgrd should also have recorded its successful startup as an event:
+
+ $ repmgr -f /etc/repmgr.conf cluster event --event=repmgrd_start
+  Node ID | Name  | Event         | OK | Timestamp           | Details
+ ---------+-------+---------------+----+---------------------+-------------------------------------------------------------
+        3 | node3 | repmgrd_start | t  | 2017-08-24 17:35:54 | monitoring connection to upstream node "node1" (node ID: 1)
+        2 | node2 | repmgrd_start | t  | 2017-08-24 17:35:50 | monitoring connection to upstream node "node1" (node ID: 1)
+        1 | node1 | repmgrd_start | t  | 2017-08-24 17:35:46 | monitoring cluster primary "node1" (node ID: 1)
+
+
+ Now stop the current primary server with e.g.:
+
+ pg_ctl -D /var/lib/postgresql/data -m immediate stop
+
+
+ This will force the primary to shut down straight away, aborting all processes
+ and transactions. This will cause a flurry of activity in the repmgrd log
+ files as each repmgrd detects the failure of the primary and a failover
+ decision is made. This is an extract from the log of a standby server (node2)
+ which has promoted to new primary after failure of the original primary (node1).
+
+ [2017-08-24 23:32:01] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
+ [2017-08-24 23:32:08] [WARNING] unable to connect to upstream node "node1" (node ID: 1)
+ [2017-08-24 23:32:08] [INFO] checking state of node 1, 1 of 5 attempts
+ [2017-08-24 23:32:08] [INFO] sleeping 1 seconds until next reconnection attempt
+ [2017-08-24 23:32:09] [INFO] checking state of node 1, 2 of 5 attempts
+ [2017-08-24 23:32:09] [INFO] sleeping 1 seconds until next reconnection attempt
+ [2017-08-24 23:32:10] [INFO] checking state of node 1, 3 of 5 attempts
+ [2017-08-24 23:32:10] [INFO] sleeping 1 seconds until next reconnection attempt
+ [2017-08-24 23:32:11] [INFO] checking state of node 1, 4 of 5 attempts
+ [2017-08-24 23:32:11] [INFO] sleeping 1 seconds until next reconnection attempt
+ [2017-08-24 23:32:12] [INFO] checking state of node 1, 5 of 5 attempts
+ [2017-08-24 23:32:12] [WARNING] unable to reconnect to node 1 after 5 attempts
+ INFO: setting voting term to 1
+ INFO: node 2 is candidate
+ INFO: node 3 has received request from node 2 for electoral term 1 (our term: 0)
+ [2017-08-24 23:32:12] [NOTICE] this node is the winner, will now promote self and inform other nodes
+ INFO: connecting to standby database
+ NOTICE: promoting standby
+ DETAIL: promoting server using 'pg_ctl -l /var/log/postgres/startup.log -w -D '/var/lib/pgsql/data' promote'
+ INFO: reconnecting to promoted server
+ NOTICE: STANDBY PROMOTE successful
+ DETAIL: node 2 was successfully promoted to primary
+ INFO: node 3 received notification to follow node 2
+ [2017-08-24 23:32:13] [INFO] switching to primary monitoring mode
+
+
+ The cluster status will now look like this, with the original primary (node1)
+ marked as inactive, and standby node3 now following the new primary
+ (node2):
+
+ $ repmgr -f /etc/repmgr.conf cluster show
+  ID | Name  | Role    | Status    | Upstream | Location | Connection string
+ ----+-------+---------+-----------+----------+----------+----------------------------------------------------
+   1 | node1 | primary | - failed  |          | default  | host=node1 dbname=repmgr user=repmgr
+   2 | node2 | primary | * running |          | default  | host=node2 dbname=repmgr user=repmgr
+   3 | node3 | standby |   running | node2    | default  | host=node3 dbname=repmgr user=repmgr
+
+
+
+ repmgr cluster event will display a summary of what happened to each server
+ during the failover:
+
+ $ repmgr -f /etc/repmgr.conf cluster event
+  Node ID | Name  | Event                    | OK | Timestamp           | Details
+ ---------+-------+--------------------------+----+---------------------+-----------------------------------------------------------------------------------
+        3 | node3 | repmgrd_failover_follow  | t  | 2017-08-24 23:32:16 | node 3 now following new upstream node 2
+        3 | node3 | standby_follow           | t  | 2017-08-24 23:32:16 | node 3 is now attached to node 2
+        2 | node2 | repmgrd_failover_promote | t  | 2017-08-24 23:32:13 | node 2 promoted to primary; old primary 1 marked as failed
+        2 | node2 | standby_promote          | t  | 2017-08-24 23:32:13 | node 2 was successfully promoted to primary
+
+
+
diff --git a/doc/repmgrd-pausing.sgml b/doc/repmgrd-pausing.sgml
deleted file mode 100644
index b24a3624..00000000
--- a/doc/repmgrd-pausing.sgml
+++ /dev/null
@@ -1,178 +0,0 @@
-
-
-
- repmgrd
- pausing
-
-
-
- pausing repmgrd
-
-
- Pausing repmgrd
-
-
- In normal operation, repmgrd monitors the state of the
- PostgreSQL node it is running on, and will take appropriate action if problems
- are detected, e.g. (if so configured) promote the node to primary, if the existing
- primary has been determined as failed.
-
-
- However, repmgrd is unable to distinguish between
- planned outages (such as performing a switchover
- or installing PostgreSQL maintenance released), and an actual server outage. In versions prior to
- &repmgr; 4.2 it was necessary to stop repmgrd on all nodes (or at least
- on all nodes where repmgrd is
- configured for automatic failover)
- to prevent repmgrd from making unintentional changes to the
- replication cluster.
-
-
-
- From &repmgr; 4.2, repmgrd
- can now be "paused", i.e. instructed not to take any action such as performing a failover.
- This can be done from any node in the cluster, removing the need to stop/restart
- each repmgrd individually.
-
-
-
-
- For major PostgreSQL upgrades, e.g. from PostgreSQL 10 to PostgreSQL 11,
- repmgrd should be shut down completely and only started up
- once the &repmgr; packages for the new PostgreSQL major version have been installed.
-
-
-
-
- Prerequisites for pausing <application>repmgrd</application>
-
- In order to be able to pause/unpause repmgrd, following
- prerequisites must be met:
-
-
-
- &repmgr; 4.2 or later must be installed on all nodes.
-
-
-
- The same major &repmgr; version (e.g. 4.2) must be installed on all nodes (and preferably the same minor version).
-
-
-
-
- PostgreSQL on all nodes must be accessible from the node where the
- pause/unpause operation is executed, using the
- conninfo string shown by repmgr cluster show.
-
-
-
-
-
-
- These conditions are required for normal &repmgr; operation in any case.
-
-
-
-
-
-
- Pausing/unpausing <application>repmgrd</application>
-
- To pause repmgrd, execute repmgr daemon pause, e.g.:
-
-$ repmgr -f /etc/repmgr.conf daemon pause
-NOTICE: node 1 (node1) paused
-NOTICE: node 2 (node2) paused
-NOTICE: node 3 (node3) paused
-
-
- The state of repmgrd on each node can be checked with
- repmgr daemon status, e.g.:
- $ repmgr -f /etc/repmgr.conf daemon status
- ID | Name  | Role    | Status  | repmgrd | PID  | Paused?
-----+-------+---------+---------+---------+------+---------
-  1 | node1 | primary | running | running | 7851 | yes
-  2 | node2 | standby | running | running | 7889 | yes
-  3 | node3 | standby | running | running | 7918 | yes
-
-
-
-
- If executing a switchover with repmgr standby switchover,
- &repmgr; will automatically pause/unpause repmgrd as part of the switchover process.
-
-
-
-
- If the primary (in this example, node1) is stopped, repmgrd
- running on one of the standbys (here: node2) will react like this:
-
-[2018-09-20 12:22:21] [WARNING] unable to connect to upstream node "node1" (node ID: 1)
-[2018-09-20 12:22:21] [INFO] checking state of node 1, 1 of 5 attempts
-[2018-09-20 12:22:21] [INFO] sleeping 1 seconds until next reconnection attempt
-...
-[2018-09-20 12:22:24] [INFO] sleeping 1 seconds until next reconnection attempt
-[2018-09-20 12:22:25] [INFO] checking state of node 1, 5 of 5 attempts
-[2018-09-20 12:22:25] [WARNING] unable to reconnect to node 1 after 5 attempts
-[2018-09-20 12:22:25] [NOTICE] node is paused
-[2018-09-20 12:22:33] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in degraded state
-[2018-09-20 12:22:33] [DETAIL] repmgrd paused by administrator
-[2018-09-20 12:22:33] [HINT] execute "repmgr daemon unpause" to resume normal failover mode
-
-
- If the primary becomes available again (e.g. following a software upgrade), repmgrd
- will automatically reconnect, e.g.:
-
-[2018-09-20 13:12:41] [NOTICE] reconnected to upstream node 1 after 8 seconds, resuming monitoring
-
-
-
- To unpause repmgrd, execute repmgr daemon unpause, e.g.:
-
-$ repmgr -f /etc/repmgr.conf daemon unpause
-NOTICE: node 1 (node1) unpaused
-NOTICE: node 2 (node2) unpaused
-NOTICE: node 3 (node3) unpaused
-
-
-
-
- If the previous primary is no longer accessible when repmgrd
- is unpaused, no failover action will be taken. Instead, a new primary must be manually promoted using
- repmgr standby promote,
- and any standbys attached to the new primary with
- repmgr standby follow.
-
-
- This is to prevent repmgr daemon unpause
- resulting in the automatic promotion of a new primary, which may be a problem particularly
- in larger clusters, where repmgrd could select a different promotion
- candidate to the one intended by the administrator.
-
-
-
-
- Details on the <application>repmgrd</application> pausing mechanism
-
-
- The pause state of each node will be stored over a PostgreSQL restart.
-
-
-
- repmgr daemon pause and
- repmgr daemon unpause can be
- executed even if repmgrd is not running; in this case,
- repmgrd will start up in whichever pause state has been set.
-
-
-
- repmgr daemon pause and
- repmgr daemon unpause
- do not stop/start repmgrd.
-
-
-
-
-
-