From 5f92fbddf2a76cfd2f7ce63b2716a837f21d3dc5 Mon Sep 17 00:00:00 2001 From: Ian Barwick Date: Wed, 13 Mar 2019 16:55:32 +0900 Subject: [PATCH] doc: various updates --- doc/configuring-witness-server.sgml | 8 +-- doc/repmgrd-automatic-failover.sgml | 20 +++--- doc/repmgrd-demonstration.sgml | 96 ----------------------------- doc/repmgrd-overview.sgml | 7 +++ 4 files changed, 24 insertions(+), 107 deletions(-) delete mode 100644 doc/repmgrd-demonstration.sgml diff --git a/doc/configuring-witness-server.sgml b/doc/configuring-witness-server.sgml index 6f798acf..54b0aee9 100644 --- a/doc/configuring-witness-server.sgml +++ b/doc/configuring-witness-server.sgml @@ -1,7 +1,6 @@ witness server - Using a witness server with repmgrd @@ -9,8 +8,9 @@ A is a normal PostgreSQL instance which is not part of the streaming replication cluster; its purpose is, if a - failover situation occurs, to provide proof that the primary server - itself is unavailable. + failover situation occurs, to provide proof that it is the primary server + itself which is unavailable, rather than e.g. a network split between + different physical locations. @@ -53,7 +53,7 @@ in the same physical location as the cluster's primary server. - This instance should *not* be on the same physical host as the primary server, + This instance should not be on the same physical host as the primary server, as otherwise if the primary server fails due to hardware issues, the witness server will be lost too. diff --git a/doc/repmgrd-automatic-failover.sgml b/doc/repmgrd-automatic-failover.sgml index d89b6de5..8d893b06 100644 --- a/doc/repmgrd-automatic-failover.sgml +++ b/doc/repmgrd-automatic-failover.sgml @@ -27,13 +27,13 @@ Using a witness server with repmgrd In a situation caused e.g. 
by a network interruption between two - data centres, it's important to avoid a "split-brain" situation where + data centres, it's important to avoid a "split-brain" situation where both sides of the network assume they are the active segment and the side without an active primary unilaterally promotes one of its standbys. To prevent this situation happening, it's essential to ensure that one - network segment has a "voting majority", so other segments will know + network segment has a "voting majority", so other segments will know they're in the minority and not attempt to promote a new primary. Where an odd number of servers exists, this is not an issue. However, if each network has an even number of nodes, it's necessary to provide some way @@ -41,13 +41,19 @@ This is not a fully-fledged standby node and is not integrated into - replication, but it effectively represents the "casting vote" when + replication, but it effectively represents the "casting vote" when deciding which network segment has a majority. A witness server can - be set up using . Note that it only - makes sense to create a witness server in conjunction with running - repmgrd; the witness server will require its own - repmgrd instance. + be set up using repmgr witness register; + see also section Using a witness server. + + + It only + makes sense to create a witness server in conjunction with running + repmgrd; the witness server will require its own + repmgrd instance. 
+ + diff --git a/doc/repmgrd-demonstration.sgml b/doc/repmgrd-demonstration.sgml deleted file mode 100644 index 2a0530a9..00000000 --- a/doc/repmgrd-demonstration.sgml +++ /dev/null @@ -1,96 +0,0 @@ - - repmgrd demonstration - - To demonstrate automatic failover, set up a 3-node replication cluster (one primary - and two standbys streaming directly from the primary) so that the cluster looks - something like this: - - $ repmgr -f /etc/repmgr.conf cluster show - ID | Name | Role | Status | Upstream | Location | Connection string - ----+-------+---------+-----------+----------+----------+-------------------------------------- - 1 | node1 | primary | * running | | default | host=node1 dbname=repmgr user=repmgr - 2 | node2 | standby | running | node1 | default | host=node2 dbname=repmgr user=repmgr - 3 | node3 | standby | running | node1 | default | host=node3 dbname=repmgr user=repmgr - - - Start repmgrd on each standby and verify that it's running by examining the - log output, which at log level INFO will look like this: - - [2017-08-24 17:31:00] [NOTICE] using configuration file "/etc/repmgr.conf" - [2017-08-24 17:31:00] [INFO] connecting to database "host=node2 dbname=repmgr user=repmgr" - [2017-08-24 17:31:00] [NOTICE] starting monitoring of node node2 (ID: 2) - [2017-08-24 17:31:00] [INFO] monitoring connection to upstream node "node1" (node ID: 1) - - - Each repmgrd should also have recorded its successful startup as an event: - - $ repmgr -f /etc/repmgr.conf cluster event --event=repmgrd_start - Node ID | Name | Event | OK | Timestamp | Details - ---------+-------+---------------+----+---------------------+------------------------------------------------------------- - 3 | node3 | repmgrd_start | t | 2017-08-24 17:35:54 | monitoring connection to upstream node "node1" (node ID: 1) - 2 | node2 | repmgrd_start | t | 2017-08-24 17:35:50 | monitoring connection to upstream node "node1" (node ID: 1) - 1 | node1 | repmgrd_start | t | 2017-08-24 17:35:46 | monitoring 
cluster primary "node1" (node ID: 1) - - - Now stop the current primary server with e.g.: - - pg_ctl -D /var/lib/postgresql/data -m immediate stop - - - This will force the primary to shut down straight away, aborting all processes - and transactions. This will cause a flurry of activity in the repmgrd log - files as each repmgrd detects the failure of the primary and a failover - decision is made. This is an extract from the log of a standby server (node2) - which has promoted to new primary after failure of the original primary (node1). - - [2017-08-24 23:32:01] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state - [2017-08-24 23:32:08] [WARNING] unable to connect to upstream node "node1" (node ID: 1) - [2017-08-24 23:32:08] [INFO] checking state of node 1, 1 of 5 attempts - [2017-08-24 23:32:08] [INFO] sleeping 1 seconds until next reconnection attempt - [2017-08-24 23:32:09] [INFO] checking state of node 1, 2 of 5 attempts - [2017-08-24 23:32:09] [INFO] sleeping 1 seconds until next reconnection attempt - [2017-08-24 23:32:10] [INFO] checking state of node 1, 3 of 5 attempts - [2017-08-24 23:32:10] [INFO] sleeping 1 seconds until next reconnection attempt - [2017-08-24 23:32:11] [INFO] checking state of node 1, 4 of 5 attempts - [2017-08-24 23:32:11] [INFO] sleeping 1 seconds until next reconnection attempt - [2017-08-24 23:32:12] [INFO] checking state of node 1, 5 of 5 attempts - [2017-08-24 23:32:12] [WARNING] unable to reconnect to node 1 after 5 attempts - INFO: setting voting term to 1 - INFO: node 2 is candidate - INFO: node 3 has received request from node 2 for electoral term 1 (our term: 0) - [2017-08-24 23:32:12] [NOTICE] this node is the winner, will now promote self and inform other nodes - INFO: connecting to standby database - NOTICE: promoting standby - DETAIL: promoting server using 'pg_ctl -l /var/log/postgres/startup.log -w -D '/var/lib/pgsql/data' promote' - INFO: reconnecting to promoted server - 
NOTICE: STANDBY PROMOTE successful - DETAIL: node 2 was successfully promoted to primary - INFO: node 3 received notification to follow node 2 - [2017-08-24 23:32:13] [INFO] switching to primary monitoring mode - - - The cluster status will now look like this, with the original primary (node1) - marked as inactive, and standby node3 now following the new primary - (node2): - - $ repmgr -f /etc/repmgr.conf cluster show - ID | Name | Role | Status | Upstream | Location | Connection string - ----+-------+---------+-----------+----------+----------+---------------------------------------------------- - 1 | node1 | primary | - failed | | default | host=node1 dbname=repmgr user=repmgr - 2 | node2 | primary | * running | | default | host=node2 dbname=repmgr user=repmgr - 3 | node3 | standby | running | node2 | default | host=node3 dbname=repmgr user=repmgr - - - - repmgr cluster event will display a summary of what happened to each server - during the failover: - - $ repmgr -f /etc/repmgr.conf cluster event - Node ID | Name | Event | OK | Timestamp | Details - ---------+-------+--------------------------+----+---------------------+----------------------------------------------------------------------------------- - 3 | node3 | repmgrd_failover_follow | t | 2017-08-24 23:32:16 | node 3 now following new upstream node 2 - 3 | node3 | standby_follow | t | 2017-08-24 23:32:16 | node 3 is now attached to node 2 - 2 | node2 | repmgrd_failover_promote | t | 2017-08-24 23:32:13 | node 2 promoted to primary; old primary 1 marked as failed - 2 | node2 | standby_promote | t | 2017-08-24 23:32:13 | node 2 was successfully promoted to primary - - diff --git a/doc/repmgrd-overview.sgml b/doc/repmgrd-overview.sgml index 5ec26447..5be2805a 100644 --- a/doc/repmgrd-overview.sgml +++ b/doc/repmgrd-overview.sgml @@ -29,6 +29,13 @@ 2 | node2 | standby | running | node1 | default | host=node2 dbname=repmgr user=repmgr 3 | node3 | standby | running | node1 | default | host=node3 dbname=repmgr 
user=repmgr + + + + See the section Required configuration for automatic failover + for an example of the minimal repmgr.conf settings suitable for use with repmgrd. + + Start repmgrd on each standby and verify that it's running by examining the log output, which at log level INFO will look like this:
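The overview hunk above points readers at the "Required configuration for automatic failover" section for a minimal repmgr.conf. As a non-authoritative sketch of what such a minimal configuration typically contains — the parameter names are repmgr's documented configuration settings, but the node ID, host name, and paths below are hypothetical examples, not values from this patch:

```ini
# /etc/repmgr.conf -- illustrative minimal settings for a standby running repmgrd
# (node_id, host name and file paths are hypothetical examples)
node_id=2
node_name='node2'
conninfo='host=node2 dbname=repmgr user=repmgr connect_timeout=2'
data_directory='/var/lib/postgresql/data'

# settings needed for automatic failover with repmgrd
failover='automatic'
promote_command='repmgr standby promote -f /etc/repmgr.conf --log-to-file'
follow_command='repmgr standby follow -f /etc/repmgr.conf --log-to-file --upstream-node-id=%n'
```

A witness server, as described in the witness-server hunk, would carry its own repmgr.conf and its own repmgrd instance, and is registered with `repmgr witness register` (run on the witness node, pointed at the cluster primary).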