diff --git a/doc/filelist.sgml b/doc/filelist.sgml
index 6587bc05..940cf68b 100644
--- a/doc/filelist.sgml
+++ b/doc/filelist.sgml
@@ -46,6 +46,7 @@
+<!ENTITY repmgrd-demonstration SYSTEM "repmgrd-demonstration.sgml">
diff --git a/doc/repmgr.sgml b/doc/repmgr.sgml
index 5a79167a..11bb2b5d 100644
--- a/doc/repmgr.sgml
+++ b/doc/repmgr.sgml
@@ -79,6 +79,7 @@ Using repmgrd
 &repmgrd-automatic-failover;
 &repmgrd-configuration;
+&repmgrd-demonstration;
diff --git a/doc/repmgrd-configuration.sgml b/doc/repmgrd-configuration.sgml
index 621161d4..07f4c82b 100644
--- a/doc/repmgrd-configuration.sgml
+++ b/doc/repmgrd-configuration.sgml
@@ -47,5 +47,28 @@
  repmgr standby follow will result in the node continuing to follow
  the original primary.
+
+ <sect1 id="repmgrd-connection-settings">
+  <title>repmgrd connection settings</title>
+  <para>
+   In addition to the &repmgr; configuration settings, parameters in the
+   conninfo string influence how &repmgr; makes a network connection to
+   PostgreSQL. In particular, if another server in the replication cluster
+   is unreachable at network level, system network settings will influence
+   the length of time it takes to determine that the connection is not possible.
+  </para>
+  <para>
+   Consider setting connect_timeout explicitly; its effective minimum value
+   of 2 (seconds) ensures that a connection failure at network level is
+   reported as soon as possible. Otherwise, depending on the system settings
+   (e.g. tcp_syn_retries in Linux), a delay of a minute or more is possible.
+  </para>
+  <para>
+   For further details on conninfo network connection
+   parameters, see the PostgreSQL documentation.
+  </para>
+ </sect1>
diff --git a/doc/repmgrd-demonstration.sgml b/doc/repmgrd-demonstration.sgml
new file mode 100644
index 00000000..401c2db5
--- /dev/null
+++ b/doc/repmgrd-demonstration.sgml
@@ -0,0 +1,96 @@
+<chapter id="repmgrd-demonstration">
+ <title>repmgrd demonstration</title>
+
+ <para>
+  To demonstrate automatic failover, set up a 3-node replication cluster (one primary
+  and two standbys streaming directly from the primary) so that the cluster looks
+  something like this:
+  <programlisting>
+   $ repmgr -f /etc/repmgr.conf cluster show
+    ID | Name  | Role    | Status    | Upstream | Location | Connection string
+   ----+-------+---------+-----------+----------+----------+--------------------------------------
+    1  | node1 | primary | * running |          | default  | host=node1 dbname=repmgr user=repmgr
+    2  | node2 | standby |   running | node1    | default  | host=node2 dbname=repmgr user=repmgr
+    3  | node3 | standby |   running | node1    | default  | host=node3 dbname=repmgr user=repmgr
+  </programlisting>
+ </para>
+
+ <para>
+  Start repmgrd on each standby and verify that it is running by examining the
+  log output, which at log level INFO will look like this:
+  <programlisting>
+   [2017-08-24 17:31:00] [NOTICE] using configuration file "/etc/repmgr.conf"
+   [2017-08-24 17:31:00] [INFO] connecting to database "host=node2 dbname=repmgr user=repmgr"
+   [2017-08-24 17:31:00] [NOTICE] starting monitoring of node node2 (ID: 2)
+   [2017-08-24 17:31:00] [INFO] monitoring connection to upstream node "node1" (node ID: 1)
+  </programlisting>
+ </para>
+
+ <para>
+  Each repmgrd should also have recorded its successful startup as an event:
+  <programlisting>
+   $ repmgr -f /etc/repmgr.conf cluster event --event=repmgrd_start
+    Node ID | Name  | Event         | OK | Timestamp           | Details
+   ---------+-------+---------------+----+---------------------+--------------------------------------------------------------
+    3       | node3 | repmgrd_start | t  | 2017-08-24 17:35:54 | monitoring connection to upstream node "node1" (node ID: 1)
+    2       | node2 | repmgrd_start | t  | 2017-08-24 17:35:50 | monitoring connection to upstream node "node1" (node ID: 1)
+    1       | node1 | repmgrd_start | t  | 2017-08-24 17:35:46 | monitoring cluster primary "node1" (node ID: 1)
+  </programlisting>
+ </para>
+
+ <para>
+  Now stop the current primary server, e.g. with:
+  <programlisting>
+   pg_ctl -D /var/lib/postgresql/data -m immediate stop
+  </programlisting>
+ </para>
+
+ <para>
+  This forces the primary to shut down straight away, aborting all processes
+  and transactions, and causes a flurry of activity in the repmgrd log files
+  as each repmgrd detects the failure of the primary and a failover decision
+  is made. The following is an extract from the log of the standby server (node2)
+  which has promoted itself to the new primary after the failure of the original
+  primary (node1):
+  <programlisting>
+   [2017-08-24 23:32:01] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
+   [2017-08-24 23:32:08] [WARNING] unable to connect to upstream node "node1" (node ID: 1)
+   [2017-08-24 23:32:08] [INFO] checking state of node 1, 1 of 5 attempts
+   [2017-08-24 23:32:08] [INFO] sleeping 1 seconds until next reconnection attempt
+   [2017-08-24 23:32:09] [INFO] checking state of node 1, 2 of 5 attempts
+   [2017-08-24 23:32:09] [INFO] sleeping 1 seconds until next reconnection attempt
+   [2017-08-24 23:32:10] [INFO] checking state of node 1, 3 of 5 attempts
+   [2017-08-24 23:32:10] [INFO] sleeping 1 seconds until next reconnection attempt
+   [2017-08-24 23:32:11] [INFO] checking state of node 1, 4 of 5 attempts
+   [2017-08-24 23:32:11] [INFO] sleeping 1 seconds until next reconnection attempt
+   [2017-08-24 23:32:12] [INFO] checking state of node 1, 5 of 5 attempts
+   [2017-08-24 23:32:12] [WARNING] unable to reconnect to node 1 after 5 attempts
+   INFO: setting voting term to 1
+   INFO: node 2 is candidate
+   INFO: node 3 has received request from node 2 for electoral term 1 (our term: 0)
+   [2017-08-24 23:32:12] [NOTICE] this node is the winner, will now promote self and inform other nodes
+   INFO: connecting to standby database
+   NOTICE: promoting standby
+   DETAIL: promoting server using '/home/barwick/devel/builds/HEAD/bin/pg_ctl -l /tmp/postgres.5602.log -w -D '/tmp/repmgr-test/node_2/data' promote'
+   INFO: reconnecting to promoted server
+   NOTICE: STANDBY PROMOTE successful
+   DETAIL: node 2 was successfully promoted to primary
+   INFO: node 3 received notification to follow node 2
+   [2017-08-24 23:32:13] [INFO] switching to primary monitoring mode
+  </programlisting>
+ </para>
+
+ <para>
+  The cluster status will now look like this, with the original primary (node1)
+  marked as inactive, and standby node3 now following the new primary (node2):
+  <programlisting>
+   $ repmgr -f /etc/repmgr.conf cluster show
+    ID | Name  | Role    | Status    | Upstream | Location | Connection string
+   ----+-------+---------+-----------+----------+----------+----------------------------------------------------
+    1  | node1 | primary | - failed  |          | default  | host=node1 dbname=repmgr user=repmgr
+    2  | node2 | primary | * running |          | default  | host=node2 dbname=repmgr user=repmgr
+    3  | node3 | standby |   running | node2    | default  | host=node3 dbname=repmgr user=repmgr
+  </programlisting>
+ </para>
+
+ <para>
+  repmgr cluster event will display a summary of what happened to each server
+  during the failover:
+  <programlisting>
+   $ repmgr -f /etc/repmgr.conf cluster event
+    Node ID | Name  | Event                    | OK | Timestamp           | Details
+   ---------+-------+--------------------------+----+---------------------+-------------------------------------------------------------
+    3       | node3 | repmgrd_failover_follow  | t  | 2017-08-24 23:32:16 | node 3 now following new upstream node 2
+    3       | node3 | standby_follow           | t  | 2017-08-24 23:32:16 | node 3 is now attached to node 2
+    2       | node2 | repmgrd_failover_promote | t  | 2017-08-24 23:32:13 | node 2 promoted to primary; old primary 1 marked as failed
+    2       | node2 | standby_promote          | t  | 2017-08-24 23:32:13 | node 2 was successfully promoted to primary
+  </programlisting>
+ </para>
+</chapter>
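
For reference, a minimal repmgr.conf for one of the standbys in the demonstration (node2) might look like the sketch below. The conninfo string and data directory are taken from the examples in the patch; the promote_command and follow_command values are illustrative assumptions rather than settings from this patch, and the reconnect settings are chosen to match the "1 of 5 attempts" / "sleeping 1 seconds" retry sequence in the log extract. The explicit connect_timeout follows the recommendation in the new "repmgrd connection settings" section.

    # hypothetical /etc/repmgr.conf for node2 in the demonstration
    node_id=2
    node_name='node2'
    # explicit connect_timeout so an unreachable peer is reported after ~2 seconds,
    # rather than after the kernel's TCP retry cycle (cf. tcp_syn_retries on Linux)
    conninfo='host=node2 dbname=repmgr user=repmgr connect_timeout=2'
    data_directory='/var/lib/postgresql/data'

    # enable automatic failover in repmgrd; commands are illustrative examples
    failover=automatic
    promote_command='repmgr standby promote -f /etc/repmgr.conf'
    follow_command='repmgr standby follow -f /etc/repmgr.conf --upstream-node-id=%n'

    # retry schedule matching the log extract: five attempts, one second apart
    reconnect_attempts=5
    reconnect_interval=1

With these reconnect settings, repmgrd gives up on the failed primary after roughly five seconds of retries before initiating an election, which is consistent with the gap between the first "unable to connect" warning (23:32:08) and the promotion (23:32:13) in the log extract above.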