Further repmgrd documentation

2026-07-16 14:29:05 +00:00 · 2017-10-05 14:32:57 +09:00
parent a4e79d33af
commit fee4569887
4 changed files with 121 additions and 0 deletions
@@ -46,6 +46,7 @@

 <!ENTITY repmgrd-automatic-failover SYSTEM "repmgrd-automatic-failover.sgml">
 <!ENTITY repmgrd-configuration SYSTEM "repmgrd-configuration.sgml">
+<!ENTITY repmgrd-demonstration SYSTEM "repmgrd-demonstration.sgml">

 <!ENTITY repmgr-primary-register SYSTEM "repmgr-primary-register.sgml">
 <!ENTITY repmgr-primary-unregister SYSTEM "repmgr-primary-unregister.sgml">
@@ -79,6 +79,7 @@
  <title>Using repmgrd</title>
  &repmgrd-automatic-failover;
  &repmgrd-configuration;
+  &repmgrd-demonstration;
 </part>

 <part id="repmgr-command-reference">
@@ -47,5 +47,28 @@
  <command>repmgr standby follow</command> will result in the node continuing to follow
  the original primary.
 </para>
+ <sect1 id="repmgrd-connection-settings">
+ <title>repmgrd connection settings</title>
+ <para>
+  In addition to the &repmgr; configuration settings, parameters in the
+  <varname>conninfo</varname> string influence how &repmgr; makes a network connection to
+  PostgreSQL. In particular, if another server in the replication cluster
+  is unreachable at network level, system network settings will influence
+  the length of time it takes to determine that the connection is not possible.
+ </para>
+ <para>
+  Explicitly setting <literal>connect_timeout</literal> should be considered;
+  the effective minimum value of <literal>2</literal> (seconds) ensures that a
+  connection failure at network level is reported as soon as possible. Otherwise,
+  depending on the system settings (e.g. <varname>tcp_syn_retries</varname> in
+  Linux), a delay of a minute or more is possible.
+ </para>
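+ <para>
+  As an illustrative sketch (the host, database and user names are placeholder
+  values, not required ones), a <varname>conninfo</varname> setting in
+  <filename>repmgr.conf</filename> with an explicit timeout might look like this:
+  <programlisting>
+    conninfo='host=node2 dbname=repmgr user=repmgr connect_timeout=2'</programlisting>
+ </para>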
+ <para>
+  For further details on <varname>conninfo</varname> network connection
+  parameters, see the
+  <ulink url="https://www.postgresql.org/docs/current/static/libpq-connect.html#LIBPQ-PARAMKEYWORDS">PostgreSQL documentation</ulink>.
+ </para>
+ </sect1>

 </chapter>
@@ -0,0 +1,96 @@
+<chapter id="repmgrd-demonstration">
+ <title>repmgrd demonstration</title>
+ <para>
+  To demonstrate automatic failover, set up a 3-node replication cluster (one primary
+  and two standbys streaming directly from the primary) so that the cluster looks
+  something like this:
+  <programlisting>
+    $ repmgr -f /etc/repmgr.conf cluster show
+     ID | Name  | Role    | Status    | Upstream | Location | Connection string
+    ----+-------+---------+-----------+----------+----------+--------------------------------------
+     1  | node1 | primary | * running |          | default  | host=node1 dbname=repmgr user=repmgr
+     2  | node2 | standby |   running | node1    | default  | host=node2 dbname=repmgr user=repmgr
+     3  | node3 | standby |   running | node1    | default  | host=node3 dbname=repmgr user=repmgr  </programlisting>
+ </para>
+ <para>
+  Start <command>repmgrd</command> on each standby and verify that it's running by examining the
+  log output, which at log level <literal>INFO</literal> will look like this:
+  <programlisting>
+    [2017-08-24 17:31:00] [NOTICE] using configuration file "/etc/repmgr.conf"
+    [2017-08-24 17:31:00] [INFO] connecting to database "host=node2 dbname=repmgr user=repmgr"
+    [2017-08-24 17:31:00] [NOTICE] starting monitoring of node "node2" (ID: 2)
+    [2017-08-24 17:31:00] [INFO] monitoring connection to upstream node "node1" (node ID: 1)  </programlisting>
+ </para>
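+ <para>
+  As a minimal sketch, assuming the configuration file is located at
+  <filename>/etc/repmgr.conf</filename> as in the other examples, <command>repmgrd</command>
+  could be started on a standby with:
+  <programlisting>
+    $ repmgrd -f /etc/repmgr.conf</programlisting>
+  How the process is daemonized and where its log output is written depend on the
+  local installation and are not shown here.
+ </para>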
+ <para>
+  Each <command>repmgrd</command> should also have recorded its successful startup as an event:
+  <programlisting>
+    $ repmgr -f /etc/repmgr.conf cluster event --event=repmgrd_start
+     Node ID | Name  | Event         | OK | Timestamp           | Details
+    ---------+-------+---------------+----+---------------------+-------------------------------------------------------------
+     3       | node3 | repmgrd_start | t  | 2017-08-24 17:35:54 | monitoring connection to upstream node "node1" (node ID: 1)
+     2       | node2 | repmgrd_start | t  | 2017-08-24 17:35:50 | monitoring connection to upstream node "node1" (node ID: 1)
+     1       | node1 | repmgrd_start | t  | 2017-08-24 17:35:46 | monitoring cluster primary "node1" (node ID: 1)  </programlisting>
+ </para>
+ <para>
+  Now stop the current primary server, e.g. with:
+  <programlisting>
+    pg_ctl -D /var/lib/postgresql/data -m immediate stop</programlisting>
+ </para>
+ <para>
+  This will force the primary to shut down straight away, aborting all processes
+  and transactions, and cause a flurry of activity in the <command>repmgrd</command> log
+  files as each <command>repmgrd</command> detects the failure of the primary and a failover
+  decision is made. This is an extract from the log of a standby server (<literal>node2</literal>)
+  which has been promoted to new primary after failure of the original primary (<literal>node1</literal>):
+  <programlisting>
+    [2017-08-24 23:32:01] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
+    [2017-08-24 23:32:08] [WARNING] unable to connect to upstream node "node1" (node ID: 1)
+    [2017-08-24 23:32:08] [INFO] checking state of node 1, 1 of 5 attempts
+    [2017-08-24 23:32:08] [INFO] sleeping 1 seconds until next reconnection attempt
+    [2017-08-24 23:32:09] [INFO] checking state of node 1, 2 of 5 attempts
+    [2017-08-24 23:32:09] [INFO] sleeping 1 seconds until next reconnection attempt
+    [2017-08-24 23:32:10] [INFO] checking state of node 1, 3 of 5 attempts
+    [2017-08-24 23:32:10] [INFO] sleeping 1 seconds until next reconnection attempt
+    [2017-08-24 23:32:11] [INFO] checking state of node 1, 4 of 5 attempts
+    [2017-08-24 23:32:11] [INFO] sleeping 1 seconds until next reconnection attempt
+    [2017-08-24 23:32:12] [INFO] checking state of node 1, 5 of 5 attempts
+    [2017-08-24 23:32:12] [WARNING] unable to reconnect to node 1 after 5 attempts
+    INFO:  setting voting term to 1
+    INFO:  node 2 is candidate
+    INFO:  node 3 has received request from node 2 for electoral term 1 (our term: 0)
+    [2017-08-24 23:32:12] [NOTICE] this node is the winner, will now promote self and inform other nodes
+    INFO: connecting to standby database
+    NOTICE: promoting standby
+    DETAIL: promoting server using '/home/barwick/devel/builds/HEAD/bin/pg_ctl -l /tmp/postgres.5602.log -w -D '/tmp/repmgr-test/node_2/data' promote'
+    INFO: reconnecting to promoted server
+    NOTICE: STANDBY PROMOTE successful
+    DETAIL: node 2 was successfully promoted to primary
+    INFO:  node 3 received notification to follow node 2
+    [2017-08-24 23:32:13] [INFO] switching to primary monitoring mode</programlisting>
+ </para>
+ <para>
+  The cluster status will now look like this, with the original primary (<literal>node1</literal>)
+  marked as failed, and standby <literal>node3</literal> now following the new primary
+  (<literal>node2</literal>):
+  <programlisting>
+    $ repmgr -f /etc/repmgr.conf cluster show
+     ID | Name  | Role    | Status    | Upstream | Location | Connection string
+    ----+-------+---------+-----------+----------+----------+----------------------------------------------------
+     1  | node1 | primary | - failed  |          | default  | host=node1 dbname=repmgr user=repmgr
+     2  | node2 | primary | * running |          | default  | host=node2 dbname=repmgr user=repmgr
+     3  | node3 | standby |   running | node2    | default  | host=node3 dbname=repmgr user=repmgr</programlisting>
+
+ </para>
+ <para>
+  <command>repmgr cluster event</command> will display a summary of what happened to each server
+  during the failover:
+  <programlisting>
+    $ repmgr -f /etc/repmgr.conf cluster event
+     Node ID | Name  | Event                    | OK | Timestamp           | Details
+    ---------+-------+--------------------------+----+---------------------+-----------------------------------------------------------------------------------
+     3       | node3 | repmgrd_failover_follow  | t  | 2017-08-24 23:32:16 | node 3 now following new upstream node 2
+     3       | node3 | standby_follow           | t  | 2017-08-24 23:32:16 | node 3 is now attached to node 2
+     2       | node2 | repmgrd_failover_promote | t  | 2017-08-24 23:32:13 | node 2 promoted to primary; old primary 1 marked as failed
+     2       | node2 | standby_promote          | t  | 2017-08-24 23:32:13 | node 2 was successfully promoted to primary</programlisting>
+ </para>
+</chapter>