Further repmgrd documentation

2026-06-01 03:39:05 +00:00 · 2017-10-05 14:32:57 +09:00
parent a4e79d33af
commit fee4569887
4 changed files with 121 additions and 0 deletions
@@ -46,6 +46,7 @@
 <!ENTITY repmgrd-automatic-failover SYSTEM "repmgrd-automatic-failover.sgml">
 <!ENTITY repmgrd-configuration SYSTEM "repmgrd-configuration.sgml">
 <!ENTITY repmgrd-demonstration SYSTEM "repmgrd-demonstration.sgml">
 <!ENTITY repmgr-primary-register SYSTEM "repmgr-primary-register.sgml">
 <!ENTITY repmgr-primary-unregister SYSTEM "repmgr-primary-unregister.sgml">
@@ -79,6 +79,7 @@
  <title>Using repmgrd</title>
  &repmgrd-automatic-failover;
  &repmgrd-configuration;
  &repmgrd-demonstration;
 </part>
 <part id="repmgr-command-reference">
@@ -47,5 +47,28 @@
  <command>repmgr standby follow</command> will result in the node continuing to follow
  the original primary.
 </para>
 <sect1 id="repmgrd-connection-settings">
 <title>repmgrd connection settings</title>
 <para>
  In addition to the &repmgr; configuration settings, parameters in the
  <varname>conninfo</varname> string influence how &repmgr; makes a network connection to
  PostgreSQL. In particular, if another server in the replication cluster
  is unreachable at network level, system network settings will influence
  the length of time it takes to determine that the connection is not possible.
 </para>
 <para>
  Consider explicitly setting <literal>connect_timeout</literal>; the effective
  minimum value of <literal>2</literal> (seconds) ensures that a connection
  failure at network level is reported as soon as possible. Otherwise, depending
  on the system settings (e.g. <varname>tcp_syn_retries</varname> in Linux), a
  delay of a minute or more is possible.
 </para>
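 <para>
  As a minimal sketch, <literal>connect_timeout</literal> can be appended to the
  <varname>conninfo</varname> string in <filename>repmgr.conf</filename>. The host,
  database and user names shown here are illustrative values and should be
  replaced with those of the node being configured:
  <programlisting>
    # illustrative values; adjust host, dbname and user for the node in question
    conninfo='host=node2 dbname=repmgr user=repmgr connect_timeout=2'</programlisting>
 </para>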
 <para>
  For further details on <varname>conninfo</varname> network connection
  parameters, see the
  <ulink url="https://www.postgresql.org/docs/current/static/libpq-connect.html#LIBPQ-PARAMKEYWORDS">PostgreSQL documentation</ulink>.
 </para>
 </sect1>
 </chapter>
@@ -0,0 +1,96 @@
 <chapter id="repmgrd-demonstration">
 <title>repmgrd demonstration</title>
 <para>
  To demonstrate automatic failover, set up a 3-node replication cluster (one primary
  and two standbys streaming directly from the primary) so that the cluster looks
  something like this:
  <programlisting>
    $ repmgr -f /etc/repmgr.conf cluster show
     ID | Name  | Role    | Status    | Upstream | Location | Connection string
    ----+-------+---------+-----------+----------+----------+--------------------------------------
     1  | node1 | primary | * running |          | default  | host=node1 dbname=repmgr user=repmgr
     2  | node2 | standby |   running | node1    | default  | host=node2 dbname=repmgr user=repmgr
     3  | node3 | standby |   running | node1    | default  | host=node3 dbname=repmgr user=repmgr  </programlisting>
 </para>
 <para>
  Start <command>repmgrd</command> on each standby and verify that it's running by examining the
  log output, which at log level <literal>INFO</literal> will look like this:
  <programlisting>
    [2017-08-24 17:31:00] [NOTICE] using configuration file "/etc/repmgr.conf"
    [2017-08-24 17:31:00] [INFO] connecting to database "host=node2 dbname=repmgr user=repmgr"
    [2017-08-24 17:31:00] [NOTICE] starting monitoring of node "node2" (ID: 2)
    [2017-08-24 17:31:00] [INFO] monitoring connection to upstream node "node1" (node ID: 1)  </programlisting>
 </para>
 <para>
  Each <command>repmgrd</command> should also have recorded its successful startup as an event:
  <programlisting>
    $ repmgr -f /etc/repmgr.conf cluster event --event=repmgrd_start
     Node ID | Name  | Event         | OK | Timestamp           | Details
    ---------+-------+---------------+----+---------------------+-------------------------------------------------------------
     3       | node3 | repmgrd_start | t  | 2017-08-24 17:35:54 | monitoring connection to upstream node "node1" (node ID: 1)
     2       | node2 | repmgrd_start | t  | 2017-08-24 17:35:50 | monitoring connection to upstream node "node1" (node ID: 1)
     1       | node1 | repmgrd_start | t  | 2017-08-24 17:35:46 | monitoring cluster primary "node1" (node ID: 1)  </programlisting>
 </para>
 <para>
  Now stop the current primary server with e.g.:
  <programlisting>
    pg_ctl -D /var/lib/postgresql/data -m immediate stop</programlisting>
 </para>
 <para>
  This will force the primary to shut down straight away, aborting all processes
  and transactions. This will cause a flurry of activity in the <command>repmgrd</command> log
  files as each <command>repmgrd</command> detects the failure of the primary and a failover
  decision is made. The following is an extract from the log of a standby server (<literal>node2</literal>)
  which has been promoted to new primary after failure of the original primary (<literal>node1</literal>):
  <programlisting>
    [2017-08-24 23:32:01] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
    [2017-08-24 23:32:08] [WARNING] unable to connect to upstream node "node1" (node ID: 1)
    [2017-08-24 23:32:08] [INFO] checking state of node 1, 1 of 5 attempts
    [2017-08-24 23:32:08] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-08-24 23:32:09] [INFO] checking state of node 1, 2 of 5 attempts
    [2017-08-24 23:32:09] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-08-24 23:32:10] [INFO] checking state of node 1, 3 of 5 attempts
    [2017-08-24 23:32:10] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-08-24 23:32:11] [INFO] checking state of node 1, 4 of 5 attempts
    [2017-08-24 23:32:11] [INFO] sleeping 1 seconds until next reconnection attempt
    [2017-08-24 23:32:12] [INFO] checking state of node 1, 5 of 5 attempts
    [2017-08-24 23:32:12] [WARNING] unable to reconnect to node 1 after 5 attempts
    INFO:  setting voting term to 1
    INFO:  node 2 is candidate
    INFO:  node 3 has received request from node 2 for electoral term 1 (our term: 0)
    [2017-08-24 23:32:12] [NOTICE] this node is the winner, will now promote self and inform other nodes
    INFO: connecting to standby database
    NOTICE: promoting standby
    DETAIL: promoting server using '/home/barwick/devel/builds/HEAD/bin/pg_ctl -l /tmp/postgres.5602.log -w -D '/tmp/repmgr-test/node_2/data' promote'
    INFO: reconnecting to promoted server
    NOTICE: STANDBY PROMOTE successful
    DETAIL: node 2 was successfully promoted to primary
    INFO:  node 3 received notification to follow node 2
    [2017-08-24 23:32:13] [INFO] switching to primary monitoring mode</programlisting>
 </para>
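 <para>
  The five reconnection attempts at one-second intervals in the log extract are
  consistent with settings along the following lines in <filename>repmgr.conf</filename>.
  This is an assumption inferred from the log output, as the configuration used for the
  demonstration is not shown here; production values would normally be more conservative:
  <programlisting>
    # assumed values, inferred from the "N of 5 attempts" and "sleeping 1 seconds" log lines
    reconnect_attempts=5
    reconnect_interval=1</programlisting>
 </para>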
 <para>
  The cluster status will now look like this, with the original primary (<literal>node1</literal>)
  marked as failed, and standby <literal>node3</literal> now following the new primary
  (<literal>node2</literal>):
  <programlisting>
    $ repmgr -f /etc/repmgr.conf cluster show
     ID | Name  | Role    | Status    | Upstream | Location | Connection string
    ----+-------+---------+-----------+----------+----------+----------------------------------------------------
     1  | node1 | primary | - failed  |          | default  | host=node1 dbname=repmgr user=repmgr
     2  | node2 | primary | * running |          | default  | host=node2 dbname=repmgr user=repmgr
     3  | node3 | standby |   running | node2    | default  | host=node3 dbname=repmgr user=repmgr</programlisting>
 </para>
 <para>
  <command>repmgr cluster event</command> will display a summary of what happened to each server
  during the failover:
  <programlisting>
    $ repmgr -f /etc/repmgr.conf cluster event
     Node ID | Name  | Event                    | OK | Timestamp           | Details
    ---------+-------+--------------------------+----+---------------------+-----------------------------------------------------------------------------------
     3       | node3 | repmgrd_failover_follow  | t  | 2017-08-24 23:32:16 | node 3 now following new upstream node 2
     3       | node3 | standby_follow           | t  | 2017-08-24 23:32:16 | node 3 is now attached to node 2
     2       | node2 | repmgrd_failover_promote | t  | 2017-08-24 23:32:13 | node 2 promoted to primary; old primary 1 marked as failed
     2       | node2 | standby_promote          | t  | 2017-08-24 23:32:13 | node 2 was successfully promoted to primary</programlisting>
 </para>
 </chapter>