mirror of
https://github.com/EnterpriseDB/repmgr.git
synced 2026-03-26 08:36:30 +00:00
doc: merge repmgrd pause documentation into overview
@@ -53,12 +53,11 @@
 <!ENTITY repmgrd-overview SYSTEM "repmgrd-overview.sgml">
 <!ENTITY repmgrd-automatic-failover SYSTEM "repmgrd-automatic-failover.sgml">
 <!ENTITY repmgrd-configuration SYSTEM "repmgrd-configuration.sgml">
-<!ENTITY repmgrd-demonstration SYSTEM "repmgrd-demonstration.sgml">
+<!ENTITY repmgrd-operation SYSTEM "repmgrd-operation.sgml">
 <!ENTITY repmgrd-monitoring SYSTEM "repmgrd-monitoring.sgml">
 <!ENTITY repmgrd-degraded-monitoring SYSTEM "repmgrd-degraded-monitoring.sgml">
 <!ENTITY repmgrd-network-split SYSTEM "repmgrd-network-split.sgml">
 <!ENTITY repmgrd-witness-server SYSTEM "repmgrd-witness-server.sgml">
-<!ENTITY repmgrd-pausing SYSTEM "repmgrd-pausing.sgml">
 <!ENTITY repmgrd-notes SYSTEM "repmgrd-notes.sgml">
 <!ENTITY repmgrd-bdr SYSTEM "repmgrd-bdr.sgml">
 
@@ -83,10 +83,9 @@
 &repmgrd-overview;
 &repmgrd-automatic-failover;
 &repmgrd-configuration;
-&repmgrd-demonstration;
+&repmgrd-operation;
 &repmgrd-network-split;
 &repmgrd-witness-server;
-&repmgrd-pausing;
 &repmgrd-degraded-monitoring;
 &repmgrd-monitoring;
 &repmgrd-notes;
@@ -1,17 +1,115 @@
-<chapter id="repmgrd-overview" xreflabel="Overview of repmgrd">
+<chapter id="repmgrd-overview" xreflabel="repmgrd overview">
   <indexterm>
     <primary>repmgrd</primary>
     <secondary>overview</secondary>
   </indexterm>
 
   <title>repmgrd overview</title>
 
   <para>
     <application>repmgrd</application> ("<literal>replication manager daemon</literal>")
     is a management and monitoring daemon which runs
     on each node in a replication cluster. It can automate actions such as
     failover and updating standbys to follow the new primary, as well as
     providing monitoring information about the state of each standby.
   </para>
 
+  <sect1 id="repmgrd-demonstration">
+    <title>repmgrd demonstration</title>
+
+    <para>
+      To demonstrate automatic failover, set up a 3-node replication cluster (one primary
+      and two standbys streaming directly from the primary) so that the cluster looks
+      something like this:
+      <programlisting>
+$ repmgr -f /etc/repmgr.conf cluster show
+ ID | Name  | Role    | Status    | Upstream | Location | Connection string
+----+-------+---------+-----------+----------+----------+--------------------------------------
+ 1  | node1 | primary | * running |          | default  | host=node1 dbname=repmgr user=repmgr
+ 2  | node2 | standby |   running | node1    | default  | host=node2 dbname=repmgr user=repmgr
+ 3  | node3 | standby |   running | node1    | default  | host=node3 dbname=repmgr user=repmgr</programlisting>
+    </para>
+
+    <para>
+      Start <application>repmgrd</application> on each standby and verify that it's running by examining the
+      log output, which at log level <literal>INFO</literal> will look like this:
+      <programlisting>
+[2017-08-24 17:31:00] [NOTICE] using configuration file "/etc/repmgr.conf"
+[2017-08-24 17:31:00] [INFO] connecting to database "host=node2 dbname=repmgr user=repmgr"
+[2017-08-24 17:31:00] [NOTICE] starting monitoring of node <literal>node2</literal> (ID: 2)
+[2017-08-24 17:31:00] [INFO] monitoring connection to upstream node "node1" (node ID: 1)</programlisting>
+    </para>
+
+    <para>
+      Each <application>repmgrd</application> should also have recorded its successful startup as an event:
+      <programlisting>
+$ repmgr -f /etc/repmgr.conf cluster event --event=repmgrd_start
+ Node ID | Name  | Event         | OK | Timestamp           | Details
+---------+-------+---------------+----+---------------------+-------------------------------------------------------------
+ 3       | node3 | repmgrd_start | t  | 2017-08-24 17:35:54 | monitoring connection to upstream node "node1" (node ID: 1)
+ 2       | node2 | repmgrd_start | t  | 2017-08-24 17:35:50 | monitoring connection to upstream node "node1" (node ID: 1)
+ 1       | node1 | repmgrd_start | t  | 2017-08-24 17:35:46 | monitoring cluster primary "node1" (node ID: 1)</programlisting>
+    </para>
+
+    <para>
+      Now stop the current primary server with e.g.:
+      <programlisting>
+pg_ctl -D /var/lib/postgresql/data -m immediate stop</programlisting>
+    </para>
+
+    <para>
+      This will force the primary to shut down straight away, aborting all processes
+      and transactions. This will cause a flurry of activity in the <application>repmgrd</application> log
+      files as each <application>repmgrd</application> detects the failure of the primary and a failover
+      decision is made. This is an extract from the log of a standby server (<literal>node2</literal>)
+      which has promoted to new primary after failure of the original primary (<literal>node1</literal>).
+      <programlisting>
+[2017-08-24 23:32:01] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
+[2017-08-24 23:32:08] [WARNING] unable to connect to upstream node "node1" (node ID: 1)
+[2017-08-24 23:32:08] [INFO] checking state of node 1, 1 of 5 attempts
+[2017-08-24 23:32:08] [INFO] sleeping 1 seconds until next reconnection attempt
+[2017-08-24 23:32:09] [INFO] checking state of node 1, 2 of 5 attempts
+[2017-08-24 23:32:09] [INFO] sleeping 1 seconds until next reconnection attempt
+[2017-08-24 23:32:10] [INFO] checking state of node 1, 3 of 5 attempts
+[2017-08-24 23:32:10] [INFO] sleeping 1 seconds until next reconnection attempt
+[2017-08-24 23:32:11] [INFO] checking state of node 1, 4 of 5 attempts
+[2017-08-24 23:32:11] [INFO] sleeping 1 seconds until next reconnection attempt
+[2017-08-24 23:32:12] [INFO] checking state of node 1, 5 of 5 attempts
+[2017-08-24 23:32:12] [WARNING] unable to reconnect to node 1 after 5 attempts
+INFO: setting voting term to 1
+INFO: node 2 is candidate
+INFO: node 3 has received request from node 2 for electoral term 1 (our term: 0)
+[2017-08-24 23:32:12] [NOTICE] this node is the winner, will now promote self and inform other nodes
+INFO: connecting to standby database
+NOTICE: promoting standby
+DETAIL: promoting server using 'pg_ctl -l /var/log/postgres/startup.log -w -D '/var/lib/pgsql/data' promote'
+INFO: reconnecting to promoted server
+NOTICE: STANDBY PROMOTE successful
+DETAIL: node 2 was successfully promoted to primary
+INFO: node 3 received notification to follow node 2
+[2017-08-24 23:32:13] [INFO] switching to primary monitoring mode</programlisting>
+    </para>
+
+    <para>
+      The cluster status will now look like this, with the original primary (<literal>node1</literal>)
+      marked as inactive, and standby <literal>node3</literal> now following the new primary
+      (<literal>node2</literal>):
+      <programlisting>
+$ repmgr -f /etc/repmgr.conf cluster show
+ ID | Name  | Role    | Status    | Upstream | Location | Connection string
+----+-------+---------+-----------+----------+----------+----------------------------------------------------
+ 1  | node1 | primary | - failed  |          | default  | host=node1 dbname=repmgr user=repmgr
+ 2  | node2 | primary | * running |          | default  | host=node2 dbname=repmgr user=repmgr
+ 3  | node3 | standby |   running | node2    | default  | host=node3 dbname=repmgr user=repmgr</programlisting>
+    </para>
+
+    <para>
+      <command>repmgr cluster event</command> will display a summary of what happened to each server
+      during the failover:
+      <programlisting>
+$ repmgr -f /etc/repmgr.conf cluster event
+ Node ID | Name  | Event                    | OK | Timestamp           | Details
+---------+-------+--------------------------+----+---------------------+-----------------------------------------------------------------------------------
+ 3       | node3 | repmgrd_failover_follow  | t  | 2017-08-24 23:32:16 | node 3 now following new upstream node 2
+ 3       | node3 | standby_follow           | t  | 2017-08-24 23:32:16 | node 3 is now attached to node 2
+ 2       | node2 | repmgrd_failover_promote | t  | 2017-08-24 23:32:13 | node 2 promoted to primary; old primary 1 marked as failed
+ 2       | node2 | standby_promote          | t  | 2017-08-24 23:32:13 | node 2 was successfully promoted to primary</programlisting>
+    </para>
+
+  </sect1>
 </chapter>
@@ -1,178 +0,0 @@
-<chapter id="repmgrd-pausing" xreflabel="Pausing repmgrd">
-
-  <indexterm>
-    <primary>repmgrd</primary>
-    <secondary>pausing</secondary>
-  </indexterm>
-
-  <indexterm>
-    <primary>pausing repmgrd</primary>
-  </indexterm>
-
-  <title>Pausing repmgrd</title>
-
-  <para>
-    In normal operation, <application>repmgrd</application> monitors the state of the
-    PostgreSQL node it is running on, and will take appropriate action if problems
-    are detected, e.g. (if so configured) promote the node to primary, if the existing
-    primary has been determined as failed.
-  </para>
-
-  <para>
-    However, <application>repmgrd</application> is unable to distinguish between
-    planned outages (such as performing a <link linkend="performing-switchover">switchover</link>
-    or installing PostgreSQL maintenance releases), and an actual server outage. In versions prior to
-    &repmgr; 4.2 it was necessary to stop <application>repmgrd</application> on all nodes (or at least
-    on all nodes where <application>repmgrd</application> is
-    <link linkend="repmgrd-automatic-failover">configured for automatic failover</link>)
-    to prevent <application>repmgrd</application> from making unintentional changes to the
-    replication cluster.
-  </para>
-
-  <para>
-    From <link linkend="release-4.2">&repmgr; 4.2</link>, <application>repmgrd</application>
-    can now be "paused", i.e. instructed not to take any action such as performing a failover.
-    This can be done from any node in the cluster, removing the need to stop/restart
-    each <application>repmgrd</application> individually.
-  </para>
-
-  <note>
-    <para>
-      For major PostgreSQL upgrades, e.g. from PostgreSQL 10 to PostgreSQL 11,
-      <application>repmgrd</application> should be shut down completely and only started up
-      once the &repmgr; packages for the new PostgreSQL major version have been installed.
-    </para>
-  </note>
-
-  <sect1 id="repmgrd-pausing-prerequisites">
-    <title>Prerequisites for pausing <application>repmgrd</application></title>
-    <para>
-      In order to be able to pause/unpause <application>repmgrd</application>, the following
-      prerequisites must be met:
-      <itemizedlist spacing="compact" mark="bullet">
-
-        <listitem>
-          <simpara><link linkend="release-4.2">&repmgr; 4.2</link> or later must be installed on all nodes.</simpara>
-        </listitem>
-
-        <listitem>
-          <simpara>The same major &repmgr; version (e.g. 4.2) must be installed on all nodes (and preferably the same minor version).</simpara>
-        </listitem>
-
-        <listitem>
-          <simpara>
-            PostgreSQL on all nodes must be accessible from the node where the
-            <literal>pause</literal>/<literal>unpause</literal> operation is executed, using the
-            <varname>conninfo</varname> string shown by <link linkend="repmgr-cluster-show"><command>repmgr cluster show</command></link>.
-          </simpara>
-        </listitem>
-      </itemizedlist>
-    </para>
-    <note>
-      <para>
-        These conditions are required for normal &repmgr; operation in any case.
-      </para>
-    </note>
-
-  </sect1>
-
-  <sect1 id="repmgrd-pausing-execution">
-    <title>Pausing/unpausing <application>repmgrd</application></title>
-    <para>
-      To pause <application>repmgrd</application>, execute <link linkend="repmgr-daemon-pause"><command>repmgr daemon pause</command></link>, e.g.:
-      <programlisting>
-$ repmgr -f /etc/repmgr.conf daemon pause
-NOTICE: node 1 (node1) paused
-NOTICE: node 2 (node2) paused
-NOTICE: node 3 (node3) paused</programlisting>
-    </para>
-    <para>
-      The state of <application>repmgrd</application> on each node can be checked with
-      <link linkend="repmgr-daemon-status"><command>repmgr daemon status</command></link>, e.g.:
-      <programlisting>$ repmgr -f /etc/repmgr.conf daemon status
- ID | Name  | Role    | Status  | repmgrd | PID  | Paused?
-----+-------+---------+---------+---------+------+---------
- 1  | node1 | primary | running | running | 7851 | yes
- 2  | node2 | standby | running | running | 7889 | yes
- 3  | node3 | standby | running | running | 7918 | yes</programlisting>
-    </para>
-
-    <note>
-      <para>
-        If executing a switchover with <link linkend="repmgr-standby-switchover"><command>repmgr standby switchover</command></link>,
-        &repmgr; will automatically pause/unpause <application>repmgrd</application> as part of the switchover process.
-      </para>
-    </note>
-
-    <para>
-      If the primary (in this example, <literal>node1</literal>) is stopped, <application>repmgrd</application>
-      running on one of the standbys (here: <literal>node2</literal>) will react like this:
-      <programlisting>
-[2018-09-20 12:22:21] [WARNING] unable to connect to upstream node "node1" (node ID: 1)
-[2018-09-20 12:22:21] [INFO] checking state of node 1, 1 of 5 attempts
-[2018-09-20 12:22:21] [INFO] sleeping 1 seconds until next reconnection attempt
-...
-[2018-09-20 12:22:24] [INFO] sleeping 1 seconds until next reconnection attempt
-[2018-09-20 12:22:25] [INFO] checking state of node 1, 5 of 5 attempts
-[2018-09-20 12:22:25] [WARNING] unable to reconnect to node 1 after 5 attempts
-[2018-09-20 12:22:25] [NOTICE] node is paused
-[2018-09-20 12:22:33] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in degraded state
-[2018-09-20 12:22:33] [DETAIL] repmgrd paused by administrator
-[2018-09-20 12:22:33] [HINT] execute "repmgr daemon unpause" to resume normal failover mode</programlisting>
-    </para>
-    <para>
-      If the primary becomes available again (e.g. following a software upgrade), <application>repmgrd</application>
-      will automatically reconnect, e.g.:
-      <programlisting>
-[2018-09-20 13:12:41] [NOTICE] reconnected to upstream node 1 after 8 seconds, resuming monitoring</programlisting>
-    </para>
-
-    <para>
-      To unpause <application>repmgrd</application>, execute <link linkend="repmgr-daemon-unpause"><command>repmgr daemon unpause</command></link>, e.g.:
-      <programlisting>
-$ repmgr -f /etc/repmgr.conf daemon unpause
-NOTICE: node 1 (node1) unpaused
-NOTICE: node 2 (node2) unpaused
-NOTICE: node 3 (node3) unpaused</programlisting>
-    </para>
-
-    <note>
-      <para>
-        If the previous primary is no longer accessible when <application>repmgrd</application>
-        is unpaused, no failover action will be taken. Instead, a new primary must be manually promoted using
-        <link linkend="repmgr-standby-promote"><command>repmgr standby promote</command></link>,
-        and any standbys attached to the new primary with
-        <link linkend="repmgr-standby-follow"><command>repmgr standby follow</command></link>.
-      </para>
-      <para>
-        This is to prevent <link linkend="repmgr-daemon-unpause"><command>repmgr daemon unpause</command></link>
-        resulting in the automatic promotion of a new primary, which may be a problem particularly
-        in larger clusters, where <application>repmgrd</application> could select a different promotion
-        candidate to the one intended by the administrator.
-      </para>
-    </note>
-
-    <sect2 id="repmgrd-pausing-details">
-      <title>Details on the <application>repmgrd</application> pausing mechanism</title>
-
-      <para>
-        The pause state of each node will be preserved across a PostgreSQL restart.
-      </para>
-
-      <para>
-        <link linkend="repmgr-daemon-pause"><command>repmgr daemon pause</command></link> and
-        <link linkend="repmgr-daemon-unpause"><command>repmgr daemon unpause</command></link> can be
-        executed even if <application>repmgrd</application> is not running; in this case,
-        <application>repmgrd</application> will start up in whichever pause state has been set.
-      </para>
-      <note>
-        <para>
-          <link linkend="repmgr-daemon-pause"><command>repmgr daemon pause</command></link> and
-          <link linkend="repmgr-daemon-unpause"><command>repmgr daemon unpause</command></link>
-          <emphasis>do not</emphasis> stop/start <application>repmgrd</application>.
-        </para>
-      </note>
-    </sect2>
-  </sect1>
-</chapter>