Add more repmgrd documentation

2026-05-31 19:39:04 +00:00 · 2017-10-05 15:59:39 +09:00
parent cf1e17d758
commit 08878831fe
6 changed files with 137 additions and 1 deletions
@@ -49,6 +49,9 @@
 <!ENTITY repmgrd-configuration SYSTEM "repmgrd-configuration.sgml">
 <!ENTITY repmgrd-demonstration SYSTEM "repmgrd-demonstration.sgml">
 <!ENTITY repmgrd-monitoring SYSTEM "repmgrd-monitoring.sgml">
+<!ENTITY repmgrd-degraded-monitoring SYSTEM "repmgrd-degraded-monitoring.sgml">
+<!ENTITY repmgrd-cascading-replication SYSTEM "repmgrd-cascading-replication.sgml">
+<!ENTITY repmgrd-network-split SYSTEM "repmgrd-network-split.sgml">

 <!ENTITY repmgr-primary-register SYSTEM "repmgr-primary-register.sgml">
 <!ENTITY repmgr-primary-unregister SYSTEM "repmgr-primary-unregister.sgml">
@@ -16,7 +16,8 @@
  <note>
   <simpara>
    Monitoring history will only be written if <command>repmgrd</command> is active, and
-    <varname>monitoring_history</varname> is set to <literal>true</literal> in <filename>repmgr.conf</filename>.
+    <varname>monitoring_history</varname> is set to <literal>true</literal> in
+    <filename>repmgr.conf</filename>.
   </simpara>
  </note>
 </chapter>
@@ -81,6 +81,9 @@
  &repmgrd-automatic-failover;
  &repmgrd-configuration;
  &repmgrd-demonstration;
+  &repmgrd-cascading-replication;
+  &repmgrd-network-split;
+  &repmgrd-degraded-monitoring;
  &repmgrd-monitoring;
 </part>

@@ -0,0 +1,17 @@
+<chapter id="repmgrd-cascading-replication">
+ <title>repmgrd and cascading replication</title>
+ <para>
+  Cascading replication, where a standby connects to another upstream node
+  rather than to the primary server itself, was introduced in PostgreSQL 9.2.
+  &repmgr; and <command>repmgrd</command> support cascading replication by keeping
+  track of the relationship between standby servers: each node record is stored
+  with the node id of its upstream ("parent") server (except, of course, the
+  primary server, which has no upstream).
+ </para>
+ <para>
+  In a failover situation where the primary node fails and a top-level standby
+  is promoted, a standby connected to another standby will not be affected
+  and will continue working as normal (even if the upstream standby it's connected
+  to becomes the primary node). If, however, the node's direct upstream fails,
+  the "cascaded standby" will attempt to reconnect to that node's parent.
+ </para>
+</chapter>
@@ -0,0 +1,69 @@
+<chapter id="repmgrd-degraded-monitoring">
+ <title>"degraded monitoring" mode</title>
+ <para>
+  In certain circumstances, <command>repmgrd</command> is not able to fulfill its primary mission
+  of monitoring the node's upstream server. In these cases it enters "degraded
+  monitoring" mode, where <command>repmgrd</command> remains active but waits for the situation
+  to be resolved.
+ </para>
+ <para>
+  Situations where this happens are:
+  <itemizedlist spacing="compact" mark="bullet">
+
+   <listitem>
+    <simpara>a failover situation has occurred, but no nodes in the primary node's location are visible</simpara>
+   </listitem>
+
+   <listitem>
+    <simpara>a failover situation has occurred, but no promotion candidate is available</simpara>
+   </listitem>
+
+   <listitem>
+    <simpara>a failover situation has occurred, but the promotion candidate could not be promoted</simpara>
+   </listitem>
+
+   <listitem>
+    <simpara>a failover situation has occurred, but the node was unable to follow the new primary</simpara>
+   </listitem>
+
+   <listitem>
+    <simpara>a failover situation has occurred, but no primary has become available</simpara>
+   </listitem>
+
+   <listitem>
+    <simpara>a failover situation has occurred, but automatic failover is not enabled for the node</simpara>
+   </listitem>
+
+   <listitem>
+    <simpara><command>repmgrd</command> is monitoring the primary node, but it is not available</simpara>
+   </listitem>
+  </itemizedlist>
+ </para>
+
+ <para>
+  Example output in a situation where there is only one standby with <literal>failover=manual</literal>,
+  and the primary node is unavailable (but is later restarted):
+  <programlisting>
+    [2017-08-29 10:59:19] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state (automatic failover disabled)
+    [2017-08-29 10:59:33] [WARNING] unable to connect to upstream node "node1" (node ID: 1)
+    [2017-08-29 10:59:33] [INFO] checking state of node 1, 1 of 5 attempts
+    [2017-08-29 10:59:33] [INFO] sleeping 1 seconds until next reconnection attempt
+    (...)
+    [2017-08-29 10:59:37] [INFO] checking state of node 1, 5 of 5 attempts
+    [2017-08-29 10:59:37] [WARNING] unable to reconnect to node 1 after 5 attempts
+    [2017-08-29 10:59:37] [NOTICE] this node is not configured for automatic failover so will not be considered as promotion candidate
+    [2017-08-29 10:59:37] [NOTICE] no other nodes are available as promotion candidate
+    [2017-08-29 10:59:37] [HINT] use "repmgr standby promote" to manually promote this node
+    [2017-08-29 10:59:37] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in degraded state (automatic failover disabled)
+    [2017-08-29 10:59:53] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in degraded state (automatic failover disabled)
+    [2017-08-29 11:00:45] [NOTICE] reconnected to upstream node 1 after 68 seconds, resuming monitoring
+    [2017-08-29 11:00:57] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state (automatic failover disabled)</programlisting>
+
+ </para>
+ <para>
+  By default, <command>repmgrd</command> will continue in degraded monitoring mode indefinitely.
+  However, a timeout (in seconds) can be set with <varname>degraded_monitoring_timeout</varname>,
+  after which <command>repmgrd</command> will terminate.
+
+ </para>
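+ <para>
+  As an illustration only, a timeout of one hour could be set in
+  <filename>repmgr.conf</filename> as follows (the value shown is an arbitrary
+  assumption, not a recommended setting):
+  <programlisting>
+    degraded_monitoring_timeout=3600</programlisting>
+ </para>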
+
+</chapter>
@@ -0,0 +1,43 @@
+<chapter id="repmgrd-network-split">
+ <title>Handling network splits with repmgrd</title>
+ <para>
+  A common pattern for replication cluster setups is to spread servers over
+  more than one datacentre. This can provide benefits such as geographically
+  distributed read replicas and DR (disaster recovery) capability. However,
+  this also means there is a risk of disconnection at network level between
+  datacentre locations, which would result in a split-brain scenario if
+  servers in a secondary datacentre were no longer able to see the primary
+  in the main datacentre and promoted a standby among themselves.
+ </para>
+ <para>
+  Previous &repmgr; versions used the concept of a "witness server" to
+  artificially create a quorum of servers in a particular location, ensuring
+  that nodes in another location will not elect a new primary if they
+  are unable to see the majority of nodes. However this approach does not
+  scale well, particularly with more complex replication setups, e.g.
+  where the majority of nodes are located outside of the primary datacentre.
+  It also means the <literal>witness</literal> node needs to be managed as an
+  extra PostgreSQL instance outside of the main replication cluster, which
+  adds administrative and programming complexity.
+ </para>
+ <para>
+  <literal>repmgr4</literal> introduces the concept of <literal>location</literal>:
+  each node is associated with an arbitrary location string (default is
+  <literal>default</literal>); this is set in <filename>repmgr.conf</filename>, e.g.:
+  <programlisting>
+    node_id=1
+    node_name=node1
+    conninfo='host=node1 user=repmgr dbname=repmgr connect_timeout=2'
+    data_directory='/var/lib/postgresql/data'
+    location='dc1'</programlisting>
+ </para>
+ <para>
+  In a failover situation, <command>repmgrd</command> will check if any servers in the
+  same location as the current primary node are visible.  If not, <command>repmgrd</command>
+  will assume a network interruption and not promote any node in any
+  other location (it will however enter <xref linkend="repmgrd-degraded-monitoring"> mode until
+  a primary becomes visible).
+ </para>
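+ <para>
+  As a sketch, a standby in a second datacentre might be configured as
+  follows; the node ID, host name and location string here are hypothetical
+  and chosen only to contrast with the <literal>dc1</literal> example above:
+  <programlisting>
+    node_id=3
+    node_name=node3
+    conninfo='host=node3 user=repmgr dbname=repmgr connect_timeout=2'
+    data_directory='/var/lib/postgresql/data'
+    location='dc2'</programlisting>
+  If the primary fails and no node in <literal>dc1</literal> is visible from
+  this node, the location check applies and no node in
+  <literal>dc2</literal> is promoted.
+ </para>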
+
+</chapter>
+