mirror of
https://github.com/EnterpriseDB/repmgr.git
synced 2026-03-26 08:36:30 +00:00
doc: merge repmgrd pause documentation into overview
@@ -53,12 +53,11 @@
 <!ENTITY repmgrd-overview SYSTEM "repmgrd-overview.sgml">
 <!ENTITY repmgrd-automatic-failover SYSTEM "repmgrd-automatic-failover.sgml">
 <!ENTITY repmgrd-configuration SYSTEM "repmgrd-configuration.sgml">
-<!ENTITY repmgrd-demonstration SYSTEM "repmgrd-demonstration.sgml">
+<!ENTITY repmgrd-operation SYSTEM "repmgrd-operation.sgml">
 <!ENTITY repmgrd-monitoring SYSTEM "repmgrd-monitoring.sgml">
 <!ENTITY repmgrd-degraded-monitoring SYSTEM "repmgrd-degraded-monitoring.sgml">
 <!ENTITY repmgrd-network-split SYSTEM "repmgrd-network-split.sgml">
 <!ENTITY repmgrd-witness-server SYSTEM "repmgrd-witness-server.sgml">
-<!ENTITY repmgrd-pausing SYSTEM "repmgrd-pausing.sgml">
 <!ENTITY repmgrd-notes SYSTEM "repmgrd-notes.sgml">
 <!ENTITY repmgrd-bdr SYSTEM "repmgrd-bdr.sgml">
 
@@ -83,10 +83,9 @@
 &repmgrd-overview;
 &repmgrd-automatic-failover;
 &repmgrd-configuration;
-&repmgrd-demonstration;
+&repmgrd-operation;
 &repmgrd-network-split;
 &repmgrd-witness-server;
-&repmgrd-pausing;
 &repmgrd-degraded-monitoring;
 &repmgrd-monitoring;
 &repmgrd-notes;
@@ -1,17 +1,115 @@
-<chapter id="repmgrd-overview" xreflabel="Overview of repmgrd">
+<chapter id="repmgrd-overview" xreflabel="repmgrd overview">
   <indexterm>
     <primary>repmgrd</primary>
     <secondary>overview</secondary>
   </indexterm>
 
   <title>repmgrd overview</title>
 
   <para>
     <application>repmgrd</application> ("<literal>replication manager daemon</literal>")
     is a management and monitoring daemon which runs
     on each node in a replication cluster. It can automate actions such as
     failover and updating standbys to follow the new primary, as well as
     providing monitoring information about the state of each standby.
   </para>
 
+  <sect1 id="repmgrd-demonstration">
+    <title>repmgrd demonstration</title>
+
+    <para>
+      To demonstrate automatic failover, set up a 3-node replication cluster (one primary
+      and two standbys streaming directly from the primary) so that the cluster looks
+      something like this:
+      <programlisting>
+$ repmgr -f /etc/repmgr.conf cluster show
+ ID | Name  | Role    | Status    | Upstream | Location | Connection string
+----+-------+---------+-----------+----------+----------+--------------------------------------
+ 1  | node1 | primary | * running |          | default  | host=node1 dbname=repmgr user=repmgr
+ 2  | node2 | standby |   running | node1    | default  | host=node2 dbname=repmgr user=repmgr
+ 3  | node3 | standby |   running | node1    | default  | host=node3 dbname=repmgr user=repmgr</programlisting>
+    </para>
+
+    <para>
+      Start <application>repmgrd</application> on each standby and verify that it's running by examining the
+      log output, which at log level <literal>INFO</literal> will look like this:
+      <programlisting>
+[2017-08-24 17:31:00] [NOTICE] using configuration file "/etc/repmgr.conf"
+[2017-08-24 17:31:00] [INFO] connecting to database "host=node2 dbname=repmgr user=repmgr"
+[2017-08-24 17:31:00] [NOTICE] starting monitoring of node <literal>node2</literal> (ID: 2)
+[2017-08-24 17:31:00] [INFO] monitoring connection to upstream node "node1" (node ID: 1)</programlisting>
+    </para>
+
+    <para>
+      Each <application>repmgrd</application> should also have recorded its successful startup as an event:
+      <programlisting>
+$ repmgr -f /etc/repmgr.conf cluster event --event=repmgrd_start
+ Node ID | Name  | Event         | OK | Timestamp           | Details
+---------+-------+---------------+----+---------------------+-------------------------------------------------------------
+ 3       | node3 | repmgrd_start | t  | 2017-08-24 17:35:54 | monitoring connection to upstream node "node1" (node ID: 1)
+ 2       | node2 | repmgrd_start | t  | 2017-08-24 17:35:50 | monitoring connection to upstream node "node1" (node ID: 1)
+ 1       | node1 | repmgrd_start | t  | 2017-08-24 17:35:46 | monitoring cluster primary "node1" (node ID: 1)</programlisting>
+    </para>
+
+    <para>
+      Now stop the current primary server with e.g.:
+      <programlisting>
+pg_ctl -D /var/lib/postgresql/data -m immediate stop</programlisting>
+    </para>
+
+    <para>
+      This will force the primary to shut down straight away, aborting all processes
+      and transactions. This will cause a flurry of activity in the <application>repmgrd</application> log
+      files as each <application>repmgrd</application> detects the failure of the primary and a failover
+      decision is made. This is an extract from the log of a standby server (<literal>node2</literal>)
+      which has promoted to new primary after failure of the original primary (<literal>node1</literal>).
+      <programlisting>
+[2017-08-24 23:32:01] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
+[2017-08-24 23:32:08] [WARNING] unable to connect to upstream node "node1" (node ID: 1)
+[2017-08-24 23:32:08] [INFO] checking state of node 1, 1 of 5 attempts
+[2017-08-24 23:32:08] [INFO] sleeping 1 seconds until next reconnection attempt
+[2017-08-24 23:32:09] [INFO] checking state of node 1, 2 of 5 attempts
+[2017-08-24 23:32:09] [INFO] sleeping 1 seconds until next reconnection attempt
+[2017-08-24 23:32:10] [INFO] checking state of node 1, 3 of 5 attempts
+[2017-08-24 23:32:10] [INFO] sleeping 1 seconds until next reconnection attempt
+[2017-08-24 23:32:11] [INFO] checking state of node 1, 4 of 5 attempts
+[2017-08-24 23:32:11] [INFO] sleeping 1 seconds until next reconnection attempt
+[2017-08-24 23:32:12] [INFO] checking state of node 1, 5 of 5 attempts
+[2017-08-24 23:32:12] [WARNING] unable to reconnect to node 1 after 5 attempts
+INFO: setting voting term to 1
+INFO: node 2 is candidate
+INFO: node 3 has received request from node 2 for electoral term 1 (our term: 0)
+[2017-08-24 23:32:12] [NOTICE] this node is the winner, will now promote self and inform other nodes
+INFO: connecting to standby database
+NOTICE: promoting standby
+DETAIL: promoting server using 'pg_ctl -l /var/log/postgres/startup.log -w -D '/var/lib/pgsql/data' promote'
+INFO: reconnecting to promoted server
+NOTICE: STANDBY PROMOTE successful
+DETAIL: node 2 was successfully promoted to primary
+INFO: node 3 received notification to follow node 2
+[2017-08-24 23:32:13] [INFO] switching to primary monitoring mode</programlisting>
+    </para>
+
+    <para>
+      The cluster status will now look like this, with the original primary (<literal>node1</literal>)
+      marked as inactive, and standby <literal>node3</literal> now following the new primary
+      (<literal>node2</literal>):
+      <programlisting>
+$ repmgr -f /etc/repmgr.conf cluster show
+ ID | Name  | Role    | Status    | Upstream | Location | Connection string
+----+-------+---------+-----------+----------+----------+----------------------------------------------------
+ 1  | node1 | primary | - failed  |          | default  | host=node1 dbname=repmgr user=repmgr
+ 2  | node2 | primary | * running |          | default  | host=node2 dbname=repmgr user=repmgr
+ 3  | node3 | standby |   running | node2    | default  | host=node3 dbname=repmgr user=repmgr</programlisting>
+    </para>
+
+    <para>
+      <command>repmgr cluster event</command> will display a summary of what happened to each server
+      during the failover:
+      <programlisting>
+$ repmgr -f /etc/repmgr.conf cluster event
+ Node ID | Name  | Event                    | OK | Timestamp           | Details
+---------+-------+--------------------------+----+---------------------+-----------------------------------------------------------------------------------
+ 3       | node3 | repmgrd_failover_follow  | t  | 2017-08-24 23:32:16 | node 3 now following new upstream node 2
+ 3       | node3 | standby_follow           | t  | 2017-08-24 23:32:16 | node 3 is now attached to node 2
+ 2       | node2 | repmgrd_failover_promote | t  | 2017-08-24 23:32:13 | node 2 promoted to primary; old primary 1 marked as failed
+ 2       | node2 | standby_promote          | t  | 2017-08-24 23:32:13 | node 2 was successfully promoted to primary</programlisting>
+    </para>
+
+  </sect1>
 </chapter>
@@ -1,178 +0,0 @@
-<chapter id="repmgrd-pausing" xreflabel="Pausing repmgrd">
-
-  <indexterm>
-    <primary>repmgrd</primary>
-    <secondary>pausing</secondary>
-  </indexterm>
-
-  <indexterm>
-    <primary>pausing repmgrd</primary>
-  </indexterm>
-
-  <title>Pausing repmgrd</title>
-
-  <para>
-    In normal operation, <application>repmgrd</application> monitors the state of the
-    PostgreSQL node it is running on, and will take appropriate action if problems
-    are detected, e.g. (if so configured) promote the node to primary, if the existing
-    primary has been determined as failed.
-  </para>
-
-  <para>
-    However, <application>repmgrd</application> is unable to distinguish between
-    planned outages (such as performing a <link linkend="performing-switchover">switchover</link>
-    or installing PostgreSQL maintenance releases), and an actual server outage. In versions prior to
-    &repmgr; 4.2 it was necessary to stop <application>repmgrd</application> on all nodes (or at least
-    on all nodes where <application>repmgrd</application> is
-    <link linkend="repmgrd-automatic-failover">configured for automatic failover</link>)
-    to prevent <application>repmgrd</application> from making unintentional changes to the
-    replication cluster.
-  </para>
-
-  <para>
-    From <link linkend="release-4.2">&repmgr; 4.2</link>, <application>repmgrd</application>
-    can now be "paused", i.e. instructed not to take any action such as performing a failover.
-    This can be done from any node in the cluster, removing the need to stop/restart
-    each <application>repmgrd</application> individually.
-  </para>
-
-  <note>
-    <para>
-      For major PostgreSQL upgrades, e.g. from PostgreSQL 10 to PostgreSQL 11,
-      <application>repmgrd</application> should be shut down completely and only started up
-      once the &repmgr; packages for the new PostgreSQL major version have been installed.
-    </para>
-  </note>
-
-  <sect1 id="repmgrd-pausing-prerequisites">
-    <title>Prerequisites for pausing <application>repmgrd</application></title>
-    <para>
-      In order to be able to pause/unpause <application>repmgrd</application>, the following
-      prerequisites must be met:
-      <itemizedlist spacing="compact" mark="bullet">
-
-        <listitem>
-          <simpara><link linkend="release-4.2">&repmgr; 4.2</link> or later must be installed on all nodes.</simpara>
-        </listitem>
-
-        <listitem>
-          <simpara>The same major &repmgr; version (e.g. 4.2) must be installed on all nodes (and preferably the same minor version).</simpara>
-        </listitem>
-
-        <listitem>
-          <simpara>
-            PostgreSQL on all nodes must be accessible from the node where the
-            <literal>pause</literal>/<literal>unpause</literal> operation is executed, using the
-            <varname>conninfo</varname> string shown by <link linkend="repmgr-cluster-show"><command>repmgr cluster show</command></link>.
-          </simpara>
-        </listitem>
-      </itemizedlist>
-    </para>
-    <note>
-      <para>
-        These conditions are required for normal &repmgr; operation in any case.
-      </para>
-    </note>
-
-  </sect1>
-
-  <sect1 id="repmgrd-pausing-execution">
-    <title>Pausing/unpausing <application>repmgrd</application></title>
-    <para>
-      To pause <application>repmgrd</application>, execute <link linkend="repmgr-daemon-pause"><command>repmgr daemon pause</command></link>, e.g.:
-      <programlisting>
-$ repmgr -f /etc/repmgr.conf daemon pause
-NOTICE: node 1 (node1) paused
-NOTICE: node 2 (node2) paused
-NOTICE: node 3 (node3) paused</programlisting>
-    </para>
-    <para>
-      The state of <application>repmgrd</application> on each node can be checked with
-      <link linkend="repmgr-daemon-status"><command>repmgr daemon status</command></link>, e.g.:
-      <programlisting>$ repmgr -f /etc/repmgr.conf daemon status
- ID | Name  | Role    | Status  | repmgrd | PID  | Paused?
-----+-------+---------+---------+---------+------+---------
- 1  | node1 | primary | running | running | 7851 | yes
- 2  | node2 | standby | running | running | 7889 | yes
- 3  | node3 | standby | running | running | 7918 | yes</programlisting>
-    </para>
-
-    <note>
-      <para>
-        If executing a switchover with <link linkend="repmgr-standby-switchover"><command>repmgr standby switchover</command></link>,
-        &repmgr; will automatically pause/unpause <application>repmgrd</application> as part of the switchover process.
-      </para>
-    </note>
-
-    <para>
-      If the primary (in this example, <literal>node1</literal>) is stopped, <application>repmgrd</application>
-      running on one of the standbys (here: <literal>node2</literal>) will react like this:
-      <programlisting>
-[2018-09-20 12:22:21] [WARNING] unable to connect to upstream node "node1" (node ID: 1)
-[2018-09-20 12:22:21] [INFO] checking state of node 1, 1 of 5 attempts
-[2018-09-20 12:22:21] [INFO] sleeping 1 seconds until next reconnection attempt
-...
-[2018-09-20 12:22:24] [INFO] sleeping 1 seconds until next reconnection attempt
-[2018-09-20 12:22:25] [INFO] checking state of node 1, 5 of 5 attempts
-[2018-09-20 12:22:25] [WARNING] unable to reconnect to node 1 after 5 attempts
-[2018-09-20 12:22:25] [NOTICE] node is paused
-[2018-09-20 12:22:33] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in degraded state
-[2018-09-20 12:22:33] [DETAIL] repmgrd paused by administrator
-[2018-09-20 12:22:33] [HINT] execute "repmgr daemon unpause" to resume normal failover mode</programlisting>
-    </para>
-    <para>
-      If the primary becomes available again (e.g. following a software upgrade), <application>repmgrd</application>
-      will automatically reconnect, e.g.:
-      <programlisting>
-[2018-09-20 13:12:41] [NOTICE] reconnected to upstream node 1 after 8 seconds, resuming monitoring</programlisting>
-    </para>
-
-    <para>
-      To unpause <application>repmgrd</application>, execute <link linkend="repmgr-daemon-unpause"><command>repmgr daemon unpause</command></link>, e.g.:
-      <programlisting>
-$ repmgr -f /etc/repmgr.conf daemon unpause
-NOTICE: node 1 (node1) unpaused
-NOTICE: node 2 (node2) unpaused
-NOTICE: node 3 (node3) unpaused</programlisting>
-    </para>
-
-    <note>
-      <para>
-        If the previous primary is no longer accessible when <application>repmgrd</application>
-        is unpaused, no failover action will be taken. Instead, a new primary must be manually promoted using
-        <link linkend="repmgr-standby-promote"><command>repmgr standby promote</command></link>,
-        and any standbys attached to the new primary with
-        <link linkend="repmgr-standby-follow"><command>repmgr standby follow</command></link>.
-      </para>
-      <para>
-        This is to prevent <link linkend="repmgr-daemon-unpause"><command>repmgr daemon unpause</command></link>
-        resulting in the automatic promotion of a new primary, which may be a problem particularly
-        in larger clusters, where <application>repmgrd</application> could select a different promotion
-        candidate to the one intended by the administrator.
-      </para>
-    </note>
-
-    <sect2 id="repmgrd-pausing-details">
-      <title>Details on the <application>repmgrd</application> pausing mechanism</title>
-
-      <para>
-        The pause state of each node will be preserved across a PostgreSQL restart.
-      </para>
-
-      <para>
-        <link linkend="repmgr-daemon-pause"><command>repmgr daemon pause</command></link> and
-        <link linkend="repmgr-daemon-unpause"><command>repmgr daemon unpause</command></link> can be
-        executed even if <application>repmgrd</application> is not running; in this case,
-        <application>repmgrd</application> will start up in whichever pause state has been set.
-      </para>
-      <note>
-        <para>
-          <link linkend="repmgr-daemon-pause"><command>repmgr daemon pause</command></link> and
-          <link linkend="repmgr-daemon-unpause"><command>repmgr daemon unpause</command></link>
-          <emphasis>do not</emphasis> stop/start <application>repmgrd</application>.
-        </para>
-      </note>
-    </sect2>
-  </sect1>
-</chapter>