<chapter id="repmgrd-overview" xreflabel="repmgrd overview">

 <title>repmgrd overview</title>

 <indexterm>
  <primary>repmgrd</primary>
  <secondary>overview</secondary>
 </indexterm>

 <para>
  &repmgrd; ("<literal>replication manager daemon</literal>")
  is a management and monitoring daemon which runs
  on each node in a replication cluster. It can automate actions such as
  failover and updating standbys to follow the new primary, as well as
  providing monitoring information about the state of each standby.
 </para>

 <para>
  &repmgrd; is designed to be straightforward to set up
  and does not require additional external infrastructure.
 </para>

 <para>
  Functionality provided by &repmgrd; includes:
  <itemizedlist spacing="compact" mark="bullet">

   <listitem>
    <simpara>
     wide range of <link linkend="repmgrd-basic-configuration">configuration options</link>
    </simpara>
   </listitem>

   <listitem>
    <simpara>
     option to execute custom scripts ("<link linkend="event-notifications">event notifications</link>")
     at different points in the failover sequence
    </simpara>
   </listitem>

   <listitem>
    <simpara>
     ability to <link linkend="repmgrd-pausing">pause repmgrd</link>
     operation on all nodes with a
     <link linkend="repmgr-service-pause"><command>single command</command></link>
     (see the example following this list)
    </simpara>
   </listitem>

   <listitem>
    <simpara>
     optional <link linkend="repmgrd-witness-server">witness server</link>
    </simpara>
   </listitem>

   <listitem>
    <simpara>
     "location" configuration option to restrict
     potential promotion candidates to a single location
     (e.g. when nodes are spread over multiple data centres)
    </simpara>
   </listitem>

   <listitem>
    <simpara>
     <link linkend="connection-check-type">choice of method</link> to determine node availability
     (PostgreSQL ping, query execution or new connection)
    </simpara>
   </listitem>

   <listitem>
    <simpara>
     retention of monitoring statistics (optional)
    </simpara>
   </listitem>

  </itemizedlist>
 </para>
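
 <para>
  As an illustrative sketch of the pause facility mentioned above (assuming the
  configuration file is located at <filename>/etc/repmgr.conf</filename>), &repmgrd;
  operation could be paused on all nodes, and later resumed, with:
  <programlisting>
$ repmgr -f /etc/repmgr.conf service pause
$ repmgr -f /etc/repmgr.conf service unpause</programlisting>
  See <link linkend="repmgrd-pausing">pause repmgrd</link> for details.
 </para>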

 <sect1 id="repmgrd-demonstration">

  <title>repmgrd demonstration</title>

  <para>
   To demonstrate automatic failover, set up a 3-node replication cluster (one primary
   and two standbys streaming directly from the primary) so that the cluster looks
   something like this:
   <programlisting>
$ repmgr -f /etc/repmgr.conf cluster show --compact
 ID | Name  | Role    | Status    | Upstream | Location | Prio.
----+-------+---------+-----------+----------+----------+-------
 1  | node1 | primary | * running |          | default  | 100
 2  | node2 | standby |   running | node1    | default  | 100
 3  | node3 | standby |   running | node1    | default  | 100</programlisting>
  </para>

  <tip>
   <para>
    See section <link linkend="repmgrd-automatic-failover-configuration">Required configuration for automatic failover</link>
    for an example of minimal <filename>repmgr.conf</filename> file settings suitable for use with &repmgrd;.
   </para>
  </tip>
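
  <para>
   As a rough illustration only (node names, paths and connection details here are
   placeholders, not recommendations; see the linked section above for the
   authoritative list of required settings), a minimal <filename>repmgr.conf</filename>
   for use with &repmgrd; might look something like this:
   <programlisting>
node_id=2
node_name=node2
conninfo='host=node2 dbname=repmgr user=repmgr connect_timeout=2'
data_directory='/var/lib/postgresql/data'

failover=automatic
promote_command='repmgr standby promote -f /etc/repmgr.conf --log-to-file'
follow_command='repmgr standby follow -f /etc/repmgr.conf --log-to-file --upstream-node-id=%n'</programlisting>
  </para>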

  <para>
   Start &repmgrd; on each standby and verify that it's running by examining the
   log output, which at log level <literal>INFO</literal> will look like this:
   <programlisting>
[2019-08-15 07:14:42] [NOTICE] repmgrd (repmgrd 5.0) starting up
[2019-08-15 07:14:42] [INFO] connecting to database "host=node2 dbname=repmgr user=repmgr connect_timeout=2"
INFO: set_repmgrd_pid(): provided pidfile is /var/run/repmgr/repmgrd-12.pid
[2019-08-15 07:14:42] [NOTICE] starting monitoring of node "node2" (ID: 2)
[2019-08-15 07:14:42] [INFO] monitoring connection to upstream node "node1" (ID: 1)</programlisting>
  </para>
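
  <para>
   (How &repmgrd; is started depends on how it was installed; as a sketch, the daemon
   whose startup log is shown above could have been launched directly with:
   <programlisting>
$ repmgrd -f /etc/repmgr.conf</programlisting>
   Packaged installations usually provide a service file for the operating system's
   service manager instead.)
  </para>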

  <para>
   Each &repmgrd; should also have recorded its successful startup as an event:
   <programlisting>
$ repmgr -f /etc/repmgr.conf cluster event --event=repmgrd_start
 Node ID | Name  | Event         | OK | Timestamp           | Details
---------+-------+---------------+----+---------------------+--------------------------------------------------------
 3       | node3 | repmgrd_start | t  | 2019-08-15 07:14:42 | monitoring connection to upstream node "node1" (ID: 1)
 2       | node2 | repmgrd_start | t  | 2019-08-15 07:14:41 | monitoring connection to upstream node "node1" (ID: 1)
 1       | node1 | repmgrd_start | t  | 2019-08-15 07:14:39 | monitoring cluster primary "node1" (ID: 1)</programlisting>
  </para>

  <para>
   Now stop the current primary server with e.g.:
   <programlisting>
pg_ctl -D /var/lib/postgresql/data -m immediate stop</programlisting>
  </para>

  <para>
   This will force the primary to shut down straight away, aborting all processes
   and transactions. This will cause a flurry of activity in the &repmgrd; log
   files as each &repmgrd; detects the failure of the primary and a failover
   decision is made. This is an extract from the log of a standby server (<literal>node2</literal>)
   which has promoted itself to become the new primary after failure of the original primary (<literal>node1</literal>).
   <programlisting>
[2019-08-15 07:27:50] [WARNING] unable to connect to upstream node "node1" (ID: 1)
[2019-08-15 07:27:50] [INFO] checking state of node 1, 1 of 3 attempts
[2019-08-15 07:27:50] [INFO] sleeping 5 seconds until next reconnection attempt
[2019-08-15 07:27:55] [INFO] checking state of node 1, 2 of 3 attempts
[2019-08-15 07:27:55] [INFO] sleeping 5 seconds until next reconnection attempt
[2019-08-15 07:28:00] [INFO] checking state of node 1, 3 of 3 attempts
[2019-08-15 07:28:00] [WARNING] unable to reconnect to node 1 after 3 attempts
[2019-08-15 07:28:00] [INFO] primary and this node have the same location ("default")
[2019-08-15 07:28:00] [INFO] local node's last receive lsn: 0/900CBF8
[2019-08-15 07:28:00] [INFO] node 3 last saw primary node 12 second(s) ago
[2019-08-15 07:28:00] [INFO] last receive LSN for sibling node "node3" (ID: 3) is: 0/900CBF8
[2019-08-15 07:28:00] [INFO] node "node3" (ID: 3) has same LSN as current candidate "node2" (ID: 2)
[2019-08-15 07:28:00] [INFO] visible nodes: 2; total nodes: 2; no nodes have seen the primary within the last 4 seconds
[2019-08-15 07:28:00] [NOTICE] promotion candidate is "node2" (ID: 2)
[2019-08-15 07:28:00] [NOTICE] this node is the winner, will now promote itself and inform other nodes
[2019-08-15 07:28:00] [INFO] promote_command is:
  "/usr/pgsql-12/bin/repmgr -f /etc/repmgr/12/repmgr.conf standby promote"
NOTICE: promoting standby to primary
DETAIL: promoting server "node2" (ID: 2) using "/usr/pgsql-12/bin/pg_ctl -w -D '/var/lib/pgsql/12/data' promote"
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "node2" (ID: 2) was successfully promoted to primary
[2019-08-15 07:28:01] [INFO] 3 followers to notify
[2019-08-15 07:28:01] [NOTICE] notifying node "node3" (ID: 3) to follow node 2
INFO: node 3 received notification to follow node 2
[2019-08-15 07:28:01] [INFO] switching to primary monitoring mode
[2019-08-15 07:28:01] [NOTICE] monitoring cluster primary "node2" (ID: 2)</programlisting>
  </para>

  <para>
   The cluster status will now look like this, with the original primary (<literal>node1</literal>)
   marked as inactive, and standby <literal>node3</literal> now following the new primary
   (<literal>node2</literal>):
   <programlisting>
$ repmgr -f /etc/repmgr.conf cluster show --compact
 ID | Name  | Role    | Status    | Upstream | Location | Prio.
----+-------+---------+-----------+----------+----------+-------
 1  | node1 | primary | - failed  |          | default  | 100
 2  | node2 | primary | * running |          | default  | 100
 3  | node3 | standby |   running | node2    | default  | 100</programlisting>
  </para>

  <para>
   <link linkend="repmgr-cluster-event"><command>repmgr cluster event</command></link> will display a summary of
   what happened to each server during the failover:
   <programlisting>
$ repmgr -f /etc/repmgr.conf cluster event
 Node ID | Name  | Event                      | OK | Timestamp           | Details
---------+-------+----------------------------+----+---------------------+-------------------------------------------------------------
 3       | node3 | repmgrd_failover_follow    | t  | 2019-08-15 07:38:03 | node 3 now following new upstream node 2
 3       | node3 | standby_follow             | t  | 2019-08-15 07:38:02 | standby attached to upstream node "node2" (ID: 2)
 2       | node2 | repmgrd_reload             | t  | 2019-08-15 07:38:01 | monitoring cluster primary "node2" (ID: 2)
 2       | node2 | repmgrd_failover_promote   | t  | 2019-08-15 07:38:01 | node 2 promoted to primary; old primary 1 marked as failed
 2       | node2 | standby_promote            | t  | 2019-08-15 07:38:01 | server "node2" (ID: 2) was successfully promoted to primary</programlisting>
  </para>

 </sect1>

</chapter>