<chapter id="repmgrd-overview" xreflabel="repmgrd overview">
 <title>repmgrd overview</title>

 <indexterm>
   <primary>repmgrd</primary>
   <secondary>overview</secondary>
 </indexterm>

 <para>
   &repmgrd; ("<literal>replication manager daemon</literal>")
   is a management and monitoring daemon which runs
   on each node in a replication cluster. It can automate actions such as
   failover and updating standbys to follow the new primary, as well as
   providing monitoring information about the state of each standby.
 </para>
 <para>
   &repmgrd; is designed to be straightforward to set up
   and does not require additional external infrastructure.
 </para>
 <para>
   Functionality provided by &repmgrd; includes:
   <itemizedlist spacing="compact" mark="bullet">

     <listitem>
       <simpara>
         wide range of <link linkend="repmgrd-basic-configuration">configuration options</link>
       </simpara>
     </listitem>

     <listitem>
       <simpara>
         option to execute custom scripts ("<link linkend="event-notifications">event notifications</link>")
         at different points in the failover sequence
       </simpara>
     </listitem>

     <listitem>
       <simpara>
         ability to <link linkend="repmgrd-pausing">pause repmgrd</link>
         operation on all nodes with a
         <link linkend="repmgr-daemon-pause"><command>single command</command></link>
       </simpara>
     </listitem>

     <listitem>
       <simpara>
         optional <link linkend="repmgrd-witness-server">witness server</link>
       </simpara>
     </listitem>

     <listitem>
       <simpara>
         "location" configuration option to restrict
         potential promotion candidates to a single location
         (e.g. when nodes are spread over multiple data centres)
       </simpara>
     </listitem>

     <listitem>
       <simpara>
         <link linkend="connection-check-type">choice of method</link> to determine node availability
         (PostgreSQL ping, query execution or new connection)
       </simpara>
     </listitem>

     <listitem>
       <simpara>
         retention of monitoring statistics (optional)
       </simpara>
     </listitem>

   </itemizedlist>

 </para>
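 <para>
   As a rough illustration of how some of these features map onto
   <filename>repmgr.conf</filename>, the fragment below is a minimal sketch, not a
   recommended configuration. The location name <literal>dc1</literal> is a
   placeholder, and the parameter names and accepted values should be checked
   against the documentation for the version in use:
   <programlisting>
    # only nodes sharing this location are considered as promotion candidates
    # ("dc1" is a placeholder name)
    location='dc1'

    # method used to determine node availability: ping, query or connection
    connection_check_type=ping

    # optionally retain monitoring statistics
    monitoring_history=yes</programlisting>
 </para>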

 <sect1 id="repmgrd-demonstration">

   <title>repmgrd demonstration</title>
   <para>
     To demonstrate automatic failover, set up a 3-node replication cluster (one primary
     and two standbys streaming directly from the primary) so that the cluster looks
     something like this:
     <programlisting>
    $ repmgr -f /etc/repmgr.conf cluster show --compact
     ID | Name  | Role    | Status    | Upstream | Location | Prio.
    ----+-------+---------+-----------+----------+----------+-------
     1  | node1 | primary | * running |          | default  | 100
     2  | node2 | standby |   running | node1    | default  | 100
     3  | node3 | standby |   running | node1    | default  | 100</programlisting>
   </para>

   <tip>
     <para>
       See section <link linkend="repmgrd-automatic-failover-configuration">Required configuration for automatic failover</link>
       for an example of minimal <filename>repmgr.conf</filename> file settings suitable for use with &repmgrd;.
     </para>
   </tip>
   <para>
     Start &repmgrd; on each node and verify that it is running by examining the
     log output, which at log level <literal>INFO</literal> will look like this
     on a standby (here <literal>node2</literal>):
     <programlisting>
    [2019-03-15 06:32:05] [NOTICE] repmgrd (repmgrd 4.3) starting up
    [2019-03-15 06:32:05] [INFO] connecting to database "host=node2 dbname=repmgr user=repmgr connect_timeout=2"
    INFO: set_repmgrd_pid(): provided pidfile is /var/run/repmgr/repmgrd-11.pid
    [2019-03-15 06:32:05] [NOTICE] starting monitoring of node "node2" (ID: 2)
    [2019-03-15 06:32:05] [INFO] monitoring connection to upstream node "node1" (ID: 1)</programlisting>
   </para>
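   <para>
     The log level and PID file location seen in this output can be set in
     <filename>repmgr.conf</filename>. The fragment below is a minimal sketch: the log
     file path is hypothetical, and the PID file path is simply reused from the log
     extract (it may equally have been supplied on the command line):
     <programlisting>
    # log verbosity; INFO produces output like the extract shown here
    log_level=INFO

    # hypothetical path for the repmgrd log file
    log_file='/var/log/repmgr/repmgrd.log'

    # PID file path as reported by set_repmgrd_pid() in the log extract
    repmgrd_pid_file='/var/run/repmgr/repmgrd-11.pid'</programlisting>
   </para>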
   <para>
     Each &repmgrd; should also have recorded its successful startup as an event:
     <programlisting>
    $ repmgr -f /etc/repmgr.conf cluster event --event=repmgrd_start
     Node ID | Name  | Event         | OK | Timestamp           | Details
    ---------+-------+---------------+----+---------------------+--------------------------------------------------------
     3       | node3 | repmgrd_start | t  | 2019-03-14 04:17:30 | monitoring connection to upstream node "node1" (ID: 1)
     2       | node2 | repmgrd_start | t  | 2019-03-14 04:11:47 | monitoring connection to upstream node "node1" (ID: 1)
     1       | node1 | repmgrd_start | t  | 2019-03-14 04:04:31 | monitoring cluster primary "node1" (ID: 1)</programlisting>
   </para>
   <para>
     Now stop the current primary server with e.g.:
     <programlisting>
    pg_ctl -D /var/lib/postgresql/data -m immediate stop</programlisting>
   </para>
   <para>
     This forces the primary to shut down immediately, aborting all processes
     and transactions, and triggers a flurry of activity in the &repmgrd; log
     files as each &repmgrd; detects the failure of the primary and a failover
     decision is made. The following extract is from the log of a standby server
     (<literal>node2</literal>) which was promoted to primary after the failure of
     the original primary (<literal>node1</literal>):
     <programlisting>
    [2019-03-15 06:37:50] [WARNING] unable to connect to upstream node "node1" (ID: 1)
    [2019-03-15 06:37:50] [INFO] checking state of node 1, 1 of 3 attempts
    [2019-03-15 06:37:50] [INFO] sleeping 5 seconds until next reconnection attempt
    [2019-03-15 06:37:55] [INFO] checking state of node 1, 2 of 3 attempts
    [2019-03-15 06:37:55] [INFO] sleeping 5 seconds until next reconnection attempt
    [2019-03-15 06:38:00] [INFO] checking state of node 1, 3 of 3 attempts
    [2019-03-15 06:38:00] [WARNING] unable to reconnect to node 1 after 3 attempts
    [2019-03-15 06:38:00] [INFO] primary and this node have the same location ("default")
    [2019-03-15 06:38:00] [INFO] local node's last receive lsn: 0/900CBF8
    [2019-03-15 06:38:00] [INFO] node 3 last saw primary node 12 second(s) ago
    [2019-03-15 06:38:00] [INFO] last receive LSN for sibling node "node3" (ID: 3) is: 0/900CBF8
    [2019-03-15 06:38:00] [INFO] node "node3" (ID: 3) has same LSN as current candidate "node2" (ID: 2)
    [2019-03-15 06:38:00] [INFO] visible nodes: 2; total nodes: 2; no nodes have seen the primary within the last 4 seconds
    [2019-03-15 06:38:00] [NOTICE] promotion candidate is "node2" (ID: 2)
    [2019-03-15 06:38:00] [NOTICE] this node is the winner, will now promote itself and inform other nodes
    [2019-03-15 06:38:00] [INFO] promote_command is:
      "/usr/pgsql-11/bin/repmgr -f /etc/repmgr/11/repmgr.conf standby promote"
    NOTICE: promoting standby to primary
    DETAIL: promoting server "node2" (ID: 2) using "/usr/pgsql-11/bin/pg_ctl -w -D '/var/lib/pgsql/11/data' promote"
    NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
    NOTICE: STANDBY PROMOTE successful
    DETAIL: server "node2" (ID: 2) was successfully promoted to primary
    [2019-03-15 06:38:01] [INFO] 3 followers to notify
    [2019-03-15 06:38:01] [NOTICE] notifying node "node3" (ID: 3) to follow node 2
    INFO: node 3 received notification to follow node 2
    [2019-03-15 06:38:01] [INFO] switching to primary monitoring mode
    [2019-03-15 06:38:01] [NOTICE] monitoring cluster primary "node2" (ID: 2)</programlisting>
   </para>
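   <para>
     The retry pattern and promotion command in this log extract reflect settings in
     <filename>repmgr.conf</filename> on <literal>node2</literal>. The fragment below is
     a sketch of what those settings might look like. The reconnection values and
     <varname>promote_command</varname> are taken from the log, while
     <varname>failover</varname> and <varname>follow_command</varname> are assumptions
     added for completeness:
     <programlisting>
    # assumed: automatic failover must be enabled for repmgrd to promote a standby
    failover=automatic

    # from the log: three connection attempts, five seconds apart
    reconnect_attempts=3
    reconnect_interval=5

    # from the log extract
    promote_command='/usr/pgsql-11/bin/repmgr -f /etc/repmgr/11/repmgr.conf standby promote'

    # assumed; %n is replaced with the ID of the new primary
    follow_command='/usr/pgsql-11/bin/repmgr -f /etc/repmgr/11/repmgr.conf standby follow --upstream-node-id=%n'</programlisting>
   </para>
   <para>
     With these values the first failed check (06:37:50) and the final attempt
     (06:38:00) are (3 - 1) x 5 = 10 seconds apart, which matches the timestamps
     in the log.
   </para>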
   <para>
     The cluster status will now look like this, with the original primary (<literal>node1</literal>)
     marked as failed, and standby <literal>node3</literal> now following the new primary
     (<literal>node2</literal>):
     <programlisting>
    $ repmgr -f /etc/repmgr.conf cluster show --compact
     ID | Name  | Role    | Status    | Upstream | Location | Prio.
    ----+-------+---------+-----------+----------+----------+-------
     1  | node1 | primary | - failed  |          | default  | 100
     2  | node2 | primary | * running |          | default  | 100
     3  | node3 | standby |   running | node2    | default  | 100</programlisting>

   </para>
   <para>
     <link linkend="repmgr-cluster-event"><command>repmgr cluster event</command></link> will display a summary of
     what happened to each server during the failover:
     <programlisting>
    $ repmgr -f /etc/repmgr.conf cluster event
     Node ID | Name  | Event                      | OK | Timestamp           | Details
    ---------+-------+----------------------------+----+---------------------+-------------------------------------------------------------
     3       | node3 | repmgrd_failover_follow    | t  | 2019-03-15 06:38:03 | node 3 now following new upstream node 2
     3       | node3 | standby_follow             | t  | 2019-03-15 06:38:02 | standby attached to upstream node "node2" (ID: 2)
     2       | node2 | repmgrd_reload             | t  | 2019-03-15 06:38:01 | monitoring cluster primary "node2" (ID: 2)
     2       | node2 | repmgrd_failover_promote   | t  | 2019-03-15 06:38:01 | node 2 promoted to primary; old primary 1 marked as failed
     2       | node2 | standby_promote            | t  | 2019-03-15 06:38:01 | server "node2" (ID: 2) was successfully promoted to primary</programlisting>
   </para>

 </sect1>
</chapter>