mirror of
https://github.com/EnterpriseDB/repmgr.git
synced 2026-03-22 22:56:29 +00:00
docs: finalize conversion of existing BDR repmgr documentation
This commit is contained in:
@@ -13,7 +13,7 @@
|
||||
<title>repmgr user permissions</title>
|
||||
<para>
|
||||
&repmgr; will create an extension in the BDR database containing objects
|
||||
for administering &repmgr; metadata. The user defined in the `conninfo`
|
||||
for administering &repmgr; metadata. The user defined in the <varname>conninfo</varname>
|
||||
setting must be able to access all objects. Additionally, superuser permissions
|
||||
are required to install the &repmgr; extension. The easiest way to do this
|
||||
is create the &repmgr; user as a superuser, however if this is not
|
||||
|
||||
@@ -167,5 +167,232 @@
|
||||
chronological order).
|
||||
</para>
|
||||
</sect1>
|
||||
|
||||
<sect1 id="bdr-event-notification-command" xreflabel="BDR failover event notification command">
|
||||
<title>Defining the "event_notification_command"</title>
|
||||
<para>
|
||||
Key to "failover" execution is the <literal>event_notification_command</literal>,
|
||||
which is a user-definable script specified in <filename>repmpgr.conf</filename>
|
||||
and which should reconfigure the proxy server/ connection pooler to point
|
||||
to the other, still-active node.
|
||||
</para>
|
||||
<para>
|
||||
Each time &repmgr; (or <application>repmgrd</application>) records an event,
|
||||
it can optionally execute the script defined in
|
||||
<literal>event_notification_command</literal> to take further action;
|
||||
details of the event will be passed as parameters.
|
||||
</para>
|
||||
<para>
|
||||
Following placeholders are available to the script:
|
||||
</para>
|
||||
|
||||
<variablelist>
|
||||
<varlistentry>
|
||||
<term><option>%n</option></term>
|
||||
<listitem>
|
||||
<para>
|
||||
node ID
|
||||
</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
|
||||
<varlistentry>
|
||||
<term><option>%e</option></term>
|
||||
<listitem>
|
||||
<para>
|
||||
event type
|
||||
</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
|
||||
<varlistentry>
|
||||
<term><option>%t</option></term>
|
||||
<listitem>
|
||||
<para>
|
||||
success (1 or 0)
|
||||
</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
<varlistentry>
|
||||
<term><option>%t</option></term>
|
||||
<listitem>
|
||||
<para>
|
||||
timestamp
|
||||
</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
|
||||
<varlistentry>
|
||||
<term><option>%d</option></term>
|
||||
<listitem>
|
||||
<para>
|
||||
details
|
||||
</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
</variablelist>
|
||||
|
||||
<para>
|
||||
Note that <literal>%c</literal> and <literal>%a</literal> will only be provided during
|
||||
<varname>bdr_failover</varname> events, which is what is of interest here.
|
||||
</para>
|
||||
<para>
|
||||
The provided sample script (`scripts/bdr-pgbouncer.sh`) is configured like
|
||||
this:
|
||||
<programlisting>
|
||||
event_notification_command='/path/to/bdr-pgbouncer.sh %n %e %s "%c" "%a"'</programlisting>
|
||||
</para>
|
||||
<para>
|
||||
and parses the configures parameters like this:
|
||||
<programlisting>
|
||||
NODE_ID=$1
|
||||
EVENT_TYPE=$2
|
||||
SUCCESS=$3
|
||||
NEXT_CONNINFO=$4
|
||||
NEXT_NODE_NAME=$5</programlisting>
|
||||
</para>
|
||||
<para>
|
||||
The script also contains some hard-coded values about the <application>PgBouncer</application>
|
||||
configuration for both nodes; these will need to be adjusted for your local environment
|
||||
(ideally the scripts would be maintained as templates and generated by some
|
||||
kind of provisioning system).
|
||||
</para>
|
||||
|
||||
<para>
|
||||
The script performs following steps:
|
||||
<itemizedlist spacing="compact" mark="bullet">
|
||||
<listitem>
|
||||
<simpara>pauses <application>PgBouncer</application> on all nodes</simpara>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<simpara>recreates the <application>PgBouncer</application> configuration file on each
|
||||
node using the information provided by <application>repmgrd</application>
|
||||
(primarily the <varname>conninfo</varname> string) to configure
|
||||
<application>PgBouncer</application></simpara>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<simpara>reloads the <application>PgBouncer</application> configuration</simpara>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<simpara>executes the <command>RESUME</command> command (in <application>PgBouncer</application>)</simpara>
|
||||
</listitem>
|
||||
</itemizedlist>
|
||||
</para>
|
||||
<para>
|
||||
Following successful script execution, any connections to PgBouncer on the failed BDR node
|
||||
will be redirected to the active node.
|
||||
</para>
|
||||
</sect1>
|
||||
|
||||
<sect1 id="bdr-monitoring-failover" xreflabel="Node monitoring and failover">
|
||||
<title>Node monitoring and failover</title>
|
||||
<para>
|
||||
At the intervals specified by <varname>monitor_interval_secs</varname>
|
||||
in <filename>repmgr.conf</filename>, <application>repmgrd</application>
|
||||
will ping each node to check if it's available. If a node isn't available,
|
||||
<application>repmgrd</application> will enter failover mode and check <varname>reconnect_attempts</varname>
|
||||
times at intervals of <varname>reconnect_interval</varname> to confirm the node is definitely unreachable.
|
||||
This buffer period is necessary to avoid false positives caused by transient
|
||||
network outages.
|
||||
</para>
|
||||
<para>
|
||||
If the node is still unavailable, <application>repmgrd</application> will enter failover mode and execute
|
||||
the script defined in <varname>event_notification_command</varname>; an entry will be logged
|
||||
in the <literal>repmgr.events</literal> table and <application>repmgrd</application> will
|
||||
(unless otherwise configured) resume monitoring of the node in "degraded" mode until it reappears.
|
||||
</para>
|
||||
<para>
|
||||
<application>repmgrd</application> logfile output during a failover event will look something like this
|
||||
on one node (usually the node which has failed, here <literal>node2</literal>):
|
||||
<programlisting>
|
||||
...
|
||||
[2017-07-27 21:08:39] [INFO] starting continuous BDR node monitoring
|
||||
[2017-07-27 21:08:39] [INFO] monitoring BDR replication status on node "node2" (ID: 2)
|
||||
[2017-07-27 21:08:55] [INFO] monitoring BDR replication status on node "node2" (ID: 2)
|
||||
[2017-07-27 21:09:11] [INFO] monitoring BDR replication status on node "node2" (ID: 2)
|
||||
[2017-07-27 21:09:23] [WARNING] unable to connect to node node2 (ID 2)
|
||||
[2017-07-27 21:09:23] [INFO] checking state of node 2, 0 of 5 attempts
|
||||
[2017-07-27 21:09:23] [INFO] sleeping 1 seconds until next reconnection attempt
|
||||
[2017-07-27 21:09:24] [INFO] checking state of node 2, 1 of 5 attempts
|
||||
[2017-07-27 21:09:24] [INFO] sleeping 1 seconds until next reconnection attempt
|
||||
[2017-07-27 21:09:25] [INFO] checking state of node 2, 2 of 5 attempts
|
||||
[2017-07-27 21:09:25] [INFO] sleeping 1 seconds until next reconnection attempt
|
||||
[2017-07-27 21:09:26] [INFO] checking state of node 2, 3 of 5 attempts
|
||||
[2017-07-27 21:09:26] [INFO] sleeping 1 seconds until next reconnection attempt
|
||||
[2017-07-27 21:09:27] [INFO] checking state of node 2, 4 of 5 attempts
|
||||
[2017-07-27 21:09:27] [INFO] sleeping 1 seconds until next reconnection attempt
|
||||
[2017-07-27 21:09:28] [WARNING] unable to reconnect to node 2 after 5 attempts
|
||||
[2017-07-27 21:09:28] [NOTICE] setting node record for node 2 to inactive
|
||||
[2017-07-27 21:09:28] [INFO] executing notification command for event "bdr_failover"
|
||||
[2017-07-27 21:09:28] [DETAIL] command is:
|
||||
/path/to/bdr-pgbouncer.sh 2 bdr_failover 1 "host=host=node1 dbname=bdrtest user=repmgr connect_timeout=2" "node1"
|
||||
[2017-07-27 21:09:28] [INFO] node 'node2' (ID: 2) detected as failed; next available node is 'node1' (ID: 1)
|
||||
[2017-07-27 21:09:28] [INFO] monitoring BDR replication status on node "node2" (ID: 2)
|
||||
[2017-07-27 21:09:28] [DETAIL] monitoring node "node2" (ID: 2) in degraded mode
|
||||
...</programlisting>
|
||||
</para>
|
||||
<para>
|
||||
Output on the other node (<literal>node1</literal>) during the same event will look like this:
|
||||
<programlisting>
|
||||
...
|
||||
[2017-07-27 21:08:35] [INFO] starting continuous BDR node monitoring
|
||||
[2017-07-27 21:08:35] [INFO] monitoring BDR replication status on node "node1" (ID: 1)
|
||||
[2017-07-27 21:08:51] [INFO] monitoring BDR replication status on node "node1" (ID: 1)
|
||||
[2017-07-27 21:09:07] [INFO] monitoring BDR replication status on node "node1" (ID: 1)
|
||||
[2017-07-27 21:09:23] [WARNING] unable to connect to node node2 (ID 2)
|
||||
[2017-07-27 21:09:23] [INFO] checking state of node 2, 0 of 5 attempts
|
||||
[2017-07-27 21:09:23] [INFO] sleeping 1 seconds until next reconnection attempt
|
||||
[2017-07-27 21:09:24] [INFO] checking state of node 2, 1 of 5 attempts
|
||||
[2017-07-27 21:09:24] [INFO] sleeping 1 seconds until next reconnection attempt
|
||||
[2017-07-27 21:09:25] [INFO] checking state of node 2, 2 of 5 attempts
|
||||
[2017-07-27 21:09:25] [INFO] sleeping 1 seconds until next reconnection attempt
|
||||
[2017-07-27 21:09:26] [INFO] checking state of node 2, 3 of 5 attempts
|
||||
[2017-07-27 21:09:26] [INFO] sleeping 1 seconds until next reconnection attempt
|
||||
[2017-07-27 21:09:27] [INFO] checking state of node 2, 4 of 5 attempts
|
||||
[2017-07-27 21:09:27] [INFO] sleeping 1 seconds until next reconnection attempt
|
||||
[2017-07-27 21:09:28] [WARNING] unable to reconnect to node 2 after 5 attempts
|
||||
[2017-07-27 21:09:28] [NOTICE] other node's repmgrd is handling failover
|
||||
[2017-07-27 21:09:28] [INFO] monitoring BDR replication status on node "node1" (ID: 1)
|
||||
[2017-07-27 21:09:28] [DETAIL] monitoring node "node2" (ID: 2) in degraded mode
|
||||
...</programlisting>
|
||||
</para>
|
||||
<para>
|
||||
This assumes only the PostgreSQL instance on <literal>node2</literal> has failed. In this case the
|
||||
<application>repmgrd</application> instance running on <literal>node2</literal> has performed the failover. However if
|
||||
the entire server becomes unavailable, <application>repmgrd</application> on <literal>node1</literal> will perform
|
||||
the failover.
|
||||
</para>
|
||||
</sect1>
|
||||
<sect1 id="bdr-node-recovery" xreflabel="Node recovery">
|
||||
<title>Node recovery</title>
|
||||
<para>
|
||||
Following failure of a BDR node, if the node subsequently becomes available again,
|
||||
a <varname>bdr_recovery</varname> event will be generated. This could potentially be used to
|
||||
reconfigure PgBouncer automatically to bring the node back into the available pool,
|
||||
however it would be prudent to manually verify the node's status before
|
||||
exposing it to the application.
|
||||
</para>
|
||||
<para>
|
||||
If the failed node comes back up and connects correctly, output similar to this
|
||||
will be visible in the <application>repmgrd</application> log:
|
||||
<programlisting>
|
||||
[2017-07-27 21:25:30] [DETAIL] monitoring node "node2" (ID: 2) in degraded mode
|
||||
[2017-07-27 21:25:46] [INFO] monitoring BDR replication status on node "node2" (ID: 2)
|
||||
[2017-07-27 21:25:46] [DETAIL] monitoring node "node2" (ID: 2) in degraded mode
|
||||
[2017-07-27 21:25:55] [INFO] active replication slot for node "node1" found after 1 seconds
|
||||
[2017-07-27 21:25:55] [NOTICE] node "node2" (ID: 2) has recovered after 986 seconds</programlisting>
|
||||
</para>
|
||||
</sect1>
|
||||
|
||||
<sect1 id="bdr-complete-shutdown" xreflabel="Shutdown of both nodes">
|
||||
<title>Shutdown of both nodes</title>
|
||||
<para>
|
||||
If both PostgreSQL instances are shut down, <application>repmgrd</application> will try and handle the
|
||||
situation as gracefully as possible, though with no failover candidates available
|
||||
there's not much it can do. Should this case ever occur, we recommend shutting
|
||||
down <application>repmgrd</application> on both nodes and restarting it once the PostgreSQL instances
|
||||
are running properly.
|
||||
</para>
|
||||
</sect1>
|
||||
</chapter>
|
||||
|
||||
|
||||
Reference in New Issue
Block a user