mirror of
https://github.com/EnterpriseDB/repmgr.git
synced 2026-03-22 22:56:29 +00:00
If repmgrd is running in degraded mode on a primary which has been stopped, then manually brought back online as a standby (e.g. by creating recovery.conf and starting the server), ensure it not only detects the change but also automatically updates the node record so it can resume monitoring the node as a standby. Previously, repmgrd looped while waiting for the record to be updated (as is done transparently when executing "repmgr node rejoin"), but if the record was not updated within the timeout period (e.g. by "repmgr standby register") it would fail to resume monitoring as a standby. It seems reasonable to have repmgrd automatically update the node record, as this will restore failover capability as quickly as possible. If this is not desired, then the onus is on the user to shut down repmgrd while making the desired changes.
<chapter id="repmgrd-degraded-monitoring">
 <indexterm>
  <primary>repmgrd</primary>
  <secondary>degraded monitoring</secondary>
 </indexterm>

 <title>"degraded monitoring" mode</title>
 <para>
  In certain circumstances, <application>repmgrd</application> is not able to fulfill its primary mission
  of monitoring the node's upstream server. In these cases it enters "degraded
  monitoring" mode, where <application>repmgrd</application> remains active but is waiting for the situation
  to be resolved.
 </para>
 <para>
  Situations where this happens are:
  <itemizedlist spacing="compact" mark="bullet">

   <listitem>
    <simpara>a failover situation has occurred, but no nodes in the primary node's location are visible</simpara>
   </listitem>

   <listitem>
    <simpara>a failover situation has occurred, but no promotion candidate is available</simpara>
   </listitem>

   <listitem>
    <simpara>a failover situation has occurred, but the promotion candidate could not be promoted</simpara>
   </listitem>

   <listitem>
    <simpara>a failover situation has occurred, but the node was unable to follow the new primary</simpara>
   </listitem>

   <listitem>
    <simpara>a failover situation has occurred, but no primary has become available</simpara>
   </listitem>

   <listitem>
    <simpara>a failover situation has occurred, but automatic failover is not enabled for the node</simpara>
   </listitem>

   <listitem>
    <simpara><application>repmgrd</application> is monitoring the primary node, but it is not available (and no other node has been promoted as primary)</simpara>
   </listitem>
  </itemizedlist>
 </para>

 <para>
  Example output in a situation where there is only one standby with <literal>failover=manual</literal>,
  and the primary node is unavailable (but is later restarted):
  <programlisting>
[2017-08-29 10:59:19] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state (automatic failover disabled)
[2017-08-29 10:59:33] [WARNING] unable to connect to upstream node "node1" (node ID: 1)
[2017-08-29 10:59:33] [INFO] checking state of node 1, 1 of 5 attempts
[2017-08-29 10:59:33] [INFO] sleeping 1 seconds until next reconnection attempt
(...)
[2017-08-29 10:59:37] [INFO] checking state of node 1, 5 of 5 attempts
[2017-08-29 10:59:37] [WARNING] unable to reconnect to node 1 after 5 attempts
[2017-08-29 10:59:37] [NOTICE] this node is not configured for automatic failover so will not be considered as promotion candidate
[2017-08-29 10:59:37] [NOTICE] no other nodes are available as promotion candidate
[2017-08-29 10:59:37] [HINT] use "repmgr standby promote" to manually promote this node
[2017-08-29 10:59:37] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in degraded state (automatic failover disabled)
[2017-08-29 10:59:53] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in degraded state (automatic failover disabled)
[2017-08-29 11:00:45] [NOTICE] reconnected to upstream node 1 after 68 seconds, resuming monitoring
[2017-08-29 11:00:57] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state (automatic failover disabled)</programlisting>

 </para>
 <para>
  By default, <application>repmgrd</application> will continue in degraded monitoring mode indefinitely.
  However, a timeout (in seconds) can be set with <varname>degraded_monitoring_timeout</varname>,
  after which <application>repmgrd</application> will terminate.
 </para>
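 <para>
  As a minimal sketch, assuming <application>repmgrd</application> should give up after five minutes
  in degraded monitoring mode, the setting might be added to <filename>repmgr.conf</filename> as shown
  here. The value <literal>300</literal> is purely illustrative and not a recommendation:
  <programlisting>
degraded_monitoring_timeout=300</programlisting>
 </para>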

 <note>
  <para>
   If <application>repmgrd</application> is monitoring a primary node which has been stopped
   and manually restarted as a standby attached to a new primary, it will automatically detect
   the status change and update the node record to reflect the node's new status
   as an active standby. It will then resume monitoring the node as a standby.
  </para>
 </note>

</chapter>