Files
repmgr/doc/repmgr-node-check.xml
Ian Barwick 3870768d80 Add --repmgrd option to "repmgr node check"
This provides a simple way for checking whether the node's repmgrd is
running.

GitHub #719.
2021-09-28 09:46:31 +09:00

309 lines
9.3 KiB
XML

<refentry id="repmgr-node-check">
<indexterm>
<primary>repmgr node check</primary>
</indexterm>
<refmeta>
<refentrytitle>repmgr node check</refentrytitle>
</refmeta>
<refnamediv>
<refname>repmgr node check</refname>
<refpurpose>performs some health checks on a node from a replication perspective</refpurpose>
</refnamediv>
<refsect1>
<title>Description</title>
<para>
Performs some health checks on a node from a replication perspective.
This command must be run on the local node.
</para>
<note>
<para>
Currently &repmgr; performs health checks on physical replication
slots only, with the aim of warning about streaming replication standbys which
have become detached and the associated risk of uncontrolled WAL file
growth.
</para>
</note>
</refsect1>
<refsect1>
<title>Example</title>
<para>
Execution on the primary server:
<programlisting>
$ repmgr -f /etc/repmgr.conf node check
Node "node1":
Server role: OK (node is primary)
Replication lag: OK (N/A - node is primary)
WAL archiving: OK (0 pending files)
Upstream connection: OK (N/A - is primary)
Downstream servers: OK (2 of 2 downstream nodes attached)
Replication slots: OK (node has no physical replication slots)
Missing replication slots: OK (node has no missing physical replication slots)
Configured data directory: OK (configured "data_directory" is "/var/lib/postgresql/data")</programlisting>
</para>
<para>
Execution on a standby server:
<programlisting>
$ repmgr -f /etc/repmgr.conf node check
Node "node2":
Server role: OK (node is standby)
Replication lag: OK (0 seconds)
WAL archiving: OK (0 pending archive ready files)
Upstream connection: OK (node "node2" (ID: 2) is attached to expected upstream node "node1" (ID: 1))
Downstream servers: OK (this node has no downstream nodes)
Replication slots: OK (node has no physical replication slots)
Missing physical replication slots: OK (node has no missing physical replication slots)
Configured data directory: OK (configured "data_directory" is "/var/lib/postgresql/data")</programlisting>
</para>
</refsect1>
<refsect1>
<title>Individual checks</title>
<para>
Each check can be performed individually by supplying
an additional command line parameter, e.g.:
<programlisting>
$ repmgr node check --role
OK (node is primary)</programlisting>
</para>
<para>
Parameters for individual checks are as follows:
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<simpara>
<option>--role</option>: checks if the node has the expected role
</simpara>
</listitem>
<listitem>
<simpara>
<option>--replication-lag</option>: checks if the node is lagging by more than
<varname>replication_lag_warning</varname> or <varname>replication_lag_critical</varname>
</simpara>
</listitem>
<listitem>
<simpara>
<option>--archive-ready</option>: checks for WAL files which have not yet been archived,
and returns <literal>WARNING</literal> or <literal>CRITICAL</literal> if the number
exceeds <varname>archive_ready_warning</varname> or <varname>archive_ready_critical</varname> respectively.
</simpara>
</listitem>
<listitem>
<simpara>
<option>--downstream</option>: checks that the expected downstream nodes are attached
</simpara>
</listitem>
<listitem>
<simpara>
<option>--upstream</option>: checks that the node is attached to its expected upstream
</simpara>
</listitem>
<listitem>
<simpara>
<option>--slots</option>: checks there are no inactive physical replication slots
</simpara>
</listitem>
<listitem>
<simpara>
<option>--missing-slots</option>: checks there are no missing physical replication slots
</simpara>
</listitem>
<listitem>
<simpara>
<option>--data-directory-config</option>: checks the data directory configured in
<filename>repmgr.conf</filename> matches the actual data directory.
This check is not directly related to replication, but is useful to verify &repmgr;
is correctly configured.
</simpara>
</listitem>
</itemizedlist>
</para>
</refsect1>
<refsect1>
<title>repmgrd</title>
<para>
A separate check is available to verify whether &repmgrd; is running,
This is not included in the general output, as this does not
per-se constitute a check of the node's replication status.
</para>
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<simpara>
<option>--repmgrd</option>: checks whether &repmgrd; is running.
If &repmgrd; is running but paused, status <literal>1</literal>
(<literal>WARNING</literal>) is returned.
</simpara>
</listitem>
</itemizedlist>
</refsect1>
<refsect1>
<title>Additional checks</title>
<para>
Several checks are provided for diagnostic purposes and are not
included in the general output:
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<simpara>
<option>--db-connection</option>: checks if &repmgr; can connect to the
database on the local node.
</simpara>
<simpara>
This option is particularly useful in combination with <command>SSH</command>, as
it can be used to troubleshoot connection issues encountered when &repmgr; is
executed remotely (e.g. during a switchover operation).
</simpara>
</listitem>
<listitem>
<simpara>
<option>--replication-config-owner</option>: checks if the file containing replication
configuration (PostgreSQL 12 and later: <filename>postgresql.auto.conf</filename>;
PostgreSQL 11 and earlier: <filename>recovery.conf</filename>) is
owned by the same user who owns the data directory.
</simpara>
<simpara>
Incorrect ownership of these files (e.g. if they are owned by <literal>root</literal>)
will cause operations which need to update the replication configuration
(e.g. <link linkend="repmgr-standby-follow"><command>repmgr standby follow</command></link>
or <link linkend="repmgr-standby-promote"><command>repmgr standby promote</command></link>)
to fail.
</simpara>
</listitem>
</itemizedlist>
</para>
</refsect1>
<refsect1>
<title>Connection options</title>
<para>
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<simpara>
<option>-S</option>/<option>--superuser</option>: connect as the
named superuser instead of the &repmgr; user
</simpara>
</listitem>
</itemizedlist>
</para>
</refsect1>
<refsect1>
<title>Output format</title>
<para>
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<simpara>
<option>--csv</option>: generate output in CSV format (not available
for individual checks)
</simpara>
</listitem>
<listitem>
<simpara>
<option>--nagios</option>: generate output in a Nagios-compatible format
(for individual checks only)
</simpara>
</listitem>
</itemizedlist>
</para>
</refsect1>
<refsect1>
<title>Exit codes</title>
<para>
When executing <command>repmgr node check</command> with one of the individual
checks listed above, &repmgr; will emit one of the following Nagios-style exit codes
(even if <option>--nagios</option> is not supplied):
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<simpara>
<literal>0</literal>: OK
</simpara>
</listitem>
<listitem>
<simpara>
<literal>1</literal>: WARNING
</simpara>
</listitem>
<listitem>
<simpara>
<literal>2</literal>: ERROR
</simpara>
</listitem>
<listitem>
<simpara>
<literal>3</literal>: UNKNOWN
</simpara>
</listitem>
</itemizedlist>
</para>
<para>
One of the following exit codes will be emitted by <command>repmgr status check</command>
if no individual check was specified.
</para>
<variablelist>
<varlistentry>
<term><option>SUCCESS (0)</option></term>
<listitem>
<para>
No issues were detected.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term><option>ERR_NODE_STATUS (25)</option></term>
<listitem>
<para>
One or more issues were detected.
</para>
</listitem>
</varlistentry>
</variablelist>
</refsect1>
<refsect1>
<title>See also</title>
<para>
<xref linkend="repmgr-node-status"/>, <xref linkend="repmgr-cluster-show"/>
</para>
</refsect1>
</refentry>