mirror of
https://github.com/EnterpriseDB/repmgr.git
synced 2026-03-23 07:06:30 +00:00
Previously repmgr was relying on whatever command was configured to start PostgreSQL to determine whether the node being rejoined had started correctly. However it's preferable to actively poll the upstream to confirm it has restarted and actually attached as a standby before confirming success of the "node rejoin" action. This can be overridden with the -W/--no-wait option. (Note that for consistency with other PostgreSQL utilities, the short form of the --wait option is now "-w"; this is currently only used in "repmgr standby follow".) Also update "repmgr node rejoin" documentation with a list of supported options, and add some useful index entries for "pg_rewind". Implements GitHub #415.
324 lines
14 KiB
Plaintext
324 lines
14 KiB
Plaintext
<chapter id="performing-switchover" xreflabel="Performing a switchover with repmgr">
|
|
|
|
<indexterm>
|
|
<primary>switchover</primary>
|
|
</indexterm>
|
|
|
|
<title>Performing a switchover with repmgr</title>
|
|
<para>
|
|
A typical use-case for replication is a combination of primary and standby
|
|
server, with the standby serving as a backup which can easily be activated
|
|
in case of a problem with the primary. Such an unplanned failover would
|
|
normally be handled by promoting the standby, after which an appropriate
|
|
action must be taken to restore the old primary.
|
|
</para>
|
|
<para>
|
|
In some cases however it's desirable to promote the standby in a planned
|
|
way, e.g. so maintenance can be performed on the primary; this kind of switchover
|
|
is supported by the <xref linkend="repmgr-standby-switchover"> command.
|
|
</para>
|
|
<para>
|
|
<command>repmgr standby switchover</command> differs from other &repmgr;
|
|
actions in that it also performs actions on another server (the demotion
|
|
candidate), which means passwordless SSH access is required to that server
|
|
from the one where <command>repmgr standby switchover</command> is executed.
|
|
</para>
|
|
<note>
|
|
<simpara>
|
|
<command>repmgr standby switchover</command> performs a relatively complex
|
|
series of operations on two servers, and should therefore be performed after
|
|
careful preparation and with adequate attention. In particular you should
|
|
be confident that your network environment is stable and reliable.
|
|
</simpara>
|
|
<simpara>
|
|
Additionally you should be sure that the current primary can be shut down
|
|
quickly and cleanly. In particular, access from applications should be
|
|
minimalized or preferably blocked completely. Also be aware that if there
|
|
is a backlog of files waiting to be archived, PostgreSQL will not shut
|
|
down until archiving completes.
|
|
</simpara>
|
|
<simpara>
|
|
We recommend running <command>repmgr standby switchover</command> at the
|
|
most verbose logging level (<literal>--log-level=DEBUG --verbose</literal>)
|
|
and capturing all output to assist troubleshooting any problems.
|
|
</simpara>
|
|
<simpara>
|
|
Please also read carefully the sections <xref linkend="preparing-for-switchover"> and
|
|
<xref linkend="switchover-caveats"> below.
|
|
</simpara>
|
|
</note>
|
|
|
|
<sect1 id="preparing-for-switchover" xreflabel="Preparing for switchover">
|
|
<indexterm>
|
|
<primary>switchover</primary>
|
|
<secondary>preparation</secondary>
|
|
</indexterm>
|
|
<title>Preparing for switchover</title>
|
|
|
|
<para>
|
|
As mentioned in the previous section, success of the switchover operation depends on
|
|
&repmgr; being able to shut down the current primary server quickly and cleanly.
|
|
</para>
|
|
|
|
<para>
|
|
Ensure that a passwordless SSH connection is possible from the promotion candidate
|
|
(standby) to the demotion candidate (current primary). If <literal>--siblings-follow</literal>
|
|
will be used, ensure that passwordless SSH connections are possible from the
|
|
promotion candidate to all standbys attached to the demotion candidate.
|
|
</para>
|
|
|
|
<note>
|
|
<simpara>
|
|
&repmgr; expects to find the &repmgr; binary in the same path on the remote
|
|
server as on the local server.
|
|
</simpara>
|
|
</note>
|
|
|
|
<para>
|
|
Double-check which commands will be used to stop/start/restart the current
|
|
primary; on the primary execute:
|
|
<programlisting>
|
|
repmgr -f /etc/repmgr.conf node service --list --action=stop
|
|
repmgr -f /etc/repmgr.conf node service --list --action=start
|
|
repmgr -f /etc/repmgr.conf node service --list --action=restart</programlisting>
|
|
</para>
|
|
|
|
<para>
|
|
These commands can be defined in <filename>repmgr.conf</filename> with
|
|
<option>service_start_command</option>, <option>service_stop_command</option>
|
|
and <option>service_restart_command</option>.
|
|
</para>
|
|
|
|
<important>
|
|
<para>
|
|
If &repmgr; is installed from a package. you should set these commands
|
|
to use the appropriate service commands defined by the package/operating
|
|
system as these will ensure PostgreSQL is stopped/started properly
|
|
taking into account configuration and log file locations etc.
|
|
</para>
|
|
<para>
|
|
If the <option>service_*_command</option> options aren't defined, &repmgr; will
|
|
fall back to using <application>pg_ctl</application> to stop/start/restart
|
|
PostgreSQL, which may not work properly.
|
|
</para>
|
|
</important>
|
|
|
|
<note>
|
|
<simpara>
|
|
On <literal>systemd</literal> systems we strongly recommend using the appropriate
|
|
<command>systemctl</command> commands (typically run via <command>sudo</command>) to ensure
|
|
<literal>systemd</literal> is informed about the status of the PostgreSQL service.
|
|
</simpara>
|
|
<simpara>
|
|
If using <command>sudo</command> for the <command>systemctl</command> calls, make sure the
|
|
<command>sudo</command> specification doesn't require a real tty for the user. If not set
|
|
this way, <command>repmgr</command> will fail to stop the primary.
|
|
</simpara>
|
|
</note>
|
|
|
|
<para>
|
|
Check that access from applications is minimalized or preferably blocked
|
|
completely, so applications are not unexpectedly interrupted.
|
|
</para>
|
|
|
|
<para>
|
|
Check there is no significant replication lag on standbys attached to the
|
|
current primary.
|
|
</para>
|
|
|
|
<para>
|
|
If WAL file archiving is set up, check that there is no backlog of files waiting
|
|
to be archived, as PostgreSQL will not finally shut down until all of these have been
|
|
archived. If there is a backlog exceeding <varname>archive_ready_warning</varname> WAL files,
|
|
&repmgr; will emit a warning before attempting to perform a switchover; you can also check
|
|
manually with <command>repmgr node check --archive-ready</command>.
|
|
</para>
|
|
|
|
<para>
|
|
Ensure that <application>repmgrd</application> is *not* running anywhere to prevent it unintentionally
|
|
promoting a node.
|
|
</para>
|
|
|
|
<para>
|
|
Finally, consider executing <command>repmgr standby switchover</command> with the
|
|
<literal>--dry-run</literal> option; this will perform any necessary checks and inform you about
|
|
success/failure, and stop before the first actual command is run (which would be the shutdown of the
|
|
current primary). Example output:
|
|
<programlisting>
|
|
$ repmgr standby switchover -f /etc/repmgr.conf --siblings-follow --dry-run
|
|
NOTICE: checking switchover on node "node2" (ID: 2) in --dry-run mode
|
|
INFO: SSH connection to host "node1" succeeded
|
|
INFO: archive mode is "off"
|
|
INFO: replication lag on this standby is 0 seconds
|
|
INFO: all sibling nodes are reachable via SSH
|
|
NOTICE: local node "node2" (ID: 2) will be promoted to primary; current primary "node1" (ID: 1) will be demoted to standby
|
|
INFO: following shutdown command would be run on node "node1":
|
|
"pg_ctl -l /var/log/postgresql/startup.log -D '/var/lib/postgresql/data' -m fast -W stop"
|
|
</programlisting>
|
|
</para>
|
|
|
|
<important>
|
|
<para>
|
|
Be aware that <option>--dry-run</option> checks the prerequisites
|
|
for performing the switchover and some basic sanity checks on the
|
|
state of the database which might effect the switchover operation
|
|
(e.g. replication lag); it cannot however guarantee the switchover
|
|
operation will succeed. In particular, if the current primary
|
|
does not shut down cleanly, &repmgr; will not be able to reliably
|
|
execute the switchover (as there would be a danger of divergence
|
|
between the former and new primary nodes).
|
|
</para>
|
|
</important>
|
|
|
|
|
|
<note>
|
|
<simpara>
|
|
See <xref linkend="repmgr-standby-switchover"> for a full list of available
|
|
command line options and <filename>repmgr.conf</filename> settings relevant
|
|
to performing a switchover.
|
|
</simpara>
|
|
</note>
|
|
|
|
<sect2 id="switchover-pg-rewind" xreflabel="Switchover and pg_rewind">
|
|
<indexterm>
|
|
<primary>pg_rewind</primary>
|
|
<secondary>using with "repmgr standby switchover"</secondary>
|
|
</indexterm>
|
|
<title>Switchover and pg_rewind</title>
|
|
<para>
|
|
If the demotion candidate does not shut down smoothly or cleanly, there's a risk it
|
|
will have a slightly divergent timeline and will not be able to attach to the new
|
|
primary. To fix this situation without needing to reclone the old primary, it's
|
|
possible to use the <application>pg_rewind</application> utility, which will usually be
|
|
able to resync the two servers.
|
|
</para>
|
|
<para>
|
|
To have &repmgr; execute <application>pg_rewind</application> if it detects this
|
|
situation after promoting the new primary, add the <option>--force-rewind</option>
|
|
option.
|
|
</para>
|
|
<note>
|
|
<simpara>
|
|
If &repmgr; detects a situation where it needs to execute <application>pg_rewind</application>,
|
|
it will execute a <literal>CHECKPOINT</literal> on the new primary before executing
|
|
<application>pg_rewind</application>.
|
|
</simpara>
|
|
</note>
|
|
<para>
|
|
For more details on <application>pg_rewind</application>, see:
|
|
<ulink url="https://www.postgresql.org/docs/current/static/app-pgrewind.html">https://www.postgresql.org/docs/current/static/app-pgrewind.html</ulink>.
|
|
</para>
|
|
<para>
|
|
<application>pg_rewind</application> has been part of the core PostgreSQL distribution since
|
|
version 9.5. Users of versions 9.3 and 9.4 will need to manually install it; the source code is available here:
|
|
<ulink url="https://github.com/vmware/pg_rewind">https://github.com/vmware/pg_rewind</ulink>.
|
|
If the <application>pg_rewind</application>
|
|
binary is not installed in the PostgreSQL <filename>bin</filename> directory, provide
|
|
its full path on the demotion candidate with <option>--force-rewind</option>.
|
|
</para>
|
|
<para>
|
|
Note that building the 9.3/9.4 version of <application>pg_rewind</application> requires the PostgreSQL
|
|
source code. Also, PostgreSQL 9.3 does not provide <varname>wal_log_hints</varname>,
|
|
meaning data checksums must have been enabled when the database was initialized.
|
|
</para>
|
|
</sect2>
|
|
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="switchover-execution" xreflabel="Executing the switchover command">
|
|
<indexterm>
|
|
<primary>switchover</primary>
|
|
<secondary>execution</secondary>
|
|
</indexterm>
|
|
<title>Executing the switchover command</title>
|
|
<para>
|
|
To demonstrate switchover, we will assume a replication cluster with a
|
|
primary (<literal>node1</literal>) and one standby (<literal>node2</literal>);
|
|
after the switchover <literal>node2</literal> should become the primary with
|
|
<literal>node1</literal> following it.
|
|
</para>
|
|
<para>
|
|
The switchover command must be run from the standby which is to be promoted,
|
|
and in its simplest form looks like this:
|
|
<programlisting>
|
|
$ repmgr -f /etc/repmgr.conf standby switchover
|
|
NOTICE: executing switchover on node "node2" (ID: 2)
|
|
INFO: searching for primary node
|
|
INFO: checking if node 1 is primary
|
|
INFO: current primary node is 1
|
|
INFO: SSH connection to host "node1" succeeded
|
|
INFO: archive mode is "off"
|
|
INFO: replication lag on this standby is 0 seconds
|
|
NOTICE: local node "node2" (ID: 2) will be promoted to primary; current primary "node1" (ID: 1) will be demoted to standby
|
|
NOTICE: stopping current primary node "node1" (ID: 1)
|
|
NOTICE: issuing CHECKPOINT
|
|
DETAIL: executing server command "pg_ctl -l /var/log/postgres/startup.log -D '/var/lib/pgsql/data' -m fast -W stop"
|
|
INFO: checking primary status; 1 of 6 attempts
|
|
NOTICE: current primary has been cleanly shut down at location 0/3001460
|
|
NOTICE: promoting standby to primary
|
|
DETAIL: promoting server "node2" (ID: 2) using "pg_ctl -l /var/log/postgres/startup.log -w -D '/var/lib/pgsql/data' promote"
|
|
server promoting
|
|
NOTICE: STANDBY PROMOTE successful
|
|
DETAIL: server "node2" (ID: 2) was successfully promoted to primary
|
|
INFO: setting node 1's primary to node 2
|
|
NOTICE: starting server using "pg_ctl -l /var/log/postgres/startup.log -w -D '/var/lib/pgsql/data' restart"
|
|
NOTICE: NODE REJOIN successful
|
|
DETAIL: node 1 is now attached to node 2
|
|
NOTICE: switchover was successful
|
|
DETAIL: node "node2" is now primary
|
|
NOTICE: STANDBY SWITCHOVER is complete
|
|
</programlisting>
|
|
</para>
|
|
<para>
|
|
The old primary is now replicating as a standby from the new primary, and the
|
|
cluster status will now look like this:
|
|
<programlisting>
|
|
$ repmgr -f /etc/repmgr.conf cluster show
|
|
ID | Name | Role | Status | Upstream | Location | Connection string
|
|
----+-------+---------+-----------+----------+----------+--------------------------------------
|
|
1 | node1 | standby | running | node2 | default | host=node1 dbname=repmgr user=repmgr
|
|
2 | node2 | primary | * running | | default | host=node2 dbname=repmgr user=repmgr
|
|
</programlisting>
|
|
</para>
|
|
</sect1>
|
|
<sect1 id="switchover-caveats" xreflabel="Caveats">
|
|
<indexterm>
|
|
<primary>switchover</primary>
|
|
<secondary>caveats</secondary>
|
|
</indexterm>
|
|
<title>Caveats</title>
|
|
<para>
|
|
<itemizedlist spacing="compact" mark="bullet">
|
|
<listitem>
|
|
<simpara>
|
|
If using PostgreSQL 9.3 or 9.4, you should ensure that the shutdown command
|
|
is configured to use PostgreSQL's <varname>fast</varname> shutdown mode (the default in 9.5
|
|
and later). If relying on <command>pg_ctl</command> to perform database server operations,
|
|
you should include <literal>-m fast</literal> in <varname>pg_ctl_options</varname>
|
|
in <filename>repmgr.conf</filename>.
|
|
</simpara>
|
|
</listitem>
|
|
<listitem>
|
|
<simpara>
|
|
<command>pg_rewind</command> *requires* that either <varname>wal_log_hints</varname> is enabled, or that
|
|
data checksums were enabled when the cluster was initialized. See the
|
|
<ulink url="https://www.postgresql.org/docs/current/static/app-pgrewind.html">pg_rewind documentation</ulink>
|
|
for details.
|
|
</simpara>
|
|
</listitem>
|
|
<listitem>
|
|
<simpara>
|
|
<application>repmgrd</application> should not be running with setting <varname>failover=automatic</varname>
|
|
in <filename>repmgr.conf</filename> when a switchover is carried out, otherwise the
|
|
<application>repmgrd</application> daemon may try and promote a standby by itself.
|
|
</simpara>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
<para>
|
|
We hope to remove some of these restrictions in future versions of &repmgr;.
|
|
</para>
|
|
</sect1>
|
|
</chapter>
|