In some circumstances, e.g. while performing a switchover, it is essential
that repmgrd does not take any kind of failover action, as this will put
the cluster into an incorrect state.
Previously it was necessary to stop repmgrd on all nodes (or at least
those nodes which repmgrd would consider as promotion candidates), however
this is a cumbersome and potentially risk-prone operation, particularly if the
replication cluster contains more than a couple of servers.
To prevent this issue from occurring, this patch introduces the ability
to "pause" repmgrd on all nodes wth a single command ("repmgr daemon pause")
which notifies repmgrd not to take any failover action until the node
is "unpaused" ("repmgr daemon unpause").
"repmgr daemon status" provides an overview of each node and whether repmgrd
is running, and if so whether it is paused.
"repmgr standby switchover" has been modified to automatically pause repmgrd
while carrying out the switchover.
See documentation for further details.
If any issues are detected (e.g. node not reachable, unexpected node status
etc.), "repmgr cluster show" returns exit code 25 ("ERR_NODE_STATUS").
Note that exit code 25 was introduced recently as "ERR_CLUSTER_CHECK",
however it makes sense to use this to indicate issues detected by any
command which can detect node issues.
Addresses GitHub #456.
Previously repmgr was relying on whatever command was configured to
start PostgreSQL to determine whether the node being rejoined had
started correctly. However it's preferable to actively poll the upstream
to confirm it has restarted and actually attached as a standby before
confirming success of the "node rejoin" action.
This can be overridden with the -W/--no-wait option.
(Note that for consistency with other PostgreSQL utilities, the
short form of the --wait option is now "-w"; this is currently
only used in "repmgr standby follow".)
Also update "repmgr node rejoin" documentation with a list of supported
options, and add some useful index entries for "pg_rewind".
Implements GitHub #415.
Previously, if an issue was encountered with the old primary, but user
provided -F/--force to have repmgr promote the standby anyway, repmgr
would exit with the log message "STANDBY SWITCHOVER is complete"
and exit code 0 (SUCCESS).
To better report this partial completion, repmgr will now emit the message
"STANDBY SWITCHOVER has completed with issues" (and a HINT to check preceding
log messages) and new exit code 22 (ERR_SWITCHOVER_INCOMPLETE).
The repmgr3 implementation required the promotion candidate (standby)
to directly work with the demotion candidate's data directory,
directly execute server control commands etc.
Here we delegated a lot more of that work to the repmgr on the
demotion candidate, which reduces the amount of back-and-forth
over SSH and generally makes things cleaner and smoother.
In particular the repmgr on the demotion candidate will carry
out a thorough check that the node is shut down and report
the last checkpoint LSN to the promotion candidate; this
can then be used to determine whether pg_rewind needs to be
executed on the demoted primary before reintegrating it back
into the cluster (todo).
Also implement "--dry-run" for this action, which will sanity-check the
nodes as far as possible without executing the switchover.
Additionally some of the new repmgr node commands (or command options)
introduced for this can be also executed by the user to obtain
additional information about the status of each node.
In some cases, the monitored upstream may not be available for a while
(e.g. network split), in which case it makes sense to have repmgrd
keep running and trying to reconnect. Previously it would just keel
over and quit.