As that's what we really want to know. Also return "UNCLEAN_SHUTDOWN"
if that's the case, rather than "RUNNING", which is confusing even
though it's a command for internal use.
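A minimal sketch of the idea (type and function names here are
hypothetical; repmgr's actual check is presumably derived from the state
recorded in pg_control):

    /* Sketch only: illustrative status mapping, not repmgr's actual code. */
    #include <stdio.h>

    typedef enum
    {
        NODE_STATUS_RUNNING,
        NODE_STATUS_SHUTDOWN,
        NODE_STATUS_UNCLEAN_SHUTDOWN  /* control file says "in production", server down */
    } NodeStatus;

    static const char *
    node_status_str(NodeStatus status)
    {
        switch (status)
        {
            case NODE_STATUS_RUNNING:
                return "RUNNING";
            case NODE_STATUS_SHUTDOWN:
                return "SHUTDOWN";
            case NODE_STATUS_UNCLEAN_SHUTDOWN:
                return "UNCLEAN_SHUTDOWN";
        }
        return "UNKNOWN";
    }

    int
    main(void)
    {
        /*
         * A node whose control file still claims "in production" while no
         * postmaster is running was shut down uncleanly; report that
         * explicitly rather than the misleading "RUNNING".
         */
        printf("%s\n", node_status_str(NODE_STATUS_UNCLEAN_SHUTDOWN));
        return 0;
    }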
In an automatic failover situation, after a standby has been promoted
there's a risk the original primary may become available again before
"standby follow" is issued on another standby node, in which case "standby
follow" will reconnect to the original primary.
As the standby's repmgrd will have received a notification from the new
primary, it will know the primary's ID and can therefore explicitly
direct "standby follow" to follow that primary.
It's possible some distribution packages may assign a different name to the
"repmgr" binary (typically by appending a version number), so we shouldn't
hard-code the binary name into the command string.
However, it's reasonable to assume the "repmgr" binary will have the same name
across a replication cluster, so we won't engage in any contortions to account
for possible variations.
See: https://github.com/2ndQuadrant/repmgr/issues/323
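One possible approach, sketched here, is to derive the name from argv[0]
and reuse it when building remote command strings (helper name
hypothetical):

    /*
     * Sketch: remember how this binary was invoked so that remote commands
     * reuse the same name rather than a hard-coded "repmgr".
     */
    #include <stdio.h>
    #include <string.h>

    /* return the trailing path component of argv[0] */
    static const char *
    get_progname(const char *argv0)
    {
        const char *slash = strrchr(argv0, '/');

        return (slash != NULL) ? slash + 1 : argv0;
    }

    int
    main(int argc, char **argv)
    {
        char remote_command[512];

        (void) argc;
        snprintf(remote_command, sizeof(remote_command),
                 "%s standby follow", get_progname(argv[0]));
        printf("%s\n", remote_command);
        return 0;
    }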
This is a remnant of the early repmgr days when there were no alternative
mechanisms for ensuring sufficient WAL remains available while cloning a
standby.
The purpose of this setting was to override a check for an (arbitrary)
minimum setting for "wal_keep_segments". As there's no reliable way
of determining a sensible value for this, and as improvements in
pg_basebackup mean WAL can be streamed (possibly via a replication
slot) while the backup is in progress, there's no point in keeping
this around.
We will, however, still emit a warning suggesting "wal_keep_segments"
be set if it is not set and the configuration doesn't appear to provide
any other way of ensuring WAL remains available during and after the
cloning process.
There are now too many options to sensibly fit into general --help
output; we'll add separate output for each repmgr command, e.g.
"repmgr node --help".
Previously repmgr would write all the default libpq parameters
into "primary_conninfo" on "standby clone", but not for
"standby follow", which is inconsistent.
For repmgr4 we'll require that the upstream node's conninfo
be canonical and contain all required connection parameters,
even if these are available as defaults or environment variables
in the local environment, as those are transient and may not
be available in all environments/situations.
recovery.conf's "primary_conninfo" will be generated using the
upstream's conninfo parameters, except for those specific
to the downstream node. These are:
- "application_name": this will always be set to the
"node_name" of the downstream node
- "passfile" and "servicefile": these, must of course
reference files on the downstream node so will be extracted
from the downstream node's conninfo, if set
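A sketch of the substitution using libpq's PQconninfoParse(); quoting of
values containing spaces is omitted, and the conninfo string and node
names are illustrative:

    /*
     * Sketch: generate "primary_conninfo" from the upstream node's conninfo,
     * substituting the downstream-specific parameters.
     */
    #include <stdio.h>
    #include <string.h>
    #include <libpq-fe.h>

    int
    main(void)
    {
        const char *upstream_conninfo =
            "host=node1 port=5432 user=repmgr dbname=repmgr application_name=node1";
        const char *downstream_node_name = "node2";

        char             *errmsg = NULL;
        PQconninfoOption *opts = PQconninfoParse(upstream_conninfo, &errmsg);
        PQconninfoOption *opt;
        char              primary_conninfo[1024] = "";

        if (opts == NULL)
        {
            fprintf(stderr, "unable to parse conninfo: %s\n",
                    errmsg ? errmsg : "");
            if (errmsg)
                PQfreemem(errmsg);
            return 1;
        }

        for (opt = opts; opt->keyword != NULL; opt++)
        {
            /* skip parameters not set in the upstream conninfo */
            if (opt->val == NULL || opt->val[0] == '\0')
                continue;

            /* downstream-specific parameters are handled separately */
            if (strcmp(opt->keyword, "application_name") == 0 ||
                strcmp(opt->keyword, "passfile") == 0 ||
                strcmp(opt->keyword, "servicefile") == 0)
                continue;

            snprintf(primary_conninfo + strlen(primary_conninfo),
                     sizeof(primary_conninfo) - strlen(primary_conninfo),
                     "%s=%s ", opt->keyword, opt->val);
        }

        /* "application_name" is always the downstream node's "node_name" */
        snprintf(primary_conninfo + strlen(primary_conninfo),
                 sizeof(primary_conninfo) - strlen(primary_conninfo),
                 "application_name=%s", downstream_node_name);

        printf("primary_conninfo = '%s'\n", primary_conninfo);

        PQconninfoFree(opts);
        return 0;
    }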
When executing repmgr on remote nodes, we otherwise end up jumping
through hoops, as we can't make assumptions about where the configuration
file is located but really do need to be able to provide it.
From a support point of view this will also make life easier, as it will
be possible to specify exactly which file to provide.
"standby follow" was originally co-opted to start up a demoted node;
this functionality is now delegated to "node rejoin", with the core
functionality of "standby follow" implemented as an internal function.
If the current primary (demotion candidate) still has WAL files awaiting
archiving, it will delay its shutdown until they are all archived. If there is a
substantial number of files, and/or the archive command executes slowly,
this will probably lead to an unwelcome delay in the switchover process.
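One way to anticipate such a delay, sketched here, is to count the
"*.ready" files in the archive_status directory (found under "pg_wal"
from PostgreSQL 10, "pg_xlog" before that; the path below is
illustrative):

    /*
     * Sketch: count WAL segments still awaiting archiving before shutting
     * down the demotion candidate.
     */
    #include <stdio.h>
    #include <string.h>
    #include <dirent.h>

    int
    main(void)
    {
        const char    *archive_status_dir =
            "/var/lib/postgresql/data/pg_wal/archive_status";
        DIR           *dir = opendir(archive_status_dir);
        struct dirent *ent;
        int            pending = 0;

        if (dir == NULL)
        {
            fprintf(stderr, "unable to open \"%s\"\n", archive_status_dir);
            return 1;
        }

        while ((ent = readdir(dir)) != NULL)
        {
            size_t len = strlen(ent->d_name);

            /* each "<walfile>.ready" entry is a segment awaiting archiving */
            if (len > 6 && strcmp(ent->d_name + len - 6, ".ready") == 0)
                pending++;
        }

        closedir(dir);

        if (pending > 0)
            printf("NOTICE: %i file(s) pending archiving; shutdown may be delayed\n",
                   pending);
        return 0;
    }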
The repmgr3 implementation required the promotion candidate (standby)
to work directly with the demotion candidate's data directory,
directly execute server control commands, etc.
Here we delegate much more of that work to the repmgr instance on the
demotion candidate, which reduces the amount of back-and-forth
over SSH and generally makes things cleaner and smoother.
In particular, the repmgr on the demotion candidate will carry
out a thorough check that the node is shut down and report
the last checkpoint LSN to the promotion candidate; this
can then be used to determine whether pg_rewind needs to be
executed on the demoted primary before reintegrating it back
into the cluster (todo).
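A sketch of the kind of LSN comparison involved; the LSN values, and the
exact point the checkpoint LSN is compared against, are illustrative:

    /*
     * Sketch: compare the demotion candidate's last checkpoint LSN (reported
     * by its repmgr after shutdown) against the promotion candidate's
     * promotion point.
     */
    #include <stdio.h>
    #include <stdint.h>

    typedef uint64_t XLogRecPtr;

    /* parse an LSN of the form "X/X" into a 64-bit value */
    static XLogRecPtr
    parse_lsn(const char *str)
    {
        unsigned int hi = 0,
                     lo = 0;

        sscanf(str, "%X/%X", &hi, &lo);
        return ((XLogRecPtr) hi << 32) | lo;
    }

    int
    main(void)
    {
        /* reported by the repmgr on the shut-down demotion candidate */
        XLogRecPtr demotion_checkpoint = parse_lsn("0/9000060");

        /* LSN at which the promotion candidate was promoted */
        XLogRecPtr promotion_point = parse_lsn("0/9000028");

        if (demotion_checkpoint > promotion_point)
            printf("old primary is ahead of the promotion point; "
                   "pg_rewind required before rejoining\n");
        else
            printf("old primary can follow the new primary directly\n");

        return 0;
    }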
Also implement "--dry-run" for this action, which will sanity-check the
nodes as far as possible without executing the switchover.
Additionally, some of the new repmgr node commands (or command options)
introduced for this can also be executed by the user to obtain
additional information about the status of each node.
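A minimal sketch of the "--dry-run" pattern - run every check, report
the outcome, and stop short of acting (the check functions are
hypothetical stubs):

    /*
     * Sketch of "--dry-run": all checks run and are reported, but the
     * switchover itself is never initiated.
     */
    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    /* hypothetical stand-ins for the real pre-switchover checks */
    static bool
    check_ssh_connection(void)
    {
        return true;
    }

    static bool
    check_demotion_candidate(void)
    {
        return true;
    }

    int
    main(int argc, char **argv)
    {
        bool dry_run = (argc > 1 && strcmp(argv[1], "--dry-run") == 0);

        if (!check_ssh_connection() || !check_demotion_candidate())
        {
            fprintf(stderr, "ERROR: prerequisites for switchover not met\n");
            return 1;
        }

        if (dry_run)
        {
            printf("INFO: prerequisites met; switchover would proceed\n");
            return 0;
        }

        /* ... perform the actual switchover here ... */
        return 0;
    }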
There are some circumstances, e.g. during switchover operations,
where repmgr may need to operate on a data directory while the
server isn't running, in which case there's no way to retrieve
that information.
Also add a configuration file option "pgdata" for hard-coding the
node's data directory - if the "repmgr" DB user isn't a superuser
or doesn't have permission to retrieve the data directory location,
we'll need another way of determining it.
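A sketch of the fallback logic, assuming a "pgdata" value read from
repmgr.conf; the connection string is illustrative, and
"SHOW data_directory" requires a running server and sufficient
permissions:

    /*
     * Sketch: prefer the "pgdata" configuration setting; otherwise fall
     * back to asking the (running) server.
     */
    #include <stdio.h>
    #include <libpq-fe.h>

    int
    main(void)
    {
        const char *pgdata_config = "";  /* "pgdata" value from repmgr.conf, if any */
        char        data_directory[1024] = "";

        if (pgdata_config[0] != '\0')
        {
            /*
             * Hard-coded in the configuration file: usable even when the
             * server is down, e.g. mid-switchover.
             */
            snprintf(data_directory, sizeof(data_directory), "%s", pgdata_config);
        }
        else
        {
            PGconn   *conn = PQconnectdb("host=localhost user=repmgr dbname=repmgr");
            PGresult *res;

            if (PQstatus(conn) != CONNECTION_OK)
            {
                fprintf(stderr, "unable to connect: %s", PQerrorMessage(conn));
                PQfinish(conn);
                return 1;
            }

            res = PQexec(conn, "SHOW data_directory");
            if (PQresultStatus(res) == PGRES_TUPLES_OK)
                snprintf(data_directory, sizeof(data_directory), "%s",
                         PQgetvalue(res, 0, 0));

            PQclear(res);
            PQfinish(conn);
        }

        printf("data directory: %s\n", data_directory);
        return 0;
    }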
This is needed for better switchover control, so we can instruct
the remote repmgr to issue the appropriate server command rather
than trying to work out what it should be from the local node.