In a few places, replication connections are generated from the
parameters used by existing connections. This has resulted in a
number of similar blocks of code which do more-or-less the same
thing almost but not quite identically. In two cases, the code
omitted to set "dbname=replication", which can cause problems
in some contexts.
These code blocks have now been consolidated into standardized
functions.
This also resolves the issue addressed by GitHub #619.
Within a PostgreSQL data directory, all files should have the same
ownership as the data directory itself. PostgreSQL itself expects
this, and ownership of files by another user is likely to cause
problems.
In PostgreSQL 11 or earlier, if "recovery.conf" cannot be moved
by PostgreSQL (because e.g. it is owned by root), it will not be
possible to promote the standby to primary.
In PostgreSQL 12 and later, if "postgresql.auto.conf" on the demotion
candidate (current primary) has incorrect ownership (e.g. owned by
root), repmgr will very likely not be able to modify this file and
write the replication configuration required for the node to rejoin
the cluster as a standby.
Checks added to catch both cases before a switchover is executed.
"standby clone --recovery-conf-only" still mentioned "recovery.conf" in a
couple of places; change that to the more generic "replication configuration"
for Pg 12 and later.
In a couple of places ("cluster show" and "service status") we printed
connection status errors as a "WARNING" to stdout, followed by a log
HINT, but the latter was ineffective unless the configured log
level happened to match the level of the most recently emitted log
line (which will most likely be DEBUG).
Convert the printed WARNING lines to an actual log WARNING, to make the
behaviour of this output behave as expected.
This is for backbranches to prevent them running against newer
PostgreSQL versions with which they are not compatible, for example
4.4.x with PostgreSQL 12 and later.
Check the demotion candidate's registered repmgr.conf file can be found.
If the configuration file has been deleted or moved, previously the
resulting error message would have been a confusing reference to
an incorrectly configured data directory; by explicitly checking for the
expected configuration file, we can make troubleshooting easier.
Original patch by laixiong <yin.zhb@gmail.com> (GitHub #615), modified
by Ian Barwick.
Handle corner case where standby (demotion candidate) doesn't
attach during the main check cycle, but does at the final check,
where we'll want to mark the operation as successful.
There were some corner cases where "repmgr standby switchover"
would erroneously report a successful switchover, even if the
demotion candidate had not reattached to the promotion candidate.
Also improve the logging in various places to make it clearer
what is happening on which node.
Make the code previously only used by "standby follow" generally
available - we'll want to use this from "node rejoin" as well.
While we're at it, when reporting failure due to lack of free
replication slots, report the current value of "max_replication_slots".
Usually repmgrd requires the parameters "promote_command" and
"follow_command" to be present in the configuration file. These are
not required if "failover=manual", but the configuration sanity check
following receipt of SIGHUP was not checking that.
Addresses issue reported in GitHub #614.
An attempt will be made to delete an existing replication slot on the
old upstream node (this is important during e.g. a switchover operation
or when attaching a cascaded standby to a new upstream). However if the
standby is currently attached to the follow target node anyway, the
replication slot should never be deleted.