Previously the check verifying that a node has connected to its upstream
merely assumed the presence of a record in pg_stat_replication indicates
a successful replication connection. However the record may contain a
state other than "streaming", typically "startup" (which will occur when
a node has diverged from its upstream and will therefore never
transition to "streaming"), which needs to be taken into account when
considering the state of the replication connection to avoid false
positives.
repmgr creates a file with a list of tablespace files to fetch from
Barman, however the file may not actually have been flushed to disk
at the point the rsync operation was executed, so may be incomplete
or empty.
Also fix handling of tablespace remapping.
Addresses GitHub #650.
We omitted to do this with the connections used when checking the system
identifier, which means libpq calls by the teardown function using the
pointer risk using unallocated memory.
Addresses issue reported in GitHub #644.
Add a sanity check that rempgr, when remotely executed on the demotion
candidate, is able to connect as superuser. If not, emit a diagnostic
command as a hint.
It's useful to have a confirmation of which database repmgr is trying
to connect to when the -S/--superuser connection is provided.
It will always be the database defined in the repmgr.conf "conninfo"
parameter, but having the name available is useful when e.g.
troubleshooting issues with .pgpass configuration.
Output a command, which when excuted on the local node (promotion
candidate) will attempt to remotely connect to the demotion candidate
and display both the connection message encountered and the connection
parameters used.
This is useful for corner-cases where the connection normally succeeds if a
particular environment variable (e.g. PGPORT) is normally set, but is
not set in the environment where SSH is executed.
Explicitly log if a database connection failure caused the check
to fail.
It's unlikely this situation will be encountered, as the data directory
check will already have run and checked for connection failure, however
there's a small chance the connection could fail between checks.
It's possible that the remote data directory check will fail if e.g.
connection configuration is not consistent across all nodes. This
modification ensures a database error connection is reported, rather
than a spurios issue with the data directory configuration.
Commit 0574279 set the file permissions to 0600 rather than the user's
umask, but if initdb was executed with -g/--allow-group-access, the
file is maintained with 0640, so we'll just maintain the existing
permssions.
From PostgreSQL 12, the SQL-level function "pg_promote()" can be used
to promote a PostgreSQL instance, however usage is restricted to
superusers and users to whom explicit execution permission for this
function has been granted.
Therefore, if execution permission is not available, fall back to
"pg_ctl promote".
In a few places, replication connections are generated from the
parameters used by existing connections. This has resulted in a
number of similar blocks of code which do more-or-less the same
thing almost but not quite identically. In two cases, the code
omitted to set "dbname=replication", which can cause problems
in some contexts.
These code blocks have now been consolidated into standardized
functions.
This also resolves the issue addressed by GitHub #619.
Within a PostgreSQL data directory, all files should have the same
ownership as the data directory itself. PostgreSQL itself expects
this, and ownership of files by another user is likely to cause
problems.
In PostgreSQL 11 or earlier, if "recovery.conf" cannot be moved
by PostgreSQL (because e.g. it is owned by root), it will not be
possible to promote the standby to primary.
In PostgreSQL 12 and later, if "postgresql.auto.conf" on the demotion
candidate (current primary) has incorrect ownership (e.g. owned by
root), repmgr will very likely not be able to modify this file and
write the replication configuration required for the node to rejoin
the cluster as a standby.
Checks added to catch both cases before a switchover is executed.
"standby clone --recovery-conf-only" still mentioned "recovery.conf" in a
couple of places; change that to the more generic "replication configuration"
for Pg 12 and later.
This is for backbranches to prevent them running against newer
PostgreSQL versions with which they are not compatible, for example
4.4.x with PostgreSQL 12 and later.
Check the demotion candidate's registered repmgr.conf file can be found.
If the configuration file has been deleted or moved, previously the
resulting error message would have been a confusing reference to
an incorrectly configured data directory; by explicitly checking for the
expected configuration file, we can make troubleshooting easier.
Original patch by laixiong <yin.zhb@gmail.com> (GitHub #615), modified
by Ian Barwick.
Handle corner case where standby (demotion candidate) doesn't
attach during the main check cycle, but does at the final check,
where we'll want to mark the operation as successful.
There were some corner cases where "repmgr standby switchover"
would erroneously report a successful switchover, even if the
demotion candidate had not reattached to the promotion candidate.
Also improve the logging in various places to make it clearer
what is happening on which node.
Make the code previously only used by "standby follow" generally
available - we'll want to use this from "node rejoin" as well.
While we're at it, when reporting failure due to lack of free
replication slots, report the current value of "max_replication_slots".
An attempt will be made to delete an existing replication slot on the
old upstream node (this is important during e.g. a switchover operation
or when attaching a cascaded standby to a new upstream). However if the
standby is currently attached to the follow target node anyway, the
replication slot should never be deleted.
Enable operations which create or drop replication slots to be carried
out with the minimum necessary user permissions, i.e. a user with the
REPLICATION attribute.
This can be the repmgr user, or a dedicated replication user.
In the latter case, if the dedicated replication user is only
permitted to make replication connections, the streaming
replication protocol is used to create/drop slots.
Implements part of GitHub #536.