If the upstream node has changed, for example after "repmgr standby follow"
was manually executed, restart monitoring to ensure repmgrd is monitoring the
correct node.
Previously, if connection URIs were in use and "repmgr standby switchover"
was executed, repmgr would pass the connection URI as-is to the demotion
candidate to execute "repmgr node rejoin". However the presence of
unescaped ampersands in the connection URI was causing the rejoin command
to be incorrectly executed.
Addresses GitHub #525.
repmgr requires that the same "major version" (e.g. 4.3) is present
on all nodes, otherwise - particularly in the case of repmgrd - it's
highly likely things won't work as expected.
Implements part of GitHub #515.
A CHECKPOINT is not always required; hopefully we can narrow it down
to one corner case where we need to determine the minium recovery
location.
Also get local timeline ID via IDENTIFY_SYSTEM, as fetching it from
pg_control risks returning the prior timeline ID if the timeline
switch has just taken place and no restart point has yet occurred.
Previously it would refuse to start if the primary was not reachable,
the thinking being that it's pointless trying to monitor an incomplete
cluster.
However following an aborted failover situation, repmgrd will restart
monitoring and on the witness server, this will lead to it aborting
itself due to to continuing absence of primary.
To resolve this, witness repmgrd will now start monitoring in degraded
mode if no primary is found in the hope a primary will reappear at
some point.
Most of the time we can simply get the version number directly from
the connection handle. Previously it was held in a global variable,
which was an icky way of doing things.
In a few special cases we also need the actual version string, which
is obtained directly from the database.
This is a somewhat convoluted attempt to retrieve the current LSN
of any node, regardless of whether in recovery or not, and if in
recovery, independent of whether streaming or recovering from
archive.
Return 0 if the node is not in recovery. In which case it's probably
rather pointless calling this function anyway.
Return -1 if the "last_seen" field has never been set (i.e. repmgrd
hasn't started yet).
As of PostgreSQL commit d9dd406f, variable length arrays are no longer
permitted. As they're not actually required anyway, just define appropriate
constants.
Also noted in GitHub #510.
In some circumstances, e.g. while performing a switchover, it is essential
that repmgrd does not take any kind of failover action, as this will put
the cluster into an incorrect state.
Previously it was necessary to stop repmgrd on all nodes (or at least
those nodes which repmgrd would consider as promotion candidates), however
this is a cumbersome and potentially risk-prone operation, particularly if the
replication cluster contains more than a couple of servers.
To prevent this issue from occurring, this patch introduces the ability
to "pause" repmgrd on all nodes wth a single command ("repmgr daemon pause")
which notifies repmgrd not to take any failover action until the node
is "unpaused" ("repmgr daemon unpause").
"repmgr daemon status" provides an overview of each node and whether repmgrd
is running, and if so whether it is paused.
"repmgr standby switchover" has been modified to automatically pause repmgrd
while carrying out the switchover.
See documentation for further details.
Previously, if the server being monitored was not available, repmgrd
would always close the existing connection handle and open a new one.
However, in some cases, e.g. a brief network outage, the existing
connection handle is still good and does not need to be reopened.
This could be particularly problematic if monitoring_history is on,
as this risks leaving orphan sessions on the primary which (given
a sufficiently unstable network) could lead to all available backends
being occupied.
Instead, during an outage we now use a new connection to verify
the server is accessible; if the old connection is still available
(e.g. following a short network interruption) we continue using that;
if not (e.g. the server was restarted), we use the new one.
Add more granular logging to help diagnose issues, and also keep track
of when the last monitoring statistics update was set and emit that
as DETAIL every time we emit a log status update.
It's unlikely we'll get an error in these cases, but you never know.
Also, with queries which return a list of node records, it's necessary
to call _populate_node_records() even if the query fails, so a properly
initalised, albeit empty list is returned to the caller.
Previously query texts were always logged at log level DEBUG, but
that doesn't help much in a normal production environment when
trying to identify the cause of issues.
Also make various other minor improvements to query logging and
handling of database errors.
Implements GitHub #498.
Some distributions may add extra information to PG_VERSION after
the actual version number (e.g. "10.4 (Debian 10.4-2.pgdg90+1)"), so
copy the version number string up until the first space is found.
GitHub #490.
a switchover.
We've found that this can cause some issues with postgres control
metadata (could be a postgres bug) so best thing is *not* no switchover
if there's a backup taking place.
It's also a bad idea from an architectual point of view, as a switchover
is supposed to be planed, so why perform it when we are taking backups.
GitHub #476.
In the default text output mode, list inactive slots.
In CSV output mode, list inactive slots as additional information;
add output line with number of missing slots and a list thereof.
Also document --csv output mode.
Usually a seperate user (typically "repmgr") is set up specifically to manage
the repmgr metadata, however there's no compelling requirement to do this, and
it's possible the database owner (usually: "postgres") will be used, in which
case it's possible the username will be left out of the conninfo string.
Addresses GitHub #437.
If repmgrd marks the local node as unavailable, and it was actually
restarting but a failover event occured before the next local node
check, failover will continue with the stale connection handle.
Add a final local node check just before starting the failover
process, so repmgrd can reconnect if it wasn't able to before.
If monitoring history not in use, there's no activity on the standby's
connection handle, so if e.g. the standby is restarted, PQstatus()
never returns CONNECTION_BAD and repmgrd never notices the connection
is stale. Therefore execute a throw-away statement at "monitor_interval_secs".
When establishing a superuser connection, the connection parameters
were being copied from the existing (non-superuser) connection, which
in some circumstances can lead to that user's password being
included in the copied parameter list. The password parameter, if set, will
now always be removed, which will cause libpq to retrieve the correct
one from the .pgpass file.
Addresses GitHub #400.
For events generated by these commands, it may be useful to know details
of the primary node. This makes following additional parameters available
to event notification scripts:
- %p: node ID of the primary
- %a: node name of the primary
- %c: conninfo string for the primary
Implements GitHub #375