Commit Graph

327 Commits

Author SHA1 Message Date
Ian Barwick
cd3312496e Rename functions which return an LSN for clarity 2019-02-06 09:32:53 +09:00
Ian Barwick
f62b3b2868 Fix Pg10+ function names 2019-02-05 13:37:35 +09:00
Ian Barwick
701944c194 "standby promote": add check for WAL replay status if replay is paused
If WAL replay is paused but WAL is still pending replay, PostgreSQL will ignore
the promote request until WAL replay is unpaused. This may lead to the standby
being promoted at an unpredictable point in time outside of repmgr's
control. Moreover it may not be obvious that this is happening, or why, and
it will appear that an apparently successful promotion attempt has not
actually worked.

To prevent this from happening, repmgr will now refuse to promote the
standy if WAL replay is paused *and* WAL is still pending replay.

GitHub #540.
2019-02-05 13:30:37 +09:00
Ian Barwick
92c73b68a0 Clean up dbutils.c
Put functions into the same "section" as noted in the header file.
2019-02-05 09:36:54 +09:00
Ian Barwick
f9a1861ded Refactor ReplInfo struct handling
Eventually we'll want to have this contain the optional replication
info contained in the t_node_info struct, which should then contain a
pointer to a ReplInfo struct.
2019-02-02 18:39:24 +09:00
Ian Barwick
20b79f998c Define some previously magic numbers 2019-02-01 19:14:16 +09:00
Ian Barwick
bdb4f66a9d Add an Assert() to detect attempted array overflow in param_set...() functions
Previously the code would do nothing if an attempt was made to add parameters
if the array is already full.

As the array is designed to contain all valid libpq connection parameters,
there's no reason it should ever "overflow" like this. If there is, then
it means the caller is attempting to add invalid values. Add an Assert()
so we can easily detect this in the unlikely event it ever occurs.

Noted after examining the issue raised in GitHub #533, which is nonsensical
as it implies we'd be OK with writing beyond the end of the array, however
it doesn't hurt to make it a bit clearer what is happening and why.
2019-01-31 14:11:00 +09:00
Ian Barwick
32b81e7d49 "daemon start": initial implementation 2019-01-29 13:01:14 +09:00
Ian Barwick
a48d408e4e Consistently log strerror output as DETAIL 2019-01-29 12:10:55 +09:00
Ian Barwick
1980deb480 repmgrd: check for a change to the upstream node
If the upstream node has changed, for example after "repmgr standby follow"
was manually executed, restart monitoring to ensure repmgrd is monitoring the
correct node.
2019-01-22 13:33:13 +09:00
Ian Barwick
7dce3ed234 Update copyright notices to 2019 2019-01-21 14:54:35 +09:00
Ian Barwick
d4e993a240 Improve handling of connection URIs when executing remote commands
Previously, if connection URIs were in use and "repmgr standby switchover"
was executed, repmgr would pass the connection URI as-is to the demotion
candidate to execute "repmgr node rejoin". However the presence of
unescaped ampersands in the connection URI was causing the rejoin command
to be incorrectly executed.

Addresses GitHub #525.
2019-01-14 11:11:51 +09:00
Ian Barwick
40408a1734 repmgrd: check binary and extension major versions match
repmgr requires that the same "major version" (e.g. 4.3) is present
on all nodes, otherwise - particularly in the case of repmgrd - it's
highly likely things won't work as expected.

Implements part of GitHub #515.
2019-01-07 15:39:40 +09:00
Ian Barwick
313aa3c5d7 Refactor follow verification to reduce need for CHECKPOINT
A CHECKPOINT is not always required; hopefully we can narrow it down
to one corner case where we need to determine the minium recovery
location.

Also get local timeline ID via IDENTIFY_SYSTEM, as fetching it from
pg_control risks returning the prior timeline ID if the timeline
switch has just taken place and no restart point has yet occurred.
2018-12-04 15:27:22 +09:00
Ian Barwick
c53782cda3 Fix typo in query 2018-11-29 15:24:49 +09:00
Ian Barwick
66b40ffc68 Simplify function create_replication_slot()
Following the changes in 793d83b, it's no longer necessary to
pass the server version number.
2018-11-29 14:35:01 +09:00
Ian Barwick
a6a2be2239 Teach witness repmgrd to deal with the absence of a primary
Previously it would refuse to start if the primary was not reachable,
the thinking being that it's pointless trying to monitor an incomplete
cluster.

However following an aborted failover situation, repmgrd will restart
monitoring and on the witness server, this will lead to it aborting
itself due to to continuing absence of primary.

To resolve this, witness repmgrd will now start monitoring in degraded
mode if no primary is found in the hope a primary will reappear at
some point.
2018-11-29 12:15:41 +09:00
Ian Barwick
bdcc4d9e83 Check correct result status in ...primary_last_seen() functions 2018-11-29 11:08:28 +09:00
Ian Barwick
793d83b22c Refactor server version detection
Most of the time we can simply get the version number directly from
the connection handle. Previously it was held in a global variable,
which was an icky way of doing things.

In a few special cases we also need the actual version string, which
is obtained directly from the database.
2018-11-22 21:30:31 +09:00
Ian Barwick
0f4e04e61e Add function get_current_lsn()
This is a somewhat convoluted attempt to retrieve the current LSN
of any node, regardless of whether in recovery or not, and if in
recovery, independent of whether streaming or recovering from
archive.
2018-11-22 19:31:49 +09:00
Ian Barwick
80a280cbf4 Add function get_timeline_history()
This will be required for verifying whether one node is able to
follow another node.
2018-11-22 15:26:50 +09:00
Ian Barwick
784c9c4793 repmgrd: return predictable default values for get_primary_last_seen()
Return 0 if the node is not in recovery. In which case it's probably
rather pointless calling this function anyway.

Return -1 if the "last_seen" field has never been set (i.e. repmgrd
hasn't started yet).
2018-11-21 11:30:32 +09:00
Ian Barwick
0caec90d81 repmgrd: set primary last seen 2018-11-21 11:30:27 +09:00
Ian Barwick
c3bc5585d9 Add sanity check for extension version
This should cover the cases where the "repmgr" extension was installed
manually but not updated, or an upgrade was not fully completed.
2018-10-31 11:16:36 +09:00
Ian Barwick
c336e384ab Support "pg_promote()" function (PostgreSQL 12 and later)
This is an experimental feature.
2018-10-26 11:02:45 +09:00
Ian Barwick
a459c60145 Avoid defining variable-length arrays
As of PostgreSQL commit d9dd406f, variable length arrays are no longer
permitted. As they're not actually required anyway, just define appropriate
constants.

Also noted in GitHub #510.
2018-10-26 10:09:45 +09:00
Ian Barwick
3e38759c02 use appendPQExpBufferStr/-Char() consistently 2018-10-04 08:42:42 +09:00
Ian Barwick
2491b8ae52 Add functionality to "pause" repmgrd
In some circumstances, e.g. while performing a switchover, it is essential
that repmgrd does not take any kind of failover action, as this will put
the cluster into an incorrect state.

Previously it was necessary to stop repmgrd on all nodes (or at least
those nodes which repmgrd would consider as promotion candidates), however
this is a cumbersome and potentially risk-prone operation, particularly if the
replication cluster contains more than a couple of servers.

To prevent this issue from occurring, this patch introduces the ability
to "pause" repmgrd on all nodes wth a single command ("repmgr daemon pause")
which notifies repmgrd not to take any failover action until the node
is "unpaused" ("repmgr daemon unpause").

"repmgr daemon status" provides an overview of each node and whether repmgrd
is running, and if so whether it is paused.

"repmgr standby switchover" has been modified to automatically pause repmgrd
while carrying out the switchover.

See documentation for further details.
2018-09-27 16:42:10 +09:00
Ian Barwick
688337dec3 repmgr: add "--node-id" option to "cluster cleanup"
Implements GitHub #493.
2018-09-25 15:56:40 +09:00
Ian Barwick
b0a2ee2259 get_all_node_records(): display any error encountered and return success status
In many cases we'll want to bail out with an error if the node list can't
be retrieved for any reason. This saves some repetitive coding.
2018-09-13 10:14:43 +09:00
Ian Barwick
17e75f6b31 repmgrd: improve reconnection handling
Previously, if the server being monitored was not available, repmgrd
would always close the existing connection handle and open a new one.

However, in some cases, e.g. a brief network outage, the existing
connection handle is still good and does not need to be reopened.

This could be particularly problematic if monitoring_history is on,
as this risks leaving orphan sessions on the primary which (given
a sufficiently unstable network) could lead to all available backends
being occupied.

Instead, during an outage we now use a new connection to verify
the server is accessible; if the old connection is still available
(e.g. following a short network interruption) we continue using that;
if  not (e.g. the server was restarted), we use the new one.
2018-08-30 15:46:08 +09:00
Ian Barwick
ceeb6d7130 repmgrd: improve monitoring statistics logging
Add more granular logging to help diagnose issues, and also keep track
of when the last monitoring statistics update was set and emit that
as DETAIL every time we emit a log status update.
2018-08-30 12:36:59 +09:00
Ian Barwick
3573950425 Add additional query error logging
It's unlikely we'll get an error in these cases, but you never know.

Also, with queries which return a list of node records, it's necessary
to call _populate_node_records() even if the query fails, so a properly
initalised, albeit empty list is returned to the caller.
2018-08-29 10:25:43 +09:00
Ian Barwick
c1586e39b7 Log text of failed queries at log level ERROR
Previously query texts were always logged at log level DEBUG, but
that doesn't help much in a normal production environment when
trying to identify the cause of issues.

Also make various other minor improvements to query logging and
handling of database errors.

Implements GitHub #498.
2018-08-29 10:08:52 +09:00
Ian Barwick
e1e59e85d7 repmgr: add "cluster_cleanup" event
GitHub #492.
2018-08-24 09:20:05 +09:00
Ian Barwick
34c4f4c3f8 repmgr: truncate version string if necessary
Some distributions may add extra information to PG_VERSION after
the actual version number (e.g. "10.4 (Debian 10.4-2.pgdg90+1)"), so
copy the version number string up until the first space is found.

GitHub #490.
2018-08-14 09:55:23 +09:00
Ian Barwick
44a224ad92 repmgrd: fix configuration file reloading
Don't allow "promote_command" or "follow_command" to be empty.

GitHub #486.
2018-08-02 16:35:26 +09:00
Ian Barwick
7ecfb333b9 doc: add note about switchover and exclusive backups
Also rename server_not_in_exclusive_backup_mode() to avoid double
negatives.

GitHub #476.
2018-07-19 16:02:31 +09:00
Martín Marqués
8f13a66aaa Check that there is no exclusive backup taking place while we perform
a switchover.

We've found that this can cause some issues with postgres control
metadata (could be a postgres bug) so best thing is *not* no switchover
if there's a backup taking place.

It's also a bad idea from an architectual point of view, as a switchover
is supposed to be planed, so why perform it when we are taking backups.

GitHub #476.
2018-07-19 16:02:21 +09:00
Ian Barwick
ef35d071bf Fix is_active_bdr_node() query for BDR 2.x
Copy/paste error when adapting the query for BDR 3.x.
2018-07-19 09:50:30 +09:00
Ian Barwick
7decc7975f Fix BDR version check
repgexp_match() is only available from PostgreSQL 10 and later.
2018-07-18 10:54:16 +09:00
Ian Barwick
fcf237fe31 node status: improve output and documentation
In the default text output mode, list inactive slots.

In CSV output mode, list inactive slots as additional information;
add output line with number of missing slots and a list thereof.

Also document --csv output mode.
2018-06-22 11:46:50 +09:00
Ian Barwick
0f97a98f28 repmgr: don't count witness node as a standby when running "node status"
Addresses GitHub #451.
2018-06-21 13:06:18 +09:00
Ian Barwick
836d2125fe Improve BDR3 node query
We can get everything we need from bdr.node_summary
2018-06-15 14:30:06 +09:00
Ian Barwick
bf0d67c60a Add repmgr.nodes to the BDR replication set 2018-06-15 14:29:08 +09:00
Ian Barwick
108c3a36fb Enable creation of repmgr extension on BDR3 node 2018-06-15 14:26:47 +09:00
Ian Barwick
8377704596 Convert BDR query functions to handle BDR2/BDR3 2018-06-15 14:26:07 +09:00
Ian Barwick
4f642f8332 Detect and store BDR major version number when executing "is_bdr_db()"
BDR3 metadata structure is very different to BDR1/2, so we'll need to
generate queries according to version.
2018-06-15 14:25:55 +09:00
Ian Barwick
bcab4bc391 _create_event(): log event and node ID for debugging 2018-06-12 10:30:30 +09:00
Ian Barwick
b1b49748a7 "config_file" is MAXPGPATH, not MAXLEN
The two values are the same anyway, so change is more for consistency.
2018-05-24 15:52:57 +09:00