Commit Graph

128 Commits

Author SHA1 Message Date
Ian Barwick
e2a362a171 "node rejoin": check for available replication slots on the rejoin target
"standby follow" did this already, but "node rejoin" didn't.
2020-02-03 17:03:20 +09:00
Ian Barwick
4d4ed3bcd6 Remove BDR 2.x support
The BDR 2.x support was conceptual only and was never used in
production. As BDR 2.x will be EOL'd shortly, there is no risk it will
be needed.
2020-01-16 09:52:42 +09:00
Ian Barwick
7fdf2f1778 Update copyright notices to 2020 2020-01-13 14:06:20 +09:00
Ian Barwick
f5044465cb Add function to safely modify postgresql.auto.conf
This is required for PostgreSQL 12 and later.
2019-08-14 16:57:42 +09:00
Ian Barwick
4ebc43fd63 Clean up variable usage in do_node_status()
Variable with the same name existed both at function level and within
local code blocks.
2019-08-14 14:15:41 +09:00
Ian Barwick
38b373e6df "node check": check role membership when trying to read pg_settings
From PostgreSQL 10, a member of the default roles "pg_monitor" and/or
"pg_read_all_settings" can read pg_settings without requiring superuser
privileges.

Previously, a hint was being emitted about making the repmgr user a
member of one of those groups, but no check for membership was being
made, meaning the check could only be run by a superuser.
2019-08-07 14:26:48 +09:00
Ian Barwick
3c8bab97d8 Fix variable declarations 2019-05-22 17:26:34 +09:00
Ian Barwick
dd78a16006 Change return type of is_downstream_node_attached() from bool to NodeAttached
This enables us to better determine whether a node is definitively
attached, definitively not attached, or if it was not possible to
determine the attached state.
2019-05-14 15:57:20 +09:00
Ian Barwick
52905f1eb3 Standardize on "ID: %i" when logging node IDs
Previously there was a mix of "id:", "node id:", "node ID:" and "node_id:".
2019-04-30 17:07:33 +09:00
Ian Barwick
bb42d8cba6 Fix calculation of maximum filename length 2019-03-28 12:40:29 +09:00
Ian Barwick
fe822a9eea Prevent potential file descriptor resource leak 2019-03-28 12:29:10 +09:00
Ian Barwick
03cd5a6028 Put closedir call in correct location 2019-03-28 12:08:42 +09:00
Ian Barwick
1e1c596446 Add various missing close() calls 2019-03-28 11:32:25 +09:00
Ian Barwick
314a1e8f4f use a constant to denote unknown replication lag 2019-03-20 17:26:04 +09:00
Ian Barwick
19bf4d7434 Count witness and zero-priority nodes in visibility check 2019-03-14 11:17:51 +09:00
Ian Barwick
9823978f41 repmgrd: don't wait for WAL receiver to reconnect during failover
If the WAL receiver has been temporarily disabled, we don't want to
wait for it to start up as it may not be able to at that point; we do
however need to reset "wal_retrieve_retry_interval".
2019-03-06 15:54:56 +09:00
Ian Barwick
1615353f48 repmgrd: optionally disconnect WAL receivers during failover
This is intended to ensure that all nodes have a constant LSN while
making the failover decision.

This feature is experimental and needs to be explicitly enabled with the
configuration file option "standby_disconnect_on_failover".

Note enabling this option will result in a delay in the failover decision
until the WAL receiver is disconnected on all nodes.
2019-03-06 15:53:57 +09:00
Ian Barwick
71d151ca87 Don't check status of logical replication slots
We only want to check the status of physical replication slots
to determine whether a streaming replication standby has become
detached and there is therefore a risk of uncontrolled WAL buildup
on the local node.

It's not feasible to second-guess the state of logical replication
slots.
2019-02-23 10:09:43 +09:00
Ian Barwick
de70fd42dc node check: simplify output generation in --is-shutdown-cleanly check 2019-02-22 10:49:06 +09:00
Ian Barwick
aeb9639ed9 node rejoin: add more log detail during rejoin success check
Stating what is actually being checked where might be useful
when diagnosing potential issues.
2019-02-13 15:29:39 +09:00
Ian Barwick
790bec21dd node rejoin: handle case where node to rejoin was primary
In that case the minRecoveryPoint* fields may be empty.
2019-02-12 13:31:25 +09:00
Ian Barwick
a0dc673439 "node rejoin": use minRecoveryPointTLI for comparing timelines 2019-02-12 13:31:21 +09:00
Ian Barwick
cce8b76171 "standby switchover": abort if promotion candidate has WAL replay paused
If replay is paused, we can't be really sure that more WAL will be received
between the check and the promote operation, which would risk the promote
operation not taking place during the switchover (it would happen
as soon as WAL replay is resumed and pending WAL is replayed).

Therefore we simply quit with an informative slew of messages and
leave the user to sort it out.

GitHub #540.
2019-02-05 16:32:39 +09:00
Ian Barwick
f9a1861ded Refactor ReplInfo struct handling
Eventually we'll want to have this contain the optional replication
info contained in the t_node_info struct, which should then contain a
pointer to a ReplInfo struct.
2019-02-02 18:39:24 +09:00
Ian Barwick
20b79f998c Define some previously magic numbers 2019-02-01 19:14:16 +09:00
Ian Barwick
64bb034d34 "node rejoin": catch corner case where repmgr metadata is outdated
If the rejoin target is not in recovery, but not registered as primary
(we detect this by attempting to connect to the registered primary)
we abort and suggest fixing the repmgr metadata first.
2019-01-31 11:54:05 +09:00
Ian Barwick
59eca2be30 node rejoin: improve error code handling
- return ERR_REJOIN_FAIL in all cases where the rejoin operation fails
 - ensure ERR_FOLLOW_FAIL is not returned
 - document error codes
2019-01-24 10:31:45 +09:00
Ian Barwick
dfe57d2406 "node rejoin": log pg_rewind command as DETAIL rather than DEBUG 2019-01-23 17:15:07 +09:00
Ian Barwick
061932d023 "node rejoin": verify status of rejoin target
This adapts the code previously added to "standby follow" to verify
whether the rejoin target can actually be rejoined.
2019-01-23 17:08:55 +09:00
Ian Barwick
3f5762e03a Refactor upstream attachment check code
Move it from the "standby follow" code to an independent function so it can
be used in other contexts, e.g. "node rejoin".
2019-01-23 15:11:42 +09:00
Ian Barwick
42fa9a2a88 Log node rejoin failure as ERROR 2019-01-23 13:55:40 +09:00
Ian Barwick
f23065e041 Fix typo in log message 2019-01-23 13:53:29 +09:00
Ian Barwick
7dce3ed234 Update copyright notices to 2019 2019-01-21 14:54:35 +09:00
Ian Barwick
0b3a310802 Add --data-directory-config option to "repmgr node check"
Implements part of GitHub #523.
2019-01-16 16:03:44 +09:00
Ian Barwick
9e90fcd584 "standby follow": verify status of follow target
This commit adds infrastruture for repmgr to be able to check
whether one standby can attach to another node, regardless whether
it is a standby or a primary.

This is intended to prevent a node from attempting to follow a
node whose timeline has diverged. The --dry-run option makes
it possible to test a follow operation before it is carried out.

As a useful side-effect this makes it possible for a standby to
follow another standby.

This is an initial implementation; documentation and possibly
further changes to follow.
2018-11-29 17:14:38 +09:00
Ian Barwick
74c44a7178 doc: document "repmgr node service"
This was originally intended for internal use, but it's mentioned
several times in the documentation and is useful for diagnostic
purposes.
2018-11-28 12:58:07 +09:00
Ian Barwick
793d83b22c Refactor server version detection
Most of the time we can simply get the version number directly from
the connection handle. Previously it was held in a global variable,
which was an icky way of doing things.

In a few special cases we also need the actual version string, which
is obtained directly from the database.
2018-11-22 21:30:31 +09:00
Ian Barwick
61c91df332 "repmgr node": use appendPQExpBufferStr/-Char() where appropriate 2018-10-03 14:09:29 +09:00
Ian Barwick
b346914d4d repmgr: fix "Missing replication slots" label in "node check"
Per report in GitHub #507.
2018-10-03 13:53:52 +09:00
Ian Barwick
9681708b1a repmgr: improve slot handling in "node rejoin"
On the rejoined node, if a replication slot for the new upstream exists
(which is typically the case after a failover), delete that slot.

Also emit a warning about any inactive replication slots which may need
to be cleaned up manually.

GitHub #499.
2018-08-30 12:24:13 +09:00
Ian Barwick
a5cfc244bc repmgr: have "node status" check for missing downstream nodes
This matches the behaviour of "node check".
2018-07-18 10:27:19 +09:00
Ian Barwick
ae60caacdd repmgr: make "node check" and "node status" return ERR_NODE_STATUS when appropriate
If any issue is detected (and "node check" is not being executed with a specific
individual check), "ERR_NODE_STATUS" is returned.
2018-07-05 14:31:06 +09:00
Ian Barwick
92d0e6809b repmgr: "cluster show" to return non-zero value if an issue encountered 2018-07-05 13:32:50 +09:00
Ian Barwick
b2081dca52 De-overload configuration file parameter "standby_reconnect_timeout"
Currently the (very generic sounding) "standby_reconnect_timeout" configuration
file parameter is used in several different contexts and it would be useful
to have more granular control over the different timeouts it's used to configure.

This patch introduces "node_rejoin_timeout", used in place of "standby_reconnect_timeout"
(which wasn't documented) when "repmgr node rejoin" is executed, to determine
how long to wait for the node to rejoin the replication cluster.

Additionally "repmgrd_standby_startup_timeout" is introduced as a timeout for
failover situations, when repmgrd executes "repmgr standby follow" to follow
a new primary, and waits for the standby to restart and become available
for connections.

"standby_reconnect_timeout" is now only relevant for "repmgr standby switchover".

Implements GitHub #454.
2018-06-28 18:00:55 +09:00
Ian Barwick
080a29c33b node check: add --missing-slots check
This enables an explicit check for slots which should exist (according
to the repmgr metadata) but which aren't present.
2018-06-22 17:21:40 +09:00
Ian Barwick
dd7a4068d2 node check: implement CSV output
This is advertised in the --help output and placeholder code was in
place, but it wasn't actually implemented.
2018-06-22 13:14:57 +09:00
Ian Barwick
fcf237fe31 node status: improve output and documentation
In the default text output mode, list inactive slots.

In CSV output mode, list inactive slots as additional information;
add output line with number of missing slots and a list thereof.

Also document --csv output mode.
2018-06-22 11:46:50 +09:00
Ian Barwick
4d70a667fb node check: clarify status information for witness server
Previously the output gave the impression the server was a primary,
which is technically the case, but it's not the actual cluster primary.

Also output an error if the node is in recovery, which is unlikely but
you never know.
2018-06-22 10:15:45 +09:00
Ian Barwick
0f97a98f28 repmgr: don't count witness node as a standby when running "node status"
Addresses GitHub #451.
2018-06-21 13:06:18 +09:00
Ian Barwick
269e3242c8 "repmgr node ...": update comments and formatting 2018-06-21 12:12:07 +09:00