75 Commits

Author SHA1 Message Date
Ian Barwick
63bdc19132 repmgrd: ensure local node is counted as quorum member
Rename "standby_nodes" to "sibling_nodes" to make it clearer in the
code what total is actually provided by the struct.

Addresses GitHub #439.
2018-06-01 17:19:40 +09:00
Ian Barwick
0ffaff75df repmgrd: ensure degraded monitoring timeout works on standby
Parameter "degraded_monitoring_timeout" was not being acted on when
monitoring a streaming replication standby.

Addresses GitHub #439.
2018-05-31 17:53:31 +09:00
Martín Marqués
2dfe1d18e9 Fix typo in a code comment 2018-05-19 12:29:04 -03:00
Ian Barwick
67ccd4dcb3 repmgrd: don't explicitly close connections on shutdown 2018-04-30 15:13:30 +09:00
Ian Barwick
f86e89ba45 repmgrd: notify sibling nodes to follow new primary after pg_ctl timeout
If "pg_ctl promote" fails due to a timeout, but the promotion itself succeeds,
have repmgrd on the new primary explicitly notify any sibling nodes to
follow it.

Previously the sibling nodes would wait "primary_notification_timeout" seconds
before attempting to discover the new primary.

This (and preceding commit eac80ae) address GitHub #425.
2018-04-27 11:59:00 +09:00
Ian Barwick
a6d0ba07ed repmgrd: handle pg_ctl timeout
It's possible "pg_ctl promote" will time out, causing "repmgr standby
follow" to return with an error; however the promotion itself will usually
succeed, so detect this case and handle accordingly.
2018-04-26 19:23:26 +09:00
Ian Barwick
242fa287b4 repmgrd: catch corner case in standby connection handle check
If repmgrd marks the local node as unavailable, and it was actually
restarting but a failover event occurred before the next local node
check, failover will continue with the stale connection handle.

Add a final local node check just before starting the failover
process, so repmgrd can reconnect if it wasn't able to before.
2018-04-24 21:55:36 +09:00
Ian Barwick
fa908432c8 Minor doc and log output tweaks 2018-04-24 21:08:31 +09:00
Ian Barwick
afa942fef6 repmgrd: prevent standby connection handle from going stale
If monitoring history is not in use, there's no activity on the standby's
connection handle, so if e.g. the standby is restarted, PQstatus()
never returns CONNECTION_BAD and repmgrd never notices the connection
is stale. Therefore execute a throw-away statement at "monitor_interval_secs".
2018-04-23 23:51:03 +09:00
Ian Barwick
90cba78f52 repmgrd: tweak event notifications on standby failure
The event notification was only being created if there was a valid
primary connection; it should be created in any case, so an event
notification script can be executed.
2018-04-17 10:27:25 +09:00
Ian Barwick
65371489c6 repmgrd: handle failover with two nodes in the primary location
If two nodes were in the primary location, and at least one node in
another location, the non-failed node in the primary location was not
recognising itself as a promotion candidate.

Addresses GitHub #407.
2018-03-30 12:17:34 +09:00
Ian Barwick
37e53108a2 Consolidate connection closure calls 2018-03-27 08:52:23 +09:00
Ian Barwick
7e2af17783 repmgrd: tweak log notices when marking a standby as failed
Announce what we're going to do (set the node record inactive) *before*
performing the action. Makes reading the log slightly easier.
2018-03-23 13:27:37 +08:00
Ian Barwick
b4272853e7 Add event "repmgrd_failover_aborted" 2018-03-23 10:44:00 +08:00
Ian Barwick
d9cc09cee4 repmgrd: fix typo 2018-03-21 12:36:51 +09:00
Ian Barwick
9aea5b8aa7 repmgrd: fix failover handling in "manual" mode
Regression was introduced in commit c7a585c555
2018-03-06 22:35:51 +09:00
Ian Barwick
9c72c0d66e Add event "repmgrd_shutdown"
Implements GitHub #393
2018-03-06 10:59:54 +09:00
Ian Barwick
5a52917421 repmgrd: add debug log output for "monitor_interval_secs" sleep in all modes 2018-03-05 14:23:58 +09:00
Ian Barwick
fe594c95ad repmgrd: retry standby connection after cascading standby failover 2018-02-28 21:15:11 +09:00
Ian Barwick
60e63feaca repmgrd: add configuration file parameter "standby_reconnect_timeout"
This is used for determining a timeout when reconnecting to the standby
after executing the "follow_command". This will normally not need to be
set explicitly, but may be useful in cases where the standby's startup
phase can last longer than usual.
2018-02-28 18:56:33 +09:00
Ian Barwick
5e8b41e221 repmgrd: retry standby connection after "follow_command" executed
It's possible that the standby is still starting up after the "follow_command"
completes, so poll for a while until we get a connection.
2018-02-28 15:35:47 +09:00
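The retry behaviour described in commit 5e8b41e221 can be pictured as a bounded polling loop. The following is a minimal Python sketch of that idea, not repmgrd's actual C code: the function `wait_for_connection`, the `try_connect` callable and the `interval_secs` parameter are hypothetical names introduced for illustration, and `timeout_secs` merely stands in for a value such as "standby_reconnect_timeout".

```python
import time


def wait_for_connection(try_connect, timeout_secs, interval_secs=1.0):
    """Poll try_connect() until it returns a connection or the timeout expires.

    try_connect is a hypothetical callable returning a connection object on
    success and None on failure; it is not part of repmgr.
    """
    deadline = time.monotonic() + timeout_secs
    while True:
        conn = try_connect()
        if conn is not None:
            return conn
        # Give up once the deadline has passed; the caller decides what to do.
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval_secs)
```

At least one connection attempt is always made, even with a timeout of zero, so a standby that is already up is picked up immediately.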
Ian Barwick
c7a585c555 repmgrd: improve log output
- emit explicit startup NOTICE
- emit NOTICE when falling back to degraded monitoring on a primary node
- improve log message and event notification details when monitoring
  a former primary which has been reconnected as a standby
2018-02-28 12:35:13 +09:00
Ian Barwick
829cf5cca4 repmgrd: improve detection of status change from primary to standby
If repmgrd is running in degraded mode on a primary which has been stopped,
then manually brought back online as a standby (e.g. by creating
recovery.conf and starting the server), ensure it not only detects the
change but automatically updates the node record so it can resume
monitoring the node as a standby.

Previously, repmgrd was looping waiting for the record to be updated
(as is done transparently when executing "repmgr node rejoin") but
if the record was not updated within the timeout period (e.g. by
"repmgr standby register") it would fail to resume monitoring as a
standby.

It seems reasonable to have repmgrd automatically update the node record,
as this will restore failover capability as quickly as possible. If this
is not desired, then the onus is on the user to shut down repmgrd while
making the desired changes.
2018-02-22 11:35:47 +09:00
Ian Barwick
6dc1969ad5 Remove --bdr-only configuration option
This was required for a specific use case during pre-release
development and is no longer needed now the physical streaming
replication handling is implemented.
2018-01-18 13:30:47 +09:00
Ian Barwick
486f8e5a2c repmgrd: document standby_[failure|recovery] event notifications
Also clean up the relevant code section.

Addresses GitHub #359.
2018-01-04 09:34:49 +09:00
Ian Barwick
1521657965 Update copyright notices to 2018 2018-01-02 10:20:09 +09:00
Ian Barwick
f6a6df3600 repmgrd: re-enable monitoring data recording when in archive recovery
The warning emitted gives the impression that monitoring data shouldn't
be written if there's no streaming replication, but we can and should
do this as long as we have a primary connection.

Explicitly document this in the code.

Also remove an unused variable warning.
2017-11-20 15:29:21 +09:00
Ian Barwick
67e27f9ecd Remove unneeded functions 2017-11-20 15:26:32 +09:00
Ian Barwick
53ebde8f33 repmgrd: don't fail over unless more than 50% of active nodes are visible. 2017-11-15 14:04:41 +09:00
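The rule in commit 53ebde8f33 amounts to a strict-majority check. A minimal Python sketch follows, assuming both counts include the local node (as the later commit 63bdc19132 suggests); the function name `majority_visible` is hypothetical and this is not repmgrd's actual C implementation.

```python
def majority_visible(visible_nodes: int, active_nodes: int) -> bool:
    """Return True only if strictly more than 50% of active nodes are visible.

    Both counts are assumed to include the local node. Integer arithmetic
    avoids any floating-point rounding at exactly 50%.
    """
    if active_nodes <= 0:
        return False
    return visible_nodes * 2 > active_nodes
```

With three active nodes, two visible nodes are enough to proceed; with four active nodes, two visible nodes are exactly 50% and failover would not go ahead.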
Ian Barwick
5e9d50f8ca repmgrd: finalize witness failover handling 2017-11-15 14:04:37 +09:00
Ian Barwick
347e753c27 repmgrd: synchronise repmgr.nodes table on witness server 2017-11-15 14:04:34 +09:00
Ian Barwick
2f978847b1 repmgrd: handle witness server 2017-11-15 14:04:30 +09:00
Ian Barwick
e02ddd0f37 repmgrd: basic witness node monitoring 2017-11-15 14:04:23 +09:00
Ian Barwick
31b856dd9f Add "witness register" functionality 2017-11-15 14:03:54 +09:00
Ian Barwick
e16eb42693 repmgrd: detect role change from primary to standby
If repmgrd is monitoring a primary which is taken off-line, then later
restored as a standby, detect this change and resume monitoring
in standby mode.

Addresses GitHub #338.
2017-11-15 14:03:26 +09:00
Ian Barwick
cbc97d84ac repmgrd: updates related to node_id handling 2017-11-15 14:03:15 +09:00
Ian Barwick
96fe7dd2d6 repmgrd: catch corner cases where monitoring data is not available 2017-11-15 14:03:12 +09:00
Ian Barwick
13935a88c9 repmgrd: ensure shmem is reinitialised after a restart 2017-11-09 19:51:31 +09:00
Ian Barwick
5275890467 repmgrd: misc fixes 2017-11-09 19:51:26 +09:00
Ian Barwick
7f865fdaf3 repmgrd: fix priority/node_id tie-break check 2017-11-09 19:51:22 +09:00
Ian Barwick
a3428e4d8a repmgrd: simplify the candidate selection logic
All disconnected nodes will be in a static, known state, so as long as
each node has the same meta-information (repmgr.nodes) and is able
to retrieve the last receive LSN of the other nodes, it is possible
for each node to independently determine the best promotion candidate,
thereby reaching consensus without an explicit "voting" process.
2017-11-09 19:51:13 +09:00
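The reasoning in commit a3428e4d8a is that a deterministic ranking over identical inputs yields the same winner on every node. A minimal Python sketch of that idea is shown here; the `Node` type, the `best_candidate` function and the exact ordering (highest last receive LSN, then highest priority, then lowest node_id) are assumptions made for illustration and are not taken from repmgrd's C source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    node_id: int
    priority: int
    last_receive_lsn: int  # LSN represented as a plain integer for simplicity


def best_candidate(nodes):
    """Pick the promotion candidate from the nodes known to this node.

    Assumed ordering: most advanced LSN first, then higher priority,
    then lower node_id as the final tie-break.
    """
    return max(nodes, key=lambda n: (n.last_receive_lsn, n.priority, -n.node_id))
```

Because the sort key depends only on the node attributes and not on the order in which nodes are listed, every node evaluating the same set arrives at the same candidate without exchanging votes.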
Ian Barwick
03b9475755 repmgrd: fixes to failover handling
get_new_primary() returns NULL if no notification for the new primary has
been received, but the code was expecting it to return UNKNOWN_NODE_ID,
which was causing repmgrd to prematurely drop out of the new primary
detection loop if no notification had been received by the time the loop
started.

Also store the electoral term as a single-row, single-column table,
to ensure that all repmgrds see the same term. It is then bumped
by the winning node after it gets promoted.

Various logging improvements.
2017-11-09 19:51:09 +09:00
Ian Barwick
d6c27f8938 Standardize quoting in log messages 2017-10-04 09:34:59 +09:00
Ian Barwick
a9f4a027a7 pgindent run 2017-09-11 11:14:13 +09:00
Ian Barwick
3447257ae4 repmgrd: minor fixes and comment updates 2017-09-08 20:59:21 +09:00
Ian Barwick
e4f7dc8234 Add copyright notices 2017-09-08 13:27:39 +09:00
Ian Barwick
1ef00f5a3b repmgrd: parse "follow_command" during cascaded standby failover 2017-09-05 11:19:25 +09:00
Ian Barwick
78e6bdeebe Have repmgrd parse "standby follow --upstream-node-id=%n" 2017-09-04 13:42:50 +09:00
Ian Barwick
ab6702891a Minor fixes to cascading standby failover. 2017-09-01 13:09:17 +09:00
Ian Barwick
154c76e5e7 repmgrd: improve cascaded standby failover
Check primary is available.
2017-08-29 15:29:17 +09:00