Commit Graph

213 Commits

Author SHA1 Message Date
Ian Barwick 5e4bdb5a1b repmgrd: handle failover with two nodes in the primary location
If two nodes were in the primary location, and at least one node in
another location, the non-failed node in the primary location was not
recognising itself as a promotion candidate.

Addresses GitHub #407.
2018-04-02 20:51:27 +09:00
Ian Barwick a403da67bc Consolidate connection closure calls 2018-03-27 16:43:59 +09:00
Ian Barwick 0e55a60660 Add event "repmgrd_failover_aborted" 2018-03-21 13:23:06 +09:00
Ian Barwick 81c69e3677 repmgrd: fix typo 2018-03-21 12:36:15 +09:00
Ian Barwick 2a99dfa15b repmgrd: fix failover handling in "manual" mode
Regression was introduced in commit c7a585c555
2018-03-07 19:21:40 +09:00
Ian Barwick cdb504d700 Add event "repmgrd_shutdown"
Implements GitHub #393
2018-03-06 11:00:03 +09:00
Ian Barwick 0af2077bed repmgrd: add debug log output for "monitor_interval_secs" sleep in all modes 2018-03-06 10:56:21 +09:00
Ian Barwick bc766a48ed repmgrd: retry standby connection after cascading standby failover 2018-03-02 11:05:07 +09:00
Ian Barwick 55441f2729 repmgrd: add configuration file parameter "standby_reconnect_timeout"
This is used for determining a timeout when reconnecting to the standby
after executing the "follow_command". This will normally not need to be
set explicitly, but maybe useful in cases where the standby's startup
phase can last longer than usual.
2018-03-02 11:04:56 +09:00
Ian Barwick c1356b9e0d repmgrd: retry standby connection after "follow_command" executed
It's possible that the standby is still starting up after the "follow_command"
completes, so poll for a while until we get a connection.
2018-03-02 11:04:19 +09:00
Ian Barwick 22b3a74fa0 repmgrd: improve detection of status change from primary to standby
If repmgrd is running in degraded mode on a primary which has been stopped,
then manually been brought back online as a standby (e.g. by creating
recovery.conf and starting the server), ensure it not only detects the
change but automatically updates the node record so it can resume
monitoring the node as a standby.

Previously, repmgrd was looping waiting for the record to be updated
(as is done transparently when executing "repmgr node rejoin") but
if the record was not updated within the timeout period (e.g. by
"repmgr standby register) it would fail to resume monitoring as a
standby.

It seems reasonable to have repmgrd automatically update the node record,
as this will restore failover capability as quickly as possible. If this
is not desired, then the onus is on the user to shut down repmgrd while
making the desired changes.
2018-02-22 15:50:45 +09:00
Ian Barwick ec068e38a2 Remove --bdr-only configuration option
This was required for a specific use case during pre-release
development and is no longer needed now the physical streaming
replication handling is implemented.
2018-01-25 10:48:09 +09:00
Ian Barwick e64d965c6a repmgrd: document standby_[failure|recovery] event notifications
Also clean up the relevant code section.

Addresses GitHub #359.
2018-01-04 09:33:37 +09:00
Ian Barwick 26a9e848fd Update copyright notices to 2018 2018-01-02 10:19:46 +09:00
Ian Barwick 8c422d6084 Remove unneeded functions 2017-11-20 15:18:21 +09:00
Ian Barwick 08b443dce0 repmgrd: renable monitoring data recording when in archive recovery.
The warning emitted gives the impression that monitoring data shouldn't
be written if there's no streaming replication, but we can and should
do this as long as we have a primary connection.

Explictly document this in the code.

Also remove an unused variable warning.
2017-11-16 17:17:17 +09:00
Ian Barwick 9d432546bf repmgrd: don't fail over unless more than 50% of active nodes are visible. 2017-11-15 13:48:28 +09:00
Ian Barwick 3c557ebd8e repmgrd: finalize witness failover handling 2017-11-15 13:48:25 +09:00
Ian Barwick 4efeb52cba repmgrd: synchronise repmgr.nodes table on witness server 2017-11-15 13:48:21 +09:00
Ian Barwick 60422c66f9 repmgrd: handle witness server 2017-11-15 13:48:17 +09:00
Ian Barwick a31980b590 repmgrd: basic witness node monitoring 2017-11-15 13:48:11 +09:00
Ian Barwick a6cc4d80f0 Add "witness register" functionality 2017-11-15 13:47:45 +09:00
Ian Barwick 9908a9c662 repmgrd: detect role change from primary to standby
If repmgrd is monitoring a primary which is taken off-line, then later
restored as a standby, detect this change and resume monitoring
in standby node.

Addresses GitHub #338.
2017-11-10 17:19:30 +09:00
Ian Barwick 0230bafae1 repmgrd: updates related to node_id handling 2017-11-10 12:07:31 +09:00
Ian Barwick de577adc67 repmgrd: catch corner cases where monitoring data is not available 2017-11-09 22:27:09 +09:00
Ian Barwick fed17d49e3 repmgrd: ensure shmem is reinitialised after a restart 2017-11-09 19:31:21 +09:00
Ian Barwick d80763f974 repmgrd: misc fixes 2017-11-09 19:31:16 +09:00
Ian Barwick 331e982bdb repmgrd: fix priority/node_id tie-break check 2017-11-09 19:31:12 +09:00
Ian Barwick 6ac6e0733a repmgrd: simplify the candidate selection logic
All disconnected nodes will be in a static, known state, so as long as
each node has the same meta-information (repmgr.nodes) and is able
to retrieve the last receive LSN of the other nodes, it is possible
for each node to independently determine the best promotion candidate,
thereby reaching consensus without an explicit "voting" process.
2017-11-09 19:31:04 +09:00
Ian Barwick 79d21b516b repmgrd: fixes to failover handling
get_new_primary() returns NULL if no notification for the new primary has
been received, but the code was expecting it to return UNKNOWN_NODE_ID,
which was causing repmgrd to prematurely drop out of the new primary
detection loop if no notification had been received by the time the loop
started.

Also store the electoral term as a single row, single column table,
to ensure that all repmgrds see the same turn. It is then bumped
by the winning node after it gets promoted.

Various logging improvements.
2017-11-08 14:28:08 +09:00
Ian Barwick d6c27f8938 Standardize quoting in log messages 2017-10-04 09:34:59 +09:00
Ian Barwick a9f4a027a7 pgindent run 2017-09-11 11:14:13 +09:00
Ian Barwick 3447257ae4 repmgrd: minor fixes and comment updates 2017-09-08 20:59:21 +09:00
Ian Barwick e4f7dc8234 Add copyright notices 2017-09-08 13:27:39 +09:00
Ian Barwick 1ef00f5a3b repmgrd: parse "follow_command" during cascaded standby failover 2017-09-05 11:19:25 +09:00
Ian Barwick 78e6bdeebe Have repmgrd parse "standby follow --upstream-node-id=%n" 2017-09-04 13:42:50 +09:00
Ian Barwick ab6702891a Minor fixes to cascading standby failover. 2017-09-01 13:09:17 +09:00
Ian Barwick 154c76e5e7 repmgrd: improve cascaded standby failover
Check primary is available.
2017-08-29 15:29:17 +09:00
Ian Barwick e0888c1f62 repmgrd: handle SIGHUP 2017-08-29 12:55:13 +09:00
Ian Barwick df827c6518 Update repmgrd documentation 2017-08-29 11:04:30 +09:00
Ian Barwick 4a11551c2f repmgrd: handle local node failure 2017-08-28 10:31:43 +09:00
Ian Barwick fcd111ac4c Improve logging output during failover process 2017-08-24 22:44:03 +09:00
Ian Barwick db157ad9bc Update README 2017-08-24 17:43:01 +09:00
Ian Barwick eee8d65259 Update view "replication_status" 2017-08-24 15:05:13 +09:00
Ian Barwick a659132ea4 repmgrd: write monitoring statistics 2017-08-24 11:49:44 +09:00
Ian Barwick 8dfb7bbc7d repmgrd: handle promotion failure properly 2017-08-23 21:44:18 +09:00
Ian Barwick 6259463007 repmgrd: various fixes for "manual" failover mode 2017-08-23 10:56:55 +09:00
Ian Barwick 791640e3b4 repmgrd: never execute "service_promote_command" directly 2017-08-02 12:09:25 +09:00
Ian Barwick 7cf3b9b618 repmgrd: improve logging of BDR monitoring
Also always log information about event_notification command
2017-07-27 21:12:41 +09:00
Ian Barwick 56b2e9bb84 Rename/add configuration file options
In previous versions of repmgr, some options had ambiguous meanings,
and/or were used for slightly different purposes. This way we end
up with a couple more options (most of which probably won't need
adjusting) but greater clarity and flexibility.

Removed:

  master_reponse_timeout:
    renamed to "async_query_timeout", as this was its main usage

  retry_promote_interval_secs:
    replaced by "primary_notification_timeout"

Added:
  async_query_timeout:
    timeout (in seconds) when executing asynchronous queries

  primary_notification_timeout:
    number of seconds to wait for notification from the new primary
    after a failover

  primary_follow_timeout:
    number of seconds to wait for the new primary to become available
    when executing "repmgr standby follow"
2017-07-25 11:13:32 +09:00