Commit Graph

1932 Commits

Author SHA1 Message Date
Ian Barwick 0e2f3e563a doc: various updates 2019-03-13 16:55:32 +09:00
Ian Barwick 8c4421d110 doc: merge repmgrd witness server description into failover section 2019-03-13 16:12:17 +09:00
Ian Barwick 69cb3f1e82 doc: merge repmgrd split network handling description into failover section 2019-03-13 16:12:14 +09:00
Ian Barwick 960acfeb3c doc: merge repmgrd monitoring description into operating section 2019-03-13 16:12:11 +09:00
Ian Barwick a8d50a5b98 doc: merge repmgrd degraded monitoring description into operation section 2019-03-13 16:12:06 +09:00
Ian Barwick 11e5993bf5 doc: merge repmgrd notes into operation documentation 2019-03-13 16:12:03 +09:00
Ian Barwick 09861a5604 doc: merge repmgrd pause documentation into overview 2019-03-13 16:11:59 +09:00
Ian Barwick 89bba77d4d doc: initial repmgrd doc refactoring 2019-03-13 16:11:55 +09:00
Ian Barwick dd6ece326f doc: update repmgrd configuration documentation 2019-03-13 13:34:08 +09:00
Ian Barwick 573d027db6 repmgrd: various minor logging improvements 2019-03-13 11:27:17 +09:00
Ian Barwick 1afb41647b repmgrd: remove global variable
Make the "sibling_nodes" local, and pass by reference where relevant.
2019-03-12 17:12:23 +09:00
Ian Barwick fc397f25f6 repmgrd: enable election rerun
If "failover_validation_command" is set, and the command returns an error,
rerun the election.

There is a pause between reruns to avoid "churn"; the length of this pause
is controlled by the configuration parameter "election_rerun_interval".
2019-03-12 17:12:19 +09:00
Ian Barwick 99923f5ffc Remove redundant struct allocation 2019-03-11 19:06:07 +09:00
Ian Barwick b9cdcd55e7 doc: update list of reloadable repmgrd configuration options 2019-03-11 16:18:10 +09:00
Ian Barwick db87ff46fd doc: document "failover_validation_command" 2019-03-11 15:02:33 +09:00
Ian Barwick 2a8f8d8400 doc: expand repmgrd configuration section 2019-03-11 14:50:33 +09:00
Ian Barwick 4ef706c2ca Execute "failover_validation_command" when only one standby exists 2019-03-08 12:19:37 +09:00
Ian Barwick 663c2e75b4 Make "failover_validation_command" reloadable 2019-03-08 09:27:19 +09:00
Ian Barwick db0d71c6a7 Initial implementation of "failover_validation_command" 2019-03-08 08:49:15 +09:00
Ian Barwick 6f4f56dd8c Make recently added configuration options reloadable 2019-03-07 10:58:25 +09:00
Ian Barwick 33fefd9f52 Add configuration option "primary_visibility_consensus"
This determines whether repmgrd should continue with a failover if
one or more nodes report they can still see the standby.
2019-03-07 10:41:42 +09:00
Ian Barwick a3f90d2bba Add configuration option "sibling_nodes_disconnect_timeout"
This controls the maximum length of time in seconds that repmgrd will
wait for other standbys to disconnect their WAL receivers in a failover
situation.

This setting is only used when "standby_disconnect_on_failover" is set to "true".
2019-03-06 15:56:21 +09:00
Ian Barwick 2ed044c358 Reset "wal_retrieve_retry_interval" for all nodes 2019-03-06 15:55:03 +09:00
Ian Barwick 9823978f41 repmgrd: don't wait for WAL receiver to reconnect during failover
If the WAL receiver has been temporarily disabled, we don't want to
wait for it to start up as it may not be able to at that point; we do
however need to reset "wal_retrieve_retry_interval".
2019-03-06 15:54:56 +09:00
Ian Barwick ae8171e461 Improve logging/sanity checking for "node control" options 2019-03-06 15:54:30 +09:00
Ian Barwick 1f8f64d57c Improve logging when disabling/enabling WAL receiver
Also check action is being run on node which is in recovery.
2019-03-06 15:54:26 +09:00
Ian Barwick 13c650fa83 Check for WAL receiver start up 2019-03-06 15:54:23 +09:00
Ian Barwick f85b4cd98e Log warning if "standby_disconnect_on_failover" used on pre-9.5
"standby_disconnect_on_failover" requires availability of "wal_retrieve_retry_interval",
which is available from PostgreSQL 9.5.

9.4 will fall out of community support this year, so it doesn't seem
productive at this point to do anything more than put the onus on the user
to read the documentation and heed any warning messages in the logs.
2019-03-06 15:54:15 +09:00
Ian Barwick 1615353f48 repmgrd: optionally disconnect WAL receivers during failover
This is intended to ensure that all nodes have a constant LSN while
making the failover decision.

This feature is experimental and needs to be explicitly enabled with the
configuration file option "standby_disconnect_on_failover".

Note enabling this option will result in a delay in the failover decision
until the WAL receiver is disconnected on all nodes.
2019-03-06 15:53:57 +09:00
Ian Barwick dd04ebb809 repmgrd: handle reconnect to restarted server when using "connection" checks 2019-03-06 14:54:05 +09:00
Ian Barwick b4dcda37a1 *_transaction() functions: log error message text as DETAIL
Per behaviour elsewhere.
2019-03-06 12:12:47 +09:00
Ian Barwick 63f7ad546e repmgrd: add option "connection_check_type"
This enable selection of the method repmgrd uses to check whether the upstream
node is available. Possible values are:

 - "ping" (default): uses PQping() to check server availability
 - "connection":  executes a query on the connection to check server
   availability (similar to repmgr3.x).
2019-03-06 12:09:54 +09:00
Ian Barwick 4f83111033 repmgrd: ignore invalid "upstream_last_seen" value 2019-03-05 11:00:29 +09:00
Ian Barwick 92103c5338 Use appendPQExpBufferStr where approrpriate 2019-03-01 16:42:00 +09:00
Ian Barwick 4b89cbd98d Rename "..._primary_last_seen" functions to "..._upstream_last_seen"
As that better reflects what they do.
2019-02-28 15:36:55 +09:00
Ian Barwick 0330fa6e62 daemon status: with csv output, show repmgrd status as unknown where appropriate
Previously, if PostgreSQL was not running on the node, repmgrd and
pause status were shown as "0", implying their status was known.

This brings the csv output in line with the human-readable output,
which displays "n/a" in this case.
2019-02-28 12:27:39 +09:00
Ian Barwick 4006f8af3c doc: upate release notes 2019-02-28 10:01:51 +09:00
Ian Barwick b1875a8d91 Split command execution functions into separate library
These may need to be executed by repmgrd.
2019-02-27 14:41:17 +09:00
Ian Barwick 5c2264eb8d Update .gitignore
Ignore artefacts from failed patch application.
2019-02-27 13:02:30 +09:00
Ian Barwick a6c16541c2 doc: tweak wording in event notification documentation 2019-02-27 13:01:19 +09:00
Ian Barwick 790a1cc492 repmgrd: add additional logging during a failover operation 2019-02-27 11:46:05 +09:00
Ian Barwick 067ed82931 Remove unneeded debugging output 2019-02-26 21:16:11 +09:00
Ian Barwick 59f32d74df doc: update introductory blurb 2019-02-26 15:16:46 +09:00
Ian Barwick 0578053875 standby clone: check upstream connections after data copy operation
With long-running copy operations, it's possible the connection(s) to
the primary/source server may go away for some reason, so recheck
their availability before attempting to reuse.
2019-02-26 14:37:05 +09:00
John Naylor 897e3bee14 Doc fix: PostgreSQL 9.4 is no longer considered recent 2019-02-24 12:44:10 +07:00
John Naylor 4e414d2ea0 Fix typo 2019-02-24 10:50:09 +07:00
Ian Barwick ea36609159 Add some missing query error logging 2019-02-23 16:54:07 +09:00
Ian Barwick 0c68018631 repmgrd: log details of nodes which can see primary
If a failover is cancelled because other nodes can still see the primary,
log the identies of those nodes.
2019-02-23 15:55:06 +09:00
Ian Barwick b72c894db4 repmgrd: during failover, check if other nodes have seen the primary
In a situation where only some standbys are cut off from the primary,
a failover would result in a split brain/split cluster situation,
as it's likely one of the cut-off standbys will promote itself, and
other cut-off standbys (but not all standbys) will follow it.

To prevent this happening, interrogate the other sibiling nodes to
check whether they've seen the primary within a reasonably short interval;
if this is the case, do not take any failover action.

This feature is experimental.
2019-02-23 13:03:22 +09:00
Ian Barwick 07097575b1 daemon status: add column "upstream last seen"
This displays the interval (in seconds) since the repmgrd instance on
each node last confirmed its upstream node is available.
2019-02-23 13:03:16 +09:00