1240 Commits

Author SHA1 Message Date
Ian Barwick 70a7b45a03 doc: add explanation of the configuration file format 2019-03-15 14:07:19 +09:00
Ian Barwick 4251590833 doc: update "connection_check_type" descriptions 2019-03-15 14:07:13 +09:00
Ian Barwick 9347d34ce0 repmgrd: optionally check upstream availability through connection attempts 2019-03-15 14:07:08 +09:00
John Naylor feb90ee50c Correct some doc typos 2019-03-15 14:07:05 +09:00
Ian Barwick 0a6486bb7f doc: expand "standby_disconnect_on_failover" documentation 2019-03-15 14:07:01 +09:00
Ian Barwick 39443bbcee Count witness and zero-priority nodes in visibility check 2019-03-15 14:06:58 +09:00
Ian Barwick fc636b1bd2 Ensure witness node sets last upstream seen time 2019-03-15 14:06:55 +09:00
Ian Barwick 048bad1c88 doc: fix option name typo 2019-03-15 14:06:51 +09:00
Ian Barwick 4528eb1796 doc: expand "failover_validation_command" documentation 2019-03-15 14:06:37 +09:00
Ian Barwick 169c9ccd32 repmgrd: improve logging output when executing "failover_validation_command" 2019-03-15 14:06:34 +09:00
Ian Barwick 5f92fbddf2 doc: various updates 2019-03-15 14:06:30 +09:00
Ian Barwick 617e466f72 doc: merge repmgrd witness server description into failover section 2019-03-13 16:19:41 +09:00
Ian Barwick 435fac297b doc: merge repmgrd split network handling description into failover section 2019-03-13 16:19:37 +09:00
Ian Barwick 4bc12b4c94 doc: merge repmgrd monitoring description into operating section 2019-03-13 16:19:33 +09:00
Ian Barwick 91234994e2 doc: merge repmgrd degraded monitoring description into operation section 2019-03-13 16:19:30 +09:00
Ian Barwick ee9da30f20 doc: merge repmgrd notes into operation documentation 2019-03-13 16:19:27 +09:00
Ian Barwick 2e67bc1341 doc: merge repmgrd pause documentation into overview 2019-03-13 16:19:24 +09:00
Ian Barwick 18ab5cab4e doc: initial repmgrd doc refactoring 2019-03-13 16:19:20 +09:00
Ian Barwick 60bb4e9fc8 doc: update repmgrd configuration documentation 2019-03-13 16:19:17 +09:00
Ian Barwick 52bee6b98d repmgrd: various minor logging improvements 2019-03-13 16:19:13 +09:00
Ian Barwick ecb1f379f5 repmgrd: remove global variable
Make the "sibling_nodes" local, and pass by reference where relevant.
2019-03-13 16:19:10 +09:00
Ian Barwick e1cd2c22d4 repmgrd: enable election rerun
If "failover_validation_command" is set, and the command returns an error,
rerun the election.

There is a pause between reruns to avoid "churn"; the length of this pause
is controlled by the configuration parameter "election_rerun_interval".
2019-03-13 16:19:03 +09:00
Ian Barwick 1dea6b76d9 Remove redundant struct allocation 2019-03-13 16:19:00 +09:00
Ian Barwick 702f90fc9d doc: update list of reloadable repmgrd configuration options 2019-03-13 16:18:56 +09:00
Ian Barwick c4d1eec6f3 doc: document "failover_validation_command" 2019-03-13 16:18:53 +09:00
Ian Barwick b241c606c0 doc: expand repmgrd configuration section 2019-03-13 16:18:50 +09:00
Ian Barwick 45c896d716 Execute "failover_validation_command" when only one standby exists 2019-03-08 15:29:17 +09:00
Ian Barwick 514595ea10 Make "failover_validation_command" reloadable 2019-03-08 15:29:12 +09:00
Ian Barwick 531194fa27 Initial implementation of "failover_validation_command" 2019-03-08 15:29:06 +09:00
Ian Barwick 2aa67c992c Make recently added configuration options reloadable 2019-03-08 15:28:59 +09:00
Ian Barwick 37892afcfc Add configuration option "primary_visibility_consensus"
This determines whether repmgrd should continue with a failover if
one or more nodes report they can still see the primary.
2019-03-08 15:28:53 +09:00
Ian Barwick e4e5e35552 Add configuration option "sibling_nodes_disconnect_timeout"
This controls the maximum length of time in seconds that repmgrd will
wait for other standbys to disconnect their WAL receivers in a failover
situation.

This setting is only used when "standby_disconnect_on_failover" is set to "true".
2019-03-08 15:28:48 +09:00
Ian Barwick b320c1f0ae Reset "wal_retrieve_retry_interval" for all nodes 2019-03-08 15:28:42 +09:00
Ian Barwick 280654bed6 repmgrd: don't wait for WAL receiver to reconnect during failover
If the WAL receiver has been temporarily disabled, we don't want to
wait for it to start up as it may not be able to at that point; we do
however need to reset "wal_retrieve_retry_interval".
2019-03-08 15:28:27 +09:00
Ian Barwick ae675059c0 Improve logging/sanity checking for "node control" options 2019-03-08 15:28:22 +09:00
Ian Barwick 454ebabe89 Improve logging when disabling/enabling WAL receiver
Also check that the action is being run on a node which is in recovery.
2019-03-08 15:28:17 +09:00
Ian Barwick d1d6ef8d12 Check for WAL receiver start up 2019-03-08 15:28:11 +09:00
Ian Barwick 5d6eab74f6 Log warning if "standby_disconnect_on_failover" used on pre-9.5
"standby_disconnect_on_failover" requires availability of "wal_retrieve_retry_interval",
which is available from PostgreSQL 9.5.

9.4 will fall out of community support this year, so it doesn't seem
productive at this point to do anything more than put the onus on the user
to read the documentation and heed any warning messages in the logs.
2019-03-08 15:28:01 +09:00
Ian Barwick 59b7453bbf repmgrd: optionally disconnect WAL receivers during failover
This is intended to ensure that all nodes have a constant LSN while
making the failover decision.

This feature is experimental and needs to be explicitly enabled with the
configuration file option "standby_disconnect_on_failover".

Note that enabling this option will delay the failover decision
until the WAL receiver is disconnected on all nodes.
2019-03-08 15:27:54 +09:00
Ian Barwick bde8c7e29c repmgrd: handle reconnect to restarted server when using "connection" checks 2019-03-08 15:27:49 +09:00
Ian Barwick bc6584a90d *_transaction() functions: log error message text as DETAIL
Per behaviour elsewhere.
2019-03-06 13:23:57 +09:00
Ian Barwick 074d79b44f repmgrd: add option "connection_check_type"
This enables selection of the method repmgrd uses to check whether the upstream
node is available. Possible values are:

 - "ping" (default): uses PQping() to check server availability
 - "connection": executes a query on the connection to check server
   availability (similar to repmgr 3.x).
2019-03-06 13:23:53 +09:00
Ian Barwick 2eeb288573 repmgrd: ignore invalid "upstream_last_seen" value 2019-03-06 13:23:47 +09:00
Ian Barwick 48a2274b11 Use appendPQExpBufferStr where appropriate 2019-03-06 13:23:38 +09:00
Ian Barwick 19bcfa7264 Rename "..._primary_last_seen" functions to "..._upstream_last_seen"
As that better reflects what they do.
2019-03-06 13:23:33 +09:00
Ian Barwick 486877c3d5 repmgrd: log details of nodes which can see primary
If a failover is cancelled because other nodes can still see the primary,
log the identities of those nodes.
2019-03-06 13:23:27 +09:00
Ian Barwick 9753bcc8c3 repmgrd: during failover, check if other nodes have seen the primary
In a situation where only some standbys are cut off from the primary,
a failover would result in a split brain/split cluster situation,
as it's likely one of the cut-off standbys will promote itself, and
other cut-off standbys (but not all standbys) will follow it.

To prevent this happening, interrogate the other sibling nodes to
check whether they've seen the primary within a reasonably short interval;
if this is the case, do not take any failover action.

This feature is experimental.
2019-03-06 13:23:22 +09:00
Ian Barwick bd35b450da daemon status: with csv output, show repmgrd status as unknown where appropriate
Previously, if PostgreSQL was not running on the node, repmgrd and
pause status were shown as "0", implying their status was known.

This brings the csv output in line with the human-readable output,
which displays "n/a" in this case.
2019-02-28 12:28:04 +09:00
Ian Barwick 1f256d4d73 doc: update release notes 2019-02-28 10:02:05 +09:00
Ian Barwick 1524e2449f Split command execution functions into separate library
These may need to be executed by repmgrd.
2019-02-27 14:41:38 +09:00