1908 Commits

Author SHA1 Message Date
Ian Barwick 314a1e8f4f use a constant to denote unknown replication lag 2019-03-20 17:26:04 +09:00
Ian Barwick 7204a0faf4 doc: consolidate witness server documentation 2019-03-20 16:31:52 +09:00
Ian Barwick 5e775cef16 doc: various improvements to repmgrd documentation 2019-03-20 16:10:03 +09:00
Ian Barwick 7d0caefaee Fix logging related to "connection_check_type"
Also log the selected type at repmgrd startup.
2019-03-20 11:58:18 +09:00
Ian Barwick 7434cc0b8e repmgrd: improve witness node monitoring
Mainly fix a couple of places where "standby" was hard-coded into a log
message which can apply either to a witness or a standby.
2019-03-20 11:47:36 +09:00
Ian Barwick b84d98fe81 Explicitly log PQping() failures 2019-03-20 11:47:32 +09:00
Ian Barwick 46efe57cd0 Improve database connection failure logging
Log the output of PQerrorMessage() in a couple of places where it was missing.

Additionally, always log the output of PQerrorMessage() starting with a blank
line, otherwise the first line looks like it was emitted by repmgr, and
it's harder to scan the error message.

Before:

    [2019-03-20 11:24:15] [DETAIL] could not connect to server: Connection refused
            Is the server running on host "localhost" (::1) and accepting
            TCP/IP connections on port 5501?
    could not connect to server: Connection refused
            Is the server running on host "localhost" (127.0.0.1) and accepting
            TCP/IP connections on port 5501?

After:

    [2019-03-20 11:27:21] [DETAIL]
    could not connect to server: Connection refused
            Is the server running on host "localhost" (::1) and accepting
            TCP/IP connections on port 5501?
    could not connect to server: Connection refused
            Is the server running on host "localhost" (127.0.0.1) and accepting
            TCP/IP connections on port 5501?
2019-03-20 11:47:28 +09:00
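
The "After" layout above can be sketched as a small formatting helper. This is only an illustration, not the code from the commit: the function name `format_detail`, its signature, and the trailing-newline handling are assumptions made for the example. The error text is passed in as a plain string so the sketch does not depend on a live database connection.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of repmgr): build a [DETAIL] log entry in
 * which the error text starts on its own line, so that no line of a
 * multi-line message looks as if it was emitted by the logger itself.
 *
 * Returns what snprintf() returns: the number of characters that would
 * have been written, excluding the terminating NUL.
 */
static int
format_detail(char *buf, size_t buflen, const char *timestamp, const char *errmsg)
{
    size_t      len = strlen(errmsg);

    /* Assumption: append a newline only if the message lacks one. */
    const char *trailer = (len > 0 && errmsg[len - 1] == '\n') ? "" : "\n";

    /* The "\n" directly after [DETAIL] is the blank first line. */
    return snprintf(buf, buflen, "[%s] [DETAIL]\n%s%s", timestamp, errmsg, trailer);
}
```

With this layout every line of the error text starts at the same column, which is the scanning benefit the commit message describes.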
Ian Barwick 426759ca8e check_primary_status(): handle case where recovery type unknown 2019-03-18 16:16:54 +09:00
Ian Barwick 39df55c39c Check node recovery type before attempting to write an event record
In some corner cases (e.g. immediately after a switchover) where
the current primary has not yet been determined, the provided connection
might not be writeable. This prevents error messages such as
"cannot execute INSERT in a read-only transaction" generating unnecessary
noise in the logs.
2019-03-18 15:26:16 +09:00
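
A minimal sketch of the guard described above, assuming the recovery type is available as an enum. The enum, its values, and the function name are hypothetical names chosen for this example and are not taken from the repmgr source.

```c
#include <stdbool.h>

/* Illustrative recovery states; names are assumptions for this sketch. */
typedef enum
{
    NODE_RECOVERY_UNKNOWN = -1,
    NODE_RECOVERY_PRIMARY,
    NODE_RECOVERY_STANDBY
} NodeRecoveryType;

/*
 * Decide whether an event record should be written over a connection.
 * Only a node known to be a primary accepts writes; a standby, or a node
 * whose state could not be determined (e.g. right after a switchover),
 * would reject the INSERT with a read-only transaction error.
 */
static bool
can_write_event_record(NodeRecoveryType recovery_type)
{
    return recovery_type == NODE_RECOVERY_PRIMARY;
}
```

Skipping the write when this returns false avoids the failed INSERT and the log noise it would cause.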
Ian Barwick f54ff85cfa Remove outdated comment
This was only relevant for repmgr3 and earlier; in repmgr4 the schema
is hard-coded.
2019-03-18 15:19:11 +09:00
Ian Barwick 8ab51c2ae3 Refactor check_primary_status()
Reduce nested if/else branching, and improve documentation.
2019-03-18 15:01:21 +09:00
Ian Barwick 43f28f4097 Clarify calls to check_primary_status()
Use a constant rather than a magic number to indicate non-provision
of elapsed degraded monitoring time.
2019-03-18 14:21:34 +09:00
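
The idea of replacing a magic number with a named constant might look like the sketch below. The constant name, the function, and the timeout comparison are all assumptions for illustration. The commit message only states that a constant now marks the "not provided" case.

```c
#include <stdbool.h>

/* Hypothetical sentinel: the caller has no elapsed time to report. */
#define ELAPSED_NOT_PROVIDED (-1)

/*
 * Illustrative check: has degraded monitoring run longer than allowed?
 * When no elapsed time was provided, the timeout cannot have been
 * exceeded, so the function returns false.
 */
static bool
degraded_timeout_exceeded(int elapsed_secs, int timeout_secs)
{
    if (elapsed_secs == ELAPSED_NOT_PROVIDED)
        return false;

    return elapsed_secs >= timeout_secs;
}
```

A call such as `degraded_timeout_exceeded(ELAPSED_NOT_PROVIDED, 60)` states its intent at the call site, where a bare `-1` would not.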
Ian Barwick 0940185f49 doc: clarify "cluster show" error codes 2019-03-18 10:49:38 +09:00
John Naylor 4f9fc56871 Fix assorted Makefile bugs
1. The target additional-maintainer-clean was misspelled as
maintainer-additional-clean.

2. Add missing clean targets, in particular sysutils.o, config.h,
repmgr_version.h, and Makefile.global. While at it, use a wildcard
for obj files.

3. Don't delete configure.

4. Remove generated file doc/version.sgml from the repo.

5. Have maintainer-clean recurse to the doc directory.
2019-03-15 16:29:31 +09:00
Ian Barwick fbdf9617fa doc: update repmgrd example output 2019-03-15 15:43:11 +09:00
Ian Barwick dfb92df05f doc: miscellaneous cleanup 2019-03-15 14:39:37 +09:00
Ian Barwick 9dd87dd5ce doc: add explanation of the configuration file format 2019-03-15 14:02:42 +09:00
Ian Barwick a2df69512a doc: update "connection_check_type" descriptions 2019-03-14 15:44:59 +09:00
Ian Barwick c2206b007a repmgrd: optionally check upstream availability through connection attempts 2019-03-14 15:44:53 +09:00
John Naylor e06d3de444 Correct some doc typos 2019-03-14 11:58:31 +08:00
Ian Barwick 9d056b2f72 doc: expand "standby_disconnect_on_failover" documentation 2019-03-14 12:08:13 +09:00
Ian Barwick 19bf4d7434 Count witness and zero-priority nodes in visibility check 2019-03-14 11:17:51 +09:00
Ian Barwick 56d9f5b856 Ensure witness node sets last upstream seen time 2019-03-14 10:53:47 +09:00
Ian Barwick c1d6753081 doc: fix option name typo 2019-03-14 09:32:06 +09:00
Ian Barwick 2b59b4894a doc: expand "failover_validation_command" documentation 2019-03-13 21:10:03 +09:00
Ian Barwick c3c58df7b9 repmgrd: improve logging output when executing "failover_validation_command" 2019-03-13 21:07:26 +09:00
Ian Barwick 0e2f3e563a doc: various updates 2019-03-13 16:55:32 +09:00
Ian Barwick 8c4421d110 doc: merge repmgrd witness server description into failover section 2019-03-13 16:12:17 +09:00
Ian Barwick 69cb3f1e82 doc: merge repmgrd split network handling description into failover section 2019-03-13 16:12:14 +09:00
Ian Barwick 960acfeb3c doc: merge repmgrd monitoring description into operation section 2019-03-13 16:12:11 +09:00
Ian Barwick a8d50a5b98 doc: merge repmgrd degraded monitoring description into operation section 2019-03-13 16:12:06 +09:00
Ian Barwick 11e5993bf5 doc: merge repmgrd notes into operation documentation 2019-03-13 16:12:03 +09:00
Ian Barwick 09861a5604 doc: merge repmgrd pause documentation into overview 2019-03-13 16:11:59 +09:00
Ian Barwick 89bba77d4d doc: initial repmgrd doc refactoring 2019-03-13 16:11:55 +09:00
Ian Barwick dd6ece326f doc: update repmgrd configuration documentation 2019-03-13 13:34:08 +09:00
Ian Barwick 573d027db6 repmgrd: various minor logging improvements 2019-03-13 11:27:17 +09:00
Ian Barwick 1afb41647b repmgrd: remove global variable
Make the "sibling_nodes" local, and pass by reference where relevant.
2019-03-12 17:12:23 +09:00
Ian Barwick fc397f25f6 repmgrd: enable election rerun
If "failover_validation_command" is set, and the command returns an error,
rerun the election.

There is a pause between reruns to avoid "churn"; the length of this pause
is controlled by the configuration parameter "election_rerun_interval".
2019-03-12 17:12:19 +09:00
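
The rerun behaviour described above can be sketched as a retry loop. This is an illustration under stated assumptions, not repmgrd's implementation: the validation step is modelled as a callback instead of an external command, and the `max_attempts` cap is an addition so that the example terminates. The commit message does not mention any such limit. All function and type names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

/* Stand-in for running "failover_validation_command"; true means success. */
typedef bool (*validation_fn) (void *ctx);

/*
 * Run the election, validate the result, and rerun on validation failure,
 * pausing rerun_interval_secs between attempts to avoid churn.
 *
 * Returns the 1-based number of the attempt that passed validation,
 * or 0 if no attempt passed within max_attempts (an assumed cap).
 */
static int
run_election_with_rerun(validation_fn validate, void *ctx,
                        unsigned int rerun_interval_secs, int max_attempts)
{
    assert(validate != NULL);

    for (int attempt = 1; attempt <= max_attempts; attempt++)
    {
        /* (the election itself would take place here) */
        if (validate(ctx))
            return attempt;

        /* Pause only if another attempt will follow. */
        if (attempt < max_attempts)
            sleep(rerun_interval_secs);
    }

    return 0;
}

/* Example validator: fails until the counter behind ctx reaches zero. */
static bool
succeed_after_countdown(void *ctx)
{
    int        *remaining_failures = ctx;

    if (*remaining_failures > 0)
    {
        (*remaining_failures)--;
        return false;
    }

    return true;
}
```

In this sketch `rerun_interval_secs` plays the role of "election_rerun_interval".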
Ian Barwick 99923f5ffc Remove redundant struct allocation 2019-03-11 19:06:07 +09:00
Ian Barwick b9cdcd55e7 doc: update list of reloadable repmgrd configuration options 2019-03-11 16:18:10 +09:00
Ian Barwick db87ff46fd doc: document "failover_validation_command" 2019-03-11 15:02:33 +09:00
Ian Barwick 2a8f8d8400 doc: expand repmgrd configuration section 2019-03-11 14:50:33 +09:00
Ian Barwick 4ef706c2ca Execute "failover_validation_command" when only one standby exists 2019-03-08 12:19:37 +09:00
Ian Barwick 663c2e75b4 Make "failover_validation_command" reloadable 2019-03-08 09:27:19 +09:00
Ian Barwick db0d71c6a7 Initial implementation of "failover_validation_command" 2019-03-08 08:49:15 +09:00
Ian Barwick 6f4f56dd8c Make recently added configuration options reloadable 2019-03-07 10:58:25 +09:00
Ian Barwick 33fefd9f52 Add configuration option "primary_visibility_consensus"
This determines whether repmgrd should continue with a failover if
one or more nodes report they can still see the primary.
2019-03-07 10:41:42 +09:00
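
One way to picture the decision this option controls is the sketch below. It assumes that enabling the option means any node still reporting sight of the original primary vetoes the failover. That reading, the function name, and the parameters are assumptions for illustration and should be checked against the repmgr documentation.

```c
#include <stdbool.h>

/*
 * Hypothetical decision helper.
 *
 * consensus_required:    value of "primary_visibility_consensus"
 * nodes_reporting_sight: how many nodes say they can still see
 *                        the original primary
 *
 * Assumed semantics: with the option enabled, a single such report
 * is enough to stop the failover; with it disabled, the reports are
 * ignored and the failover proceeds.
 */
static bool
should_continue_failover(bool consensus_required, int nodes_reporting_sight)
{
    if (consensus_required && nodes_reporting_sight > 0)
        return false;

    return true;
}
```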
Ian Barwick a3f90d2bba Add configuration option "sibling_nodes_disconnect_timeout"
This controls the maximum length of time in seconds that repmgrd will
wait for other standbys to disconnect their WAL receivers in a failover
situation.

This setting is only used when "standby_disconnect_on_failover" is set to "true".
2019-03-06 15:56:21 +09:00
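
The bounded wait described above could be sketched as a polling loop with a deadline. This is a simplified illustration, not repmgrd's code: the check of each sibling's WAL receiver is modelled as a callback returning the number of siblings still connected, the one-second poll interval is an assumption, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

/* Stand-in for querying siblings; returns how many are still connected. */
typedef int (*count_connected_fn) (void *ctx);

/*
 * Wait up to timeout_secs for all sibling WAL receivers to disconnect.
 * Returns true if none remain connected, or if the wait does not apply
 * because "standby_disconnect_on_failover" is off.
 */
static bool
wait_for_sibling_disconnect(bool standby_disconnect_on_failover,
                            int timeout_secs,
                            count_connected_fn count_connected, void *ctx)
{
    /* The timeout is only relevant when disconnection was requested. */
    if (!standby_disconnect_on_failover)
        return true;

    assert(count_connected != NULL);

    for (int waited = 0; waited < timeout_secs; waited++)
    {
        if (count_connected(ctx) == 0)
            return true;

        sleep(1);               /* assumed poll interval */
    }

    /* Final check once the timeout has elapsed. */
    return count_connected(ctx) == 0;
}

/* Example counter: reports the current value, then one fewer next time. */
static int
countdown_connected(void *ctx)
{
    int        *connected = ctx;
    int         current = *connected;

    if (*connected > 0)
        (*connected)--;

    return current;
}
```

Here `timeout_secs` corresponds to "sibling_nodes_disconnect_timeout".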
Ian Barwick 2ed044c358 Reset "wal_retrieve_retry_interval" for all nodes 2019-03-06 15:55:03 +09:00
Ian Barwick 9823978f41 repmgrd: don't wait for WAL receiver to reconnect during failover
If the WAL receiver has been temporarily disabled, we don't want to
wait for it to start up as it may not be able to at that point; we do
however need to reset "wal_retrieve_retry_interval".
2019-03-06 15:54:56 +09:00