repmgr

mirror of https://github.com/EnterpriseDB/repmgr.git synced 2026-03-23 15:16:29 +00:00

Author	SHA1	Message	Date
Ian Barwick	3231b5034d	Remove temporary debugging log output	2019-04-24 13:17:52 +09:00
Ian Barwick	58b33fb411	Clarify a couple of code comments	2019-04-24 10:55:53 +09:00
Ian Barwick	6cbf436bf8	Don't execute "child_nodes_disconnect_command" when repmgrd paused	2019-04-23 14:08:13 +09:00
Ian Barwick	5a90513878	repmgrd: monitor standbys attached to primary This functionality enables repmgrd (when running on the primary) to monitor connected child nodes. It will log connections and disconnections and generate events. Additionally, repmgrd can execute a custom script if the number of connected child nodes falls below a configurable threshold. This script can be used e.g. to "fence" the primary following a failover situation where a new primary has been promoted and all standbys are now child nodes of that primary.	2019-04-22 16:18:52 +09:00
Ian Barwick	a0c6cb602f	repmgrd: remove duplicate function definition	2019-04-16 10:53:05 +09:00
Ian Barwick	27803f93ff	repmgrd: always unset upstream node ID when monitoring a primary	2019-04-12 12:26:39 +09:00
Ian Barwick	46d17d0933	repmgrd: fix log output	2019-04-11 16:29:08 +09:00
Ian Barwick	6b79e08706	repmgrd: add additional log output in do_election()	2019-04-11 15:46:20 +09:00
Ian Barwick	cd6a55c7cb	repmgrd: improve primary visibility consensus check Exclude sibling nodes which report they're following a different node. This shouldn't happen, but could.	2019-04-11 15:46:14 +09:00
Ian Barwick	008bd00a59	repmgrd: store upstream node ID in shared memory	2019-04-11 15:46:09 +09:00
Ian Barwick	5a8741199f	repmgrd: exclude witness server from followability check	2019-04-11 11:19:12 +09:00
Ian Barwick	9164d3931b	repmgrd: clean up PQExpBuffer handling Unless the PQExpBuffer is required for the duration of the function, ensure it's always a variable local to the relevant code block. This mitigates the risk of accidentally accessing a generically named PQExpBuffer which hasn't been initialised or was previously terminated.	2019-03-26 13:15:25 +09:00
Ian Barwick	801ed2b0c8	repmgrd: don't terminate uninitialized PQExpBuffer	2019-03-26 11:35:45 +09:00
Ian Barwick	539861cb58	repmgrd: during failover, check if a node was already promoted Previously, repmgrd assumed that during a failover, there would not already be another primary node. However it's possible a node was promoted manually. While this is not a desirable situation, it's conceivable this could happen in the wild, so we should check for it and react accordingly. Also sanity-check that the follow target can actually be followed. Addresses issue raised in GitHub #420.	2019-03-22 14:06:41 +09:00
Ian Barwick	7434cc0b8e	repmgrd: improve witness node monitoring Mainly fix a couple of places where "standby" was hard-coded into a log message which can apply either to a witness or a standby.	2019-03-20 11:47:36 +09:00
Ian Barwick	46efe57cd0	Improve database connection failure logging Log the output of PQerrorMessage() in a couple of places where it was missing. Additionally, always log the output of PQerrorMessage() starting with a blank line, otherwise the first line looks like it was emitted by repmgr, and it's harder to scan the error message. Before: [2019-03-20 11:24:15] [DETAIL] could not connect to server: Connection refused Is the server running on host "localhost" (::1) and accepting TCP/IP connections on port 5501? could not connect to server: Connection refused Is the server running on host "localhost" (127.0.0.1) and accepting TCP/IP connections on port 5501? After: [2019-03-20 11:27:21] [DETAIL] could not connect to server: Connection refused Is the server running on host "localhost" (::1) and accepting TCP/IP connections on port 5501? could not connect to server: Connection refused Is the server running on host "localhost" (127.0.0.1) and accepting TCP/IP connections on port 5501?	2019-03-20 11:47:28 +09:00
Ian Barwick	426759ca8e	check_primary_status(): handle case where recovery type unknown	2019-03-18 16:16:54 +09:00
Ian Barwick	8ab51c2ae3	Refactor check_primary_status() Reduce nested if/else branching, and improve documentation.	2019-03-18 15:01:21 +09:00
Ian Barwick	43f28f4097	Clarify calls to check_primary_status() Use a constant rather than a magic number to indicate non-provision of elapsed degraded monitoring time.	2019-03-18 14:21:34 +09:00
Ian Barwick	c2206b007a	repmgrd: optionally check upstream availability through connection attempts	2019-03-14 15:44:53 +09:00
Ian Barwick	19bf4d7434	Count witness and zero-priority nodes in visibility check	2019-03-14 11:17:51 +09:00
Ian Barwick	56d9f5b856	Ensure witness node sets last upstream seen time	2019-03-14 10:53:47 +09:00
Ian Barwick	c3c58df7b9	repmgrd: improve logging output when executing "failover_validation_command"	2019-03-13 21:07:26 +09:00
Ian Barwick	573d027db6	repmgrd: various minor logging improvements	2019-03-13 11:27:17 +09:00
Ian Barwick	1afb41647b	repmgrd: remove global variable Make the "sibling_nodes" local, and pass by reference where relevant.	2019-03-12 17:12:23 +09:00
Ian Barwick	fc397f25f6	repmgrd: enable election rerun If "failover_validation_command" is set, and the command returns an error, rerun the election. There is a pause between reruns to avoid "churn"; the length of this pause is controlled by the configuration parameter "election_rerun_interval".	2019-03-12 17:12:19 +09:00
Ian Barwick	4ef706c2ca	Execute "failover_validation_command" when only one standby exists	2019-03-08 12:19:37 +09:00
Ian Barwick	db0d71c6a7	Initial implementation of "failover_validation_command"	2019-03-08 08:49:15 +09:00
Ian Barwick	33fefd9f52	Add configuration option "primary_visibility_consensus" This determines whether repmgrd should continue with a failover if one or more nodes report they can still see the primary.	2019-03-07 10:41:42 +09:00
Ian Barwick	a3f90d2bba	Add configuration option "sibling_nodes_disconnect_timeout" This controls the maximum length of time in seconds that repmgrd will wait for other standbys to disconnect their WAL receivers in a failover situation. This setting is only used when "standby_disconnect_on_failover" is set to "true".	2019-03-06 15:56:21 +09:00
Ian Barwick	2ed044c358	Reset "wal_retrieve_retry_interval" for all nodes	2019-03-06 15:55:03 +09:00
Ian Barwick	9823978f41	repmgrd: don't wait for WAL receiver to reconnect during failover If the WAL receiver has been temporarily disabled, we don't want to wait for it to start up as it may not be able to at that point; we do however need to reset "wal_retrieve_retry_interval".	2019-03-06 15:54:56 +09:00
Ian Barwick	f85b4cd98e	Log warning if "standby_disconnect_on_failover" used on pre-9.5 "standby_disconnect_on_failover" requires availability of "wal_retrieve_retry_interval", which is available from PostgreSQL 9.5. 9.4 will fall out of community support this year, so it doesn't seem productive at this point to do anything more than put the onus on the user to read the documentation and heed any warning messages in the logs.	2019-03-06 15:54:15 +09:00
Ian Barwick	1615353f48	repmgrd: optionally disconnect WAL receivers during failover This is intended to ensure that all nodes have a constant LSN while making the failover decision. This feature is experimental and needs to be explicitly enabled with the configuration file option "standby_disconnect_on_failover". Note enabling this option will result in a delay in the failover decision until the WAL receiver is disconnected on all nodes.	2019-03-06 15:53:57 +09:00
Ian Barwick	dd04ebb809	repmgrd: handle reconnect to restarted server when using "connection" checks	2019-03-06 14:54:05 +09:00
Ian Barwick	63f7ad546e	repmgrd: add option "connection_check_type" This enables selection of the method repmgrd uses to check whether the upstream node is available. Possible values are: - "ping" (default): uses PQping() to check server availability - "connection": executes a query on the connection to check server availability (similar to repmgr 3.x).	2019-03-06 12:09:54 +09:00
Ian Barwick	4f83111033	repmgrd: ignore invalid "upstream_last_seen" value	2019-03-05 11:00:29 +09:00
Ian Barwick	4b89cbd98d	Rename "..._primary_last_seen" functions to "..._upstream_last_seen" As that better reflects what they do.	2019-02-28 15:36:55 +09:00
Ian Barwick	790a1cc492	repmgrd: add additional logging during a failover operation	2019-02-27 11:46:05 +09:00
Ian Barwick	0c68018631	repmgrd: log details of nodes which can see primary If a failover is cancelled because other nodes can still see the primary, log the identities of those nodes.	2019-02-23 15:55:06 +09:00
Ian Barwick	b72c894db4	repmgrd: during failover, check if other nodes have seen the primary In a situation where only some standbys are cut off from the primary, a failover would result in a split brain/split cluster situation, as it's likely one of the cut-off standbys will promote itself, and other cut-off standbys (but not all standbys) will follow it. To prevent this happening, interrogate the other sibling nodes to check whether they've seen the primary within a reasonably short interval; if this is the case, do not take any failover action. This feature is experimental.	2019-02-23 13:03:22 +09:00
Ian Barwick	f1667a7e98	repmgrd: don't consider nodes where repmgrd is not running If, for whatever reason, repmgrd is not running on a node, but that node qualifies as promotion candidate, failover will not take place as that node will never promote itself. We therefore discount nodes where repmgrd is not running as promotion candidates, which will ensure one node is always promoted. There is a slight risk here that the node(s) where repmgrd is not running are further ahead, leading to a timeline fork. It might be possible to mitigate that by having the "election" leader perform the promote (or follow) operation.	2019-02-07 17:07:13 +09:00
Ian Barwick	c4332d9a52	repmgrd: forcibly resume WAL replay if paused If WAL replay is paused, and there is WAL pending replay, a promote command will be queued until replay is resumed. As it's conceivable that there are corner cases where one standby with replay paused has actually received the most WAL, we'll forcibly resume WAL replay so it can be reliably promoted, if needed. Related to GitHub #540.	2019-02-07 11:39:48 +09:00
Ian Barwick	b9cd321aed	repmgrd: skip LSN checks of 0 priority node The node will never become a candidate so we can save the round trip to fetch its LSN.	2019-02-06 14:27:01 +09:00
Ian Barwick	cd3312496e	Rename functions which return an LSN for clarity	2019-02-06 09:32:53 +09:00
Ian Barwick	f9a1861ded	Refactor ReplInfo struct handling Eventually we'll want to have this contain the optional replication info contained in the t_node_info struct, which should then contain a pointer to a ReplInfo struct.	2019-02-02 18:39:24 +09:00
Ian Barwick	efe4a9c344	repmgrd: log receipt of SIGINT/SIGTERM	2019-01-23 13:44:59 +09:00
Ian Barwick	1980deb480	repmgrd: check for a change to the upstream node If the upstream node has changed, for example after "repmgr standby follow" was manually executed, restart monitoring to ensure repmgrd is monitoring the correct node.	2019-01-22 13:33:13 +09:00
Ian Barwick	b6fe91ebcd	repmgrd: track status of local (standby) node If the local node is not available, note the degraded monitoring status.	2019-01-22 10:36:22 +09:00
Ian Barwick	44cbb44500	repmgrd: improve logging output for standby monitoring	2019-01-22 10:36:14 +09:00

155 Commits