repmgr

mirror of https://github.com/EnterpriseDB/repmgr.git synced 2026-03-24 07:36:30 +00:00

Author	SHA1	Message	Date
Ian Barwick	39443bbcee	Count witness and zero-priority nodes in visibility check	2019-03-15 14:06:58 +09:00
Ian Barwick	fc636b1bd2	Ensure witness node sets last upstream seen time	2019-03-15 14:06:55 +09:00
Ian Barwick	169c9ccd32	repmgrd: improve logging output when executing "failover_validation_command"	2019-03-15 14:06:34 +09:00
Ian Barwick	52bee6b98d	repmgrd: various minor logging improvements	2019-03-13 16:19:13 +09:00
Ian Barwick	ecb1f379f5	repmgrd: remove global variable Make the "sibling_nodes" local, and pass by reference where relevant.	2019-03-13 16:19:10 +09:00
Ian Barwick	e1cd2c22d4	repmgrd: enable election rerun If "failover_validation_command" is set, and the command returns an error, rerun the election. There is a pause between reruns to avoid "churn"; the length of this pause is controlled by the configuration parameter "election_rerun_interval".	2019-03-13 16:19:03 +09:00
Ian Barwick	45c896d716	Execute "failover_validation_command" when only one standby exists	2019-03-08 15:29:17 +09:00
Ian Barwick	531194fa27	Initial implementation of "failover_validation_command"	2019-03-08 15:29:06 +09:00
Ian Barwick	37892afcfc	Add configuration option "primary_visibility_consensus" This determines whether repmgrd should continue with a failover if one or more nodes report they can still see the primary.	2019-03-08 15:28:53 +09:00
Ian Barwick	e4e5e35552	Add configuration option "sibling_nodes_disconnect_timeout" This controls the maximum length of time in seconds that repmgrd will wait for other standbys to disconnect their WAL receivers in a failover situation. This setting is only used when "standby_disconnect_on_failover" is set to "true".	2019-03-08 15:28:48 +09:00
Ian Barwick	b320c1f0ae	Reset "wal_retrieve_retry_interval" for all nodes	2019-03-08 15:28:42 +09:00
Ian Barwick	280654bed6	repmgrd: don't wait for WAL receiver to reconnect during failover If the WAL receiver has been temporarily disabled, we don't want to wait for it to start up as it may not be able to at that point; we do however need to reset "wal_retrieve_retry_interval".	2019-03-08 15:28:27 +09:00
Ian Barwick	5d6eab74f6	Log warning if "standby_disconnect_on_failover" used on pre-9.5 "standby_disconnect_on_failover" requires availability of "wal_retrieve_retry_interval", which is available from PostgreSQL 9.5. 9.4 will fall out of community support this year, so it doesn't seem productive at this point to do anything more than put the onus on the user to read the documentation and heed any warning messages in the logs.	2019-03-08 15:28:01 +09:00
Ian Barwick	59b7453bbf	repmgrd: optionally disconnect WAL receivers during failover This is intended to ensure that all nodes have a constant LSN while making the failover decision. This feature is experimental and needs to be explicitly enabled with the configuration file option "standby_disconnect_on_failover". Note enabling this option will result in a delay in the failover decision until the WAL receiver is disconnected on all nodes.	2019-03-08 15:27:54 +09:00
Ian Barwick	bde8c7e29c	repmgrd: handle reconnect to restarted server when using "connection" checks	2019-03-08 15:27:49 +09:00
Ian Barwick	074d79b44f	repmgrd: add option "connection_check_type" This enables selection of the method repmgrd uses to check whether the upstream node is available. Possible values are: - "ping" (default): uses PQping() to check server availability - "connection": executes a query on the connection to check server availability (similar to repmgr 3.x).	2019-03-06 13:23:53 +09:00
Ian Barwick	2eeb288573	repmgrd: ignore invalid "upstream_last_seen" value	2019-03-06 13:23:47 +09:00
Ian Barwick	19bcfa7264	Rename "..._primary_last_seen" functions to "..._upstream_last_seen" As that better reflects what they do.	2019-03-06 13:23:33 +09:00
Ian Barwick	486877c3d5	repmgrd: log details of nodes which can see primary If a failover is cancelled because other nodes can still see the primary, log the identities of those nodes.	2019-03-06 13:23:27 +09:00
Ian Barwick	9753bcc8c3	repmgrd: during failover, check if other nodes have seen the primary In a situation where only some standbys are cut off from the primary, a failover would result in a split brain/split cluster situation, as it's likely one of the cut-off standbys will promote itself, and other cut-off standbys (but not all standbys) will follow it. To prevent this happening, interrogate the other sibling nodes to check whether they've seen the primary within a reasonably short interval; if this is the case, do not take any failover action. This feature is experimental.	2019-03-06 13:23:22 +09:00
Ian Barwick	0cd2bd2e91	repmgrd: add additional logging during a failover operation	2019-02-27 11:45:34 +09:00
Ian Barwick	f1667a7e98	repmgrd: don't consider nodes where repmgrd is not running If, for whatever reason, repmgrd is not running on a node, but that node qualifies as a promotion candidate, failover will not take place as that node will never promote itself. We therefore discount nodes where repmgrd is not running as promotion candidates, which will ensure one node is always promoted. There is a slight risk here that the node(s) where repmgrd is not running are further ahead, leading to a timeline fork. It might be possible to mitigate that by having the "election" leader perform the promote (or follow) operation.	2019-02-07 17:07:13 +09:00
Ian Barwick	c4332d9a52	repmgrd: forcibly resume WAL replay if paused If WAL replay is paused, and there is WAL pending replay, a promote command will be queued until replay is resumed. As it's conceivable that there are corner cases where one standby with replay paused has actually received the most WAL, we'll forcibly resume WAL replay so it can be reliably promoted, if needed. Related to GitHub #540.	2019-02-07 11:39:48 +09:00
Ian Barwick	b9cd321aed	repmgrd: skip LSN checks of 0 priority node The node will never become a candidate so we can save the round trip to fetch its LSN.	2019-02-06 14:27:01 +09:00
Ian Barwick	cd3312496e	Rename functions which return an LSN for clarity	2019-02-06 09:32:53 +09:00
Ian Barwick	f9a1861ded	Refactor ReplInfo struct handling Eventually we'll want to have this contain the optional replication info contained in the t_node_info struct, which should then contain a pointer to a ReplInfo struct.	2019-02-02 18:39:24 +09:00
Ian Barwick	efe4a9c344	repmgrd: log receipt of SIGINT/SIGTERM	2019-01-23 13:44:59 +09:00
Ian Barwick	1980deb480	repmgrd: check for a change to the upstream node If the upstream node has changed, for example after "repmgr standby follow" was manually executed, restart monitoring to ensure repmgrd is monitoring the correct node.	2019-01-22 13:33:13 +09:00
Ian Barwick	b6fe91ebcd	repmgrd: track status of local (standby) node If the local node is not available, note the degraded monitoring status.	2019-01-22 10:36:22 +09:00
Ian Barwick	44cbb44500	repmgrd: improve logging output for standby monitoring	2019-01-22 10:36:14 +09:00
Ian Barwick	7dce3ed234	Update copyright notices to 2019	2019-01-21 14:54:35 +09:00
Ian Barwick	58efb0f158	repmgrd: on a cascaded standby, don't fail over if "failover=manual" Addresses GitHub #531.	2019-01-21 14:16:49 +09:00
Ian Barwick	a6a2be2239	Teach witness repmgrd to deal with the absence of a primary Previously it would refuse to start if the primary was not reachable, the thinking being that it's pointless trying to monitor an incomplete cluster. However, following an aborted failover situation, repmgrd will restart monitoring, and on the witness server this will lead to it aborting itself due to the continuing absence of a primary. To resolve this, witness repmgrd will now start monitoring in degraded mode if no primary is found, in the hope a primary will reappear at some point.	2018-11-29 12:15:41 +09:00
Ian Barwick	0caec90d81	repmgrd: set primary last seen	2018-11-21 11:30:27 +09:00
Ian Barwick	e0d6d906e7	repmgrd: fix upstream role check Only take action if it's confirmed as a standby.	2018-10-23 12:47:55 +09:00
Ian Barwick	578f11003c	repmgrd: improve node role change detection	2018-10-19 11:25:11 +09:00
Ian Barwick	62ac56c3f5	repmgrd: handle case where upstream is no longer primary If the upstream comes back on line (e.g. after a switchover), and its status is no longer primary, restart monitoring to ensure the correct primary (potentially the current node) is being monitored.	2018-10-18 16:50:13 +09:00
Ian Barwick	c79852cce0	Ensure witness repmgrd detects change in upstream's role This ensures that e.g. after a switchover, repmgrd running on a witness node will automatically detect the new primary and monitor that.	2018-10-18 16:15:46 +09:00
Ian Barwick	3907a545b0	repmgrd: ensure witness node doesn't try and follow another witness Theoretically there should never be more than one witness node visible here, but it's not possible to rule it out, so add a check just in case.	2018-10-18 12:17:06 +09:00
Ian Barwick	b2348c9a70	repmgrd: improve promotion script failure handling While scanning for a new primary following a promotion script failure, repmgrd was treating a witness server as a potential new primary and would attempt to "follow" it. Fortunately "repmgr standby follow" would do the right thing and choose the actual primary, if available, otherwise do nothing, so the cluster would eventually end up in the correct state, albeit for the wrong reason. By skipping the witness server as a potential new primary, repmgrd will do the right thing if the original primary does come back online, i.e. resume monitoring as before.	2018-10-16 11:42:54 +09:00
Ian Barwick	3e38759c02	use appendPQExpBufferStr/-Char() consistently	2018-10-04 08:42:42 +09:00
Ian Barwick	2491b8ae52	Add functionality to "pause" repmgrd In some circumstances, e.g. while performing a switchover, it is essential that repmgrd does not take any kind of failover action, as this will put the cluster into an incorrect state. Previously it was necessary to stop repmgrd on all nodes (or at least those nodes which repmgrd would consider as promotion candidates), however this is a cumbersome and potentially risk-prone operation, particularly if the replication cluster contains more than a couple of servers. To prevent this issue from occurring, this patch introduces the ability to "pause" repmgrd on all nodes with a single command ("repmgr daemon pause") which notifies repmgrd not to take any failover action until the node is "unpaused" ("repmgr daemon unpause"). "repmgr daemon status" provides an overview of each node and whether repmgrd is running, and if so whether it is paused. "repmgr standby switchover" has been modified to automatically pause repmgrd while carrying out the switchover. See documentation for further details.	2018-09-27 16:42:10 +09:00
Ian Barwick	1f8f6f3a39	repmgrd: add notice about different location preventing standby promotion Though we note this in the DEBUG output, it's not immediately obvious from the logs, especially outside of the DEBUG log level, why a node didn't promote itself if it is in a different location to the primary.	2018-09-27 11:06:18 +09:00
Ian Barwick	97905b02ae	repmgrd: fix comment	2018-09-13 10:15:22 +09:00
Ian Barwick	5de2b1ee13	repmgrd: update local node id in shared memory after local node restart Also ensure local node restarts are handled more elegantly, so we're not surprised by a stale connection handle. GitHub #502.	2018-09-07 11:59:53 +09:00
Ian Barwick	17e75f6b31	repmgrd: improve reconnection handling Previously, if the server being monitored was not available, repmgrd would always close the existing connection handle and open a new one. However, in some cases, e.g. a brief network outage, the existing connection handle is still good and does not need to be reopened. This could be particularly problematic if monitoring_history is on, as this risks leaving orphan sessions on the primary which (given a sufficiently unstable network) could lead to all available backends being occupied. Instead, during an outage we now use a new connection to verify the server is accessible; if the old connection is still available (e.g. following a short network interruption) we continue using that; if not (e.g. the server was restarted), we use the new one.	2018-08-30 15:46:08 +09:00
Ian Barwick	ceeb6d7130	repmgrd: improve monitoring statistics logging Add more granular logging to help diagnose issues, and also keep track of when the last monitoring statistics update was set and emit that as DETAIL every time we emit a log status update.	2018-08-30 12:36:59 +09:00
Ian Barwick	221fb63e92	repmgrd: fix startup on witness node when local data is stale Previously, when running on a witness server, repmgrd didn't consider the local cache of the "repmgr.nodes" table might be outdated, e.g. as repmgrd wasn't running on the witness server during a failover, so could potentially end up monitoring a former primary now running as a standby. When running on a witness server, at startup repmgrd will now scan all nodes to determine the current primary, and refresh its local cache from there. This will also ensure it can start up even if the node currently registered as primary in the local cache is not available. Implements GitHub #488 and #489.	2018-08-20 15:29:29 +09:00
Ian Barwick	bc584d84f6	repmgrd: improve cascaded standby failover handling In particular, improve handling of the case where the standby follow command fails due to the primary not being available. GitHub #480.	2018-08-20 15:23:54 +09:00
Ian Barwick	76f5bcf3cd	repmgrd: fix PQExpBuffer handling in upstream failover handler Was sometimes leading to blank log lines.	2018-08-20 15:23:50 +09:00

135 Commits
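
Several of the commits listed above introduce repmgrd configuration options. As a rough orientation, the fragment below sketches how those options might be set together in repmgr.conf. The option names are taken from the commit messages; every value is an illustrative assumption rather than a documented default, the script path is hypothetical, and the unit of "election_rerun_interval" is assumed to be seconds (the commit message only describes it as the length of the pause between reruns).

```
# repmgr.conf fragment (illustrative sketch; values are assumptions, not defaults)

# How repmgrd checks upstream availability: "ping" (default) or "connection"
connection_check_type=ping

# Experimental: disconnect WAL receivers while the failover decision is made.
# Per the commit log this relies on "wal_retrieve_retry_interval" (PostgreSQL 9.5+).
standby_disconnect_on_failover=true

# Maximum time in seconds to wait for other standbys to disconnect their
# WAL receivers; only used when standby_disconnect_on_failover is true
sibling_nodes_disconnect_timeout=30

# Whether a failover proceeds when other nodes report they can still see
# the primary (assumed here to be a boolean setting)
primary_visibility_consensus=true

# Command run to validate a failover decision; the path is hypothetical
failover_validation_command='/path/to/validate_failover.sh'

# Pause between election reruns after the validation command returns an
# error (unit assumed to be seconds)
election_rerun_interval=15
```

The repmgr documentation for the release containing these commits is the place to confirm exact semantics and defaults before using any of these settings.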