This enables selection of the method repmgrd uses to check whether the upstream
node is available. Possible values are:
- "ping" (default): uses PQping() to check server availability
- "connection": executes a query on the connection to check server
availability (similar to repmgr3.x).
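For illustration, a minimal repmgr.conf excerpt; this assumes the parameter
corresponds to the connection_check_type setting found in current repmgr
releases:
    # Method repmgrd uses to check upstream availability:
    # "ping" (default) uses libpq's PQping(); "connection" executes
    # a query on the connection, as repmgr 3.x did.
    connection_check_type = 'ping'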
In a situation where only some standbys are cut off from the primary,
a failover would result in a split brain/split cluster situation,
as it's likely one of the cut-off standbys will promote itself, and
other cut-off standbys (but not all standbys) will follow it.
To prevent this from happening, interrogate the other sibling nodes to
check whether they've seen the primary within a reasonably short interval;
if this is the case, do not take any failover action.
This feature is experimental.
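A minimal sketch of enabling this in repmgr.conf, assuming the behaviour is
governed by the primary_visibility_consensus parameter (the name used for
this feature in later repmgr releases):
    # Before initiating failover, ask the sibling nodes whether any of
    # them has seen the primary recently; if one has, take no action.
    primary_visibility_consensus = true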
If, for whatever reason, repmgrd is not running on a node, but that
node qualifies as a promotion candidate, failover will not take place,
as that node will never promote itself.
We therefore discount nodes where repmgrd is not running as promotion
candidates, which will ensure one node is always promoted.
There is a slight risk here that the node(s) where repmgrd is not running
are further ahead, leading to a timeline fork. It might be possible
to mitigate that by having the "election" leader perform the promote
(or follow) operation.
If WAL replay is paused, and there is WAL pending replay, a promote command
will be queued until replay is resumed.
As it's conceivable that there are corner cases where one standby with
replay paused has actually received the most WAL, we'll forcibly
resume WAL replay so it can be reliably promoted, if needed.
Related to GitHub #540.
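For reference, the replay state involved can be inspected and resumed
manually using PostgreSQL's built-in functions (names as of PostgreSQL 10;
shown here via psql):
    # Check whether WAL replay is currently paused on this standby.
    psql -c 'SELECT pg_is_wal_replay_paused()'
    # Resume replay so pending WAL is applied before any promotion.
    psql -c 'SELECT pg_wal_replay_resume()'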
Eventually we'll want this to contain the optional replication info
from the t_node_info struct, which should then hold a pointer to a
ReplInfo struct.
If the upstream node has changed, for example after "repmgr standby follow"
was manually executed, restart monitoring to ensure repmgrd is monitoring the
correct node.
Previously it would refuse to start if the primary was not reachable,
the thinking being that it's pointless trying to monitor an incomplete
cluster.
However, following an aborted failover, repmgrd will restart monitoring;
on the witness server this would lead to it aborting itself due to the
continuing absence of a primary.
To resolve this, repmgrd on a witness will now start monitoring in
degraded mode if no primary is found, in the hope that a primary will
reappear at some point.
If the upstream comes back online (e.g. after a switchover), and its
status is no longer primary, restart monitoring to ensure the correct
primary (potentially the current node) is being monitored.
While scanning for a new primary following a promotion script failure,
repmgrd was treating a witness server as a potential new primary
and would attempt to "follow" it. Fortunately "repmgr standby follow"
would do the right thing and choose the actual primary, if available,
otherwise do nothing, so the cluster would eventually end up in the
correct state, albeit for the wrong reason.
By skipping the witness server as a potential new primary,
repmgrd will do the right thing if the original primary does come
back online, i.e. resume monitoring as before.
In some circumstances, e.g. while performing a switchover, it is essential
that repmgrd does not take any kind of failover action, as this will put
the cluster into an incorrect state.
Previously it was necessary to stop repmgrd on all nodes (or at least
those nodes which repmgrd would consider as promotion candidates);
however, this is a cumbersome and potentially risk-prone operation, particularly if the
replication cluster contains more than a couple of servers.
To prevent this issue from occurring, this patch introduces the ability
to "pause" repmgrd on all nodes wth a single command ("repmgr daemon pause")
which notifies repmgrd not to take any failover action until the node
is "unpaused" ("repmgr daemon unpause").
"repmgr daemon status" provides an overview of each node and whether repmgrd
is running, and if so whether it is paused.
"repmgr standby switchover" has been modified to automatically pause repmgrd
while carrying out the switchover.
See documentation for further details.
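A typical sequence (the configuration file path is illustrative):
    # Disable failover actions cluster-wide before maintenance.
    repmgr -f /etc/repmgr.conf daemon pause
    # Show each node's repmgrd state, including its paused status.
    repmgr -f /etc/repmgr.conf daemon status
    # Re-enable failover actions once maintenance is complete.
    repmgr -f /etc/repmgr.conf daemon unpause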
Though we note this in the DEBUG output, it's not immediately obvious
from the logs, especially at log levels above DEBUG, why a node
didn't promote itself if it is in a different location to the primary.
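For context, the location in question is the repmgr.conf location
parameter; by default a node whose location differs from the primary's
will not be considered as a promotion candidate:
    # Arbitrary label grouping nodes by physical/network location;
    # the value "dc1" here is purely illustrative.
    location = 'dc1'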
Previously, if the server being monitored was not available, repmgrd
would always close the existing connection handle and open a new one.
However, in some cases, e.g. a brief network outage, the existing
connection handle is still good and does not need to be reopened.
This could be particularly problematic if monitoring_history is on,
as this risks leaving orphan sessions on the primary which (given
a sufficiently unstable network) could lead to all available backends
being occupied.
Instead, during an outage we now use a new connection to verify
the server is accessible; if the old connection is still available
(e.g. following a short network interruption) we continue using that;
if not (e.g. the server was restarted), we use the new one.
Add more granular logging to help diagnose issues, and also keep track
of when the monitoring statistics were last updated, emitting that
as DETAIL with every log status update.
Previously, when running on a witness server, repmgrd didn't consider
that the local cache of the "repmgr.nodes" table might be outdated, e.g.
as repmgrd wasn't running on the witness server during a failover,
so could potentially end up monitoring a former primary now running
as a standby.
When running on a witness server, at startup repmgrd will now scan
all nodes to determine the current primary, and refresh its local
cache from there. This will also ensure it can start up even if the
node currently registered as primary in the local cache is not available.
Implements GitHub #488 and #489.
When promoting the local node, repmgrd was only logging the contents
of "promote_command" at DEBUG level; it would be useful to see this at
the default log level.
Related to GitHub #473.
The documentation implied "service_promote_command" would override
"promote_command", which is not the case.
"promote_command" is used by repmgrd to execute "repmgr standby promote"
(either directly or via a custom script).
"service_promote_command" can be set to specify a package-level service
command to promote the local PostgreSQL instance from standby to primary,
e.g. Debian's pg_ctlcluster. If set, this will be executed by "repmgr standby promote".
Also update code comments to clarify usage.
Related to GitHub #473.
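An illustrative repmgr.conf excerpt showing how the two parameters fit
together (paths and the pg_ctlcluster invocation assume a Debian-style
layout and cluster name, and are not prescriptive):
    # Executed by repmgrd to perform the promotion.
    promote_command = '/usr/bin/repmgr standby promote -f /etc/repmgr.conf'
    # If set, executed by "repmgr standby promote" in place of "pg_ctl promote".
    service_promote_command = 'pg_ctlcluster 11 main promote'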
Currently the (very generic sounding) "standby_reconnect_timeout" configuration
file parameter is used in several different contexts and it would be useful
to have more granular control over the different timeouts it's used to configure.
This patch introduces "node_rejoin_timeout", used in place of "standby_reconnect_timeout"
(which wasn't documented) when "repmgr node rejoin" is executed, to determine
how long to wait for the node to rejoin the replication cluster.
Additionally "repmgrd_standby_startup_timeout" is introduced as a timeout for
failover situations, when repmgrd executes "repmgr standby follow" to follow
a new primary, and waits for the standby to restart and become available
for connections.
"standby_reconnect_timeout" is now only relevant for "repmgr standby switchover".
Implements GitHub #454.
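An illustrative repmgr.conf excerpt with the three timeouts as of this
change (values in seconds, chosen arbitrarily):
    # "repmgr node rejoin": how long to wait for the node to rejoin
    # the replication cluster.
    node_rejoin_timeout = 60
    # repmgrd failover: how long to wait for a standby to restart and
    # accept connections after "repmgr standby follow".
    repmgrd_standby_startup_timeout = 60
    # Now only used by "repmgr standby switchover".
    standby_reconnect_timeout = 60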
If "pg_ctl promote" fails due to a timeout, but the promotion itself succeeds,
have repmgrd on the new primary explicitly notify any sibling nodes to
follow it.
Previously the sibling nodes would wait "primary_notification_timeout" seconds
before attempting to discover the new primary.
This (together with the preceding commit eac80ae) addresses GitHub #425.
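The interval mentioned above is set in repmgr.conf (the value shown is
illustrative):
    # How long sibling nodes wait for notification from a new primary
    # before attempting to discover it themselves.
    primary_notification_timeout = 60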