repmgr

mirror of https://github.com/EnterpriseDB/repmgr.git synced 2026-03-23 23:26:30 +00:00

Author	SHA1	Message	Date
Ian Barwick	5886772cdb	Teach witness repmgrd to deal with the absence of a primary Previously it would refuse to start if the primary was not reachable, the thinking being that it's pointless trying to monitor an incomplete cluster. However following an aborted failover situation, repmgrd will restart monitoring and on the witness server, this will lead to it aborting itself due to the continuing absence of a primary. To resolve this, witness repmgrd will now start monitoring in degraded mode if no primary is found, in the hope a primary will reappear at some point.	2018-11-29 12:17:26 +09:00
Ian Barwick	0cafeb3828	repmgrd: fix upstream role check Only take action if it's confirmed as a standby.	2018-10-23 12:50:04 +09:00
Ian Barwick	77c9092794	repmgrd: improve node role change detection	2018-10-19 11:33:08 +09:00
Ian Barwick	0842560a88	repmgrd: handle case where upstream is no longer primary If the upstream comes back on line (e.g. after a switchover), and its status is no longer primary, restart monitoring to ensure the correct primary (potentially the current node) is being monitored.	2018-10-18 17:04:14 +09:00
Ian Barwick	8bec4946bc	Ensure witness repmgrd detects change in upstream's role This ensures that e.g. after a switchover, repmgrd running on a witness node will automatically detect the new primary and monitor that.	2018-10-18 16:15:52 +09:00
Ian Barwick	3ab22f9442	repmgrd: ensure witness node doesn't try and follow another witness Theoretically there should never be more than one witness node visible here, but it's not possible to rule it out, so add a check just in case.	2018-10-18 12:20:04 +09:00
Ian Barwick	b2348c9a70	repmgrd: improve promotion script failure handling While scanning for a new primary following a promotion script failure, repmgrd was treating a witness server as a potential new primary and would attempt to "follow" it. Fortunately "repmgr standby follow" would do the right thing and choose the actual primary, if available, otherwise do nothing, so the cluster would eventually end up in the correct state, albeit for the wrong reason. By skipping the witness server as a potential new primary, repmgrd will do the right thing if the original primary does come back online, i.e. resume monitoring as before.	2018-10-16 11:42:54 +09:00
Ian Barwick	3e38759c02	use appendPQExpBufferStr/-Char() consistently	2018-10-04 08:42:42 +09:00
Ian Barwick	2491b8ae52	Add functionality to "pause" repmgrd In some circumstances, e.g. while performing a switchover, it is essential that repmgrd does not take any kind of failover action, as this will put the cluster into an incorrect state. Previously it was necessary to stop repmgrd on all nodes (or at least those nodes which repmgrd would consider as promotion candidates), however this is a cumbersome and potentially risk-prone operation, particularly if the replication cluster contains more than a couple of servers. To prevent this issue from occurring, this patch introduces the ability to "pause" repmgrd on all nodes with a single command ("repmgr daemon pause") which notifies repmgrd not to take any failover action until the node is "unpaused" ("repmgr daemon unpause"). "repmgr daemon status" provides an overview of each node and whether repmgrd is running, and if so whether it is paused. "repmgr standby switchover" has been modified to automatically pause repmgrd while carrying out the switchover. See documentation for further details.	2018-09-27 16:42:10 +09:00
Ian Barwick	1f8f6f3a39	repmgrd: add notice about different location preventing standby promotion Though we note this in the DEBUG output, it's not immediately obvious from the logs, especially outside of the DEBUG log level, why a node didn't promote itself if it is in a different location to the primary.	2018-09-27 11:06:18 +09:00
Ian Barwick	97905b02ae	repmgrd: fix comment	2018-09-13 10:15:22 +09:00
Ian Barwick	5de2b1ee13	repmgrd: update local node id in shared memory after local node restart Also ensure local node restarts are handled more elegantly, so we're not surprised by a stale connection handle. GitHub #502.	2018-09-07 11:59:53 +09:00
Ian Barwick	17e75f6b31	repmgrd: improve reconnection handling Previously, if the server being monitored was not available, repmgrd would always close the existing connection handle and open a new one. However, in some cases, e.g. a brief network outage, the existing connection handle is still good and does not need to be reopened. This could be particularly problematic if monitoring_history is on, as this risks leaving orphan sessions on the primary which (given a sufficiently unstable network) could lead to all available backends being occupied. Instead, during an outage we now use a new connection to verify the server is accessible; if the old connection is still available (e.g. following a short network interruption) we continue using that; if not (e.g. the server was restarted), we use the new one.	2018-08-30 15:46:08 +09:00
Ian Barwick	ceeb6d7130	repmgrd: improve monitoring statistics logging Add more granular logging to help diagnose issues, and also keep track of when the last monitoring statistics update was set and emit that as DETAIL every time we emit a log status update.	2018-08-30 12:36:59 +09:00
Ian Barwick	221fb63e92	repmgrd: fix startup on witness node when local data is stale Previously, when running on a witness server, repmgrd didn't consider the local cache of the "repmgr.nodes" table might be outdated, e.g. as repmgrd wasn't running on the witness server during a failover, so could potentially end up monitoring a former primary now running as a standby. When running on a witness server, at startup repmgrd will now scan all nodes to determine the current primary, and refresh its local cache from there. This will also ensure it can start up even if the node currently registered as primary in the local cache is not available. Implements GitHub #488 and #489.	2018-08-20 15:29:29 +09:00
Ian Barwick	bc584d84f6	repmgrd: improve cascaded standby failover handling In particular, improve handling of the case where the standby follow command fails due to the primary not being available. GitHub #480.	2018-08-20 15:23:54 +09:00
Ian Barwick	76f5bcf3cd	repmgrd: fix PQExpBuffer handling in upstream failover handler Was sometimes leading to blank log lines.	2018-08-20 15:23:50 +09:00
Ian Barwick	b1aab930af	repmgrd: don't imply primary is in recovery if it's not available	2018-08-20 15:23:46 +09:00
Ian Barwick	58994365ff	repmgrd: fix "repmgrd_upstream_reconnect" event notification Upstream node is not always the primary node. Per report in GitHub #480.	2018-08-20 15:23:42 +09:00
Ian Barwick	b61f853a69	repmgrd: ensure primary connection handle is refreshed after reconnect In some circumstances, if monitoring history was in use, repmgrd was attempting to fetch the primary's current LSN on a stale connection handle.	2018-08-15 16:55:03 +09:00
Ian Barwick	44a224ad92	repmgrd: fix configuration file reloading Don't allow "promote_command" or "follow_command" to be empty. GitHub #486.	2018-08-02 16:35:26 +09:00
Ian Barwick	33dedf4e96	repmgrd: always reopen log file after receiving SIGHUP For whatever reason, since at least repmgr 2.0 the log file was only ever reopened if a configuration file change took place. GitHub #485.	2018-08-02 10:54:31 +09:00
Ian Barwick	a87f18682c	repmgrd: consolidate SIGHUP handling Move identical code blocks into single function.	2018-08-02 10:54:12 +09:00
Ian Barwick	bd58e4128c	repmgrd: log "promote_command" at log_level "INFO" If repmgrd is promoting the local node, it was only logging the contents of "promote_command" at DEBUG level; it would be useful to see this at the default log level. Related to GitHub #473.	2018-07-16 15:33:10 +09:00
Ian Barwick	63242e2277	doc: update documentation of "promote_command" and "service_promote_command" The documentation implied it would override "promote_command", which is not the case. "promote_command" is used by repmgrd to execute "repmgr standby promote" (either directly or via a custom script). "service_promote_command" can be set to specify a package-level service command to promote the local PostgreSQL instance from standby to primary, e.g. Debian's pg_ctlcluster. If set, this will be executed by "repmgr standby promote". Also update code comments to clarify usage. Related to GitHub #473.	2018-07-16 14:43:53 +09:00
Ian Barwick	17f30ec364	repmgrd: add additional local node connection check It's possible there are corner-cases where do_election() is called while the local connection is invalid, so perform an additional check.	2018-07-11 15:11:20 +09:00
Ian Barwick	b2081dca52	De-overload configuration file parameter "standby_reconnect_timeout" Currently the (very generic sounding) "standby_reconnect_timeout" configuration file parameter is used in several different contexts and it would be useful to have more granular control over the different timeouts it's used to configure. This patch introduces "node_rejoin_timeout", used in place of "standby_reconnect_timeout" (which wasn't documented) when "repmgr node rejoin" is executed, to determine how long to wait for the node to rejoin the replication cluster. Additionally "repmgrd_standby_startup_timeout" is introduced as a timeout for failover situations, when repmgrd executes "repmgr standby follow" to follow a new primary, and waits for the standby to restart and become available for connections. "standby_reconnect_timeout" is now only relevant for "repmgr standby switchover". Implements GitHub #454.	2018-06-28 18:00:55 +09:00
Ian Barwick	95fe7ea621	repmgrd: ensure local node is counted as quorum member Rename "standby_nodes" to "sibling_nodes" to make it clearer in the code what total is actually provided by the struct. Addresses GitHub #439.	2018-06-07 15:04:12 +09:00
Ian Barwick	043a6c5bea	repmgrd: ensure degraded monitoring timeout works on standby Parameter "degraded_monitoring_timeout" was not being acted on when monitoring a streaming replication standby. Addresses GitHub #439.	2018-06-07 15:03:52 +09:00
Martín Marqués	49418e096e	Fix typo in a code comment	2018-05-19 12:30:03 -03:00
Ian Barwick	6f315c1b3c	repmgrd: don't explicitly close connections on shutdown	2018-05-01 10:21:10 +09:00
Ian Barwick	16048a879e	repmgrd: notify sibling nodes to follow new primary after pg_ctl timeout If "pg_ctl promote" fails due to a timeout, but the promotion itself succeeds, have repmgrd on the new primary explicitly notify any sibling nodes to follow it. Previously the sibling nodes would wait "primary_notification_timeout" seconds before attempting to discover the new primary. This (and preceding commit `eac80ae`) address GitHub #425.	2018-04-27 11:54:21 +09:00
Ian Barwick	eac80ae9c1	repmgrd: handle pg_ctl timeout It's possible "pg_ctl promote" will time out, causing "repmgr standby follow" to return with an error; however the promotion itself will usually succeed, so detect this case and handle accordingly.	2018-04-26 19:19:42 +09:00
Ian Barwick	7822aa784f	repmgrd: catch corner case in standby connection handle check If repmgrd marks the local node as unavailable, and it was actually restarting but a failover event occurred before the next local node check, failover will continue with the stale connection handle. Add a final local node check just before starting the failover process, so repmgrd can reconnect if it wasn't able to before.	2018-04-24 21:56:57 +09:00
Ian Barwick	4455ded935	repmgrd: prevent standby connection handle from going stale If monitoring history not in use, there's no activity on the standby's connection handle, so if e.g. the standby is restarted, PQstatus() never returns CONNECTION_BAD and repmgrd never notices the connection is stale. Therefore execute a throw-away statement at "monitor_interval_secs".	2018-04-24 21:56:52 +09:00
Ian Barwick	fd0b850f41	Minor doc and log output tweaks	2018-04-24 21:08:05 +09:00
Ian Barwick	85ab2d94b7	repmgrd: tweak event notifications on standby failure The event notification was only being created if there was a valid primary connection; it should be created in any case, so an event notification script can be executed.	2018-04-20 10:15:08 +09:00
Ian Barwick	96811ccc01	repmgrd: tweak log notices when marking a standby as failed Announce what we're going to do (set the node record inactive) before performing the action. Makes reading the log slightly easier.	2018-04-03 14:37:43 +09:00
Ian Barwick	73982859f6	repmgrd: improve log output - emit explicit startup NOTICE - emit NOTICE when falling back to degraded monitoring on a primary node - improve log message and event notification details when monitoring a former primary which has been reconnected as a standby	2018-04-03 14:37:06 +09:00
Ian Barwick	5e4bdb5a1b	repmgrd: handle failover with two nodes in the primary location If two nodes were in the primary location, and at least one node in another location, the non-failed node in the primary location was not recognising itself as a promotion candidate. Addresses GitHub #407.	2018-04-02 20:51:27 +09:00
Ian Barwick	a403da67bc	Consolidate connection closure calls	2018-03-27 16:43:59 +09:00
Ian Barwick	0e55a60660	Add event "repmgrd_failover_aborted"	2018-03-21 13:23:06 +09:00
Ian Barwick	81c69e3677	repmgrd: fix typo	2018-03-21 12:36:15 +09:00
Ian Barwick	2a99dfa15b	repmgrd: fix failover handling in "manual" mode Regression was introduced in commit `c7a585c555`	2018-03-07 19:21:40 +09:00
Ian Barwick	cdb504d700	Add event "repmgrd_shutdown" Implements GitHub #393	2018-03-06 11:00:03 +09:00
Ian Barwick	0af2077bed	repmgrd: add debug log output for "monitor_interval_secs" sleep in all modes	2018-03-06 10:56:21 +09:00
Ian Barwick	bc766a48ed	repmgrd: retry standby connection after cascading standby failover	2018-03-02 11:05:07 +09:00
Ian Barwick	55441f2729	repmgrd: add configuration file parameter "standby_reconnect_timeout" This is used for determining a timeout when reconnecting to the standby after executing the "follow_command". This will normally not need to be set explicitly, but may be useful in cases where the standby's startup phase can last longer than usual.	2018-03-02 11:04:56 +09:00
Ian Barwick	c1356b9e0d	repmgrd: retry standby connection after "follow_command" executed It's possible that the standby is still starting up after the "follow_command" completes, so poll for a while until we get a connection.	2018-03-02 11:04:19 +09:00
Ian Barwick	22b3a74fa0	repmgrd: improve detection of status change from primary to standby If repmgrd is running in degraded mode on a primary which has been stopped, then manually brought back online as a standby (e.g. by creating recovery.conf and starting the server), ensure it not only detects the change but automatically updates the node record so it can resume monitoring the node as a standby. Previously, repmgrd was looping waiting for the record to be updated (as is done transparently when executing "repmgr node rejoin") but if the record was not updated within the timeout period (e.g. by "repmgr standby register") it would fail to resume monitoring as a standby. It seems reasonable to have repmgrd automatically update the node record, as this will restore failover capability as quickly as possible. If this is not desired, then the onus is on the user to shut down repmgrd while making the desired changes.	2018-02-22 15:50:45 +09:00
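
The reconnection behaviour described in commit 17e75f6b31 (verify the server with a new connection, keep the old handle if it still works, otherwise switch to the new one) can be sketched roughly as follows. This is an illustrative Python model only, not repmgrd's actual C code; the names `Action` and `choose_connection` are hypothetical, and the two booleans stand in for real connection checks.

```python
from enum import Enum


class Action(Enum):
    """Possible outcomes of one reconnection check (hypothetical names)."""
    STILL_DOWN = "server not reachable; keep waiting"
    KEEP_EXISTING = "server reachable and old handle usable; keep the old handle"
    USE_NEW = "server reachable but old handle dead; adopt the new handle"


def choose_connection(new_conn_ok: bool, old_conn_ok: bool) -> Action:
    """Decide which connection handle to use during an outage.

    new_conn_ok: a freshly opened probe connection succeeded.
    old_conn_ok: the pre-existing handle still responds.
    """
    if not new_conn_ok:
        # The probe failed, so the server is treated as unavailable.
        return Action.STILL_DOWN
    if old_conn_ok:
        # E.g. a short network interruption: reuse the existing session
        # rather than leaving an orphan session behind on the server.
        return Action.KEEP_EXISTING
    # E.g. the server was restarted: the old handle is stale.
    return Action.USE_NEW


if __name__ == "__main__":
    for new_ok in (False, True):
        for old_ok in (False, True):
            print(new_ok, old_ok, choose_connection(new_ok, old_ok).name)
```

In this model the old handle is preferred whenever it still works, which matches the commit's stated aim of not accumulating orphan sessions on the primary when monitoring_history is on.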

102 Commits