repmgr

mirror of https://github.com/EnterpriseDB/repmgr.git synced 2026-03-24 15:46:29 +00:00

Author	SHA1	Message	Date
Ian Barwick	5de2b1ee13	repmgrd: update local node id in shared memory after local node restart Also ensure local node restarts are handled more elegantly, so we're not surprised by a stale connection handle. GitHub #502.	2018-09-07 11:59:53 +09:00
Ian Barwick	17e75f6b31	repmgrd: improve reconnection handling Previously, if the server being monitored was not available, repmgrd would always close the existing connection handle and open a new one. However, in some cases, e.g. a brief network outage, the existing connection handle is still good and does not need to be reopened. This could be particularly problematic if monitoring_history is on, as this risks leaving orphan sessions on the primary which (given a sufficiently unstable network) could lead to all available backends being occupied. Instead, during an outage we now use a new connection to verify the server is accessible; if the old connection is still available (e.g. following a short network interruption) we continue using that; if not (e.g. the server was restarted), we use the new one.	2018-08-30 15:46:08 +09:00
Ian Barwick	ceeb6d7130	repmgrd: improve monitoring statistics logging Add more granular logging to help diagnose issues, and also keep track of when the last monitoring statistics update was set and emit that as DETAIL every time we emit a log status update.	2018-08-30 12:36:59 +09:00
Ian Barwick	221fb63e92	repmgrd: fix startup on witness node when local data is stale Previously, when running on a witness server, repmgrd didn't consider the local cache of the "repmgr.nodes" table might be outdated, e.g. as repmgrd wasn't running on the witness server during a failover, so could potentially end up monitoring a former primary now running as a standby. When running on a witness server, at startup repmgrd will now scan all nodes to determine the current primary, and refresh its local cache from there. This will also ensure it can start up even if the node currently registered as primary in the local cache is not available. Implements GitHub #488 and #489.	2018-08-20 15:29:29 +09:00
Ian Barwick	bc584d84f6	repmgrd: improve cascaded standby failover handling In particular, improve handling of the case where the standby follow command fails due to the primary not being available. GitHub #480.	2018-08-20 15:23:54 +09:00
Ian Barwick	76f5bcf3cd	repmgrd: fix PQExpBuffer handling in upstream failover handler Was sometimes leading to blank log lines.	2018-08-20 15:23:50 +09:00
Ian Barwick	b1aab930af	repmgrd: don't imply primary is in recovery if it's not available	2018-08-20 15:23:46 +09:00
Ian Barwick	58994365ff	repmgrd: fix "repmgrd_upstream_reconnect" event notification Upstream node is not always the primary node. Per report in GitHub #480.	2018-08-20 15:23:42 +09:00
Ian Barwick	b61f853a69	repmgrd: ensure primary connection handle is refreshed after reconnect In some circumstances, if monitoring history was in use, repmgrd was attempting to fetch the primary's current LSN on a stale connection handle.	2018-08-15 16:55:03 +09:00
Ian Barwick	44a224ad92	repmgrd: fix configuration file reloading Don't allow "promote_command" or "follow_command" to be empty. GitHub #486.	2018-08-02 16:35:26 +09:00
Ian Barwick	33dedf4e96	repmgrd: always reopen log file after receiving SIGHUP For whatever reason, since at least repmgr 2.0 the log file was only ever reopened if a configuration file change took place. GitHub #485.	2018-08-02 10:54:31 +09:00
Ian Barwick	a87f18682c	repmgrd: consolidate SIGHUP handling Move identical code blocks into single function.	2018-08-02 10:54:12 +09:00
Ian Barwick	bd58e4128c	repmgrd: log "promote_command" at log_level "INFO" If repmgrd is promoting the local node, it was only logging the contents of "promote_command" at DEBUG level; it would be useful to see this at the default log level. Related to GitHub #473.	2018-07-16 15:33:10 +09:00
Ian Barwick	63242e2277	doc: update documentation of "promote_command" and "service_promote_command" The documentation implied it would override "promote_command", which is not the case. "promote_command" is used by repmgrd to execute "repmgr standby promote" (either directly or via a custom script). "service_promote_command" can be set to specify a package-level service command to promote the local PostgreSQL instance from standby to primary, e.g. Debian's pg_ctlcluster. If set, this will be executed by "repmgr standby promote". Also update code comments to clarify usage. Related to GitHub #473.	2018-07-16 14:43:53 +09:00
Ian Barwick	17f30ec364	repmgrd: add additional local node connection check It's possible there are corner-cases where do_election() is called while the local connection is invalid, so perform an additional check.	2018-07-11 15:11:20 +09:00
Ian Barwick	b2081dca52	De-overload configuration file parameter "standby_reconnect_timeout" Currently the (very generic sounding) "standby_reconnect_timeout" configuration file parameter is used in several different contexts and it would be useful to have more granular control over the different timeouts it's used to configure. This patch introduces "node_rejoin_timeout", used in place of "standby_reconnect_timeout" (which wasn't documented) when "repmgr node rejoin" is executed, to determine how long to wait for the node to rejoin the replication cluster. Additionally "repmgrd_standby_startup_timeout" is introduced as a timeout for failover situations, when repmgrd executes "repmgr standby follow" to follow a new primary, and waits for the standby to restart and become available for connections. "standby_reconnect_timeout" is now only relevant for "repmgr standby switchover". Implements GitHub #454.	2018-06-28 18:00:55 +09:00
Ian Barwick	95fe7ea621	repmgrd: ensure local node is counted as quorum member Rename "standby_nodes" to "sibling_nodes" to make it clearer in the code what total is actually provided by the struct. Addresses GitHub #439.	2018-06-07 15:04:12 +09:00
Ian Barwick	043a6c5bea	repmgrd: ensure degraded monitoring timeout works on standby Parameter "degraded_monitoring_timeout" was not being acted on when monitoring a streaming replication standby. Addresses GitHub #439.	2018-06-07 15:03:52 +09:00
Martín Marqués	49418e096e	Fix typo in a code comment	2018-05-19 12:30:03 -03:00
Ian Barwick	6f315c1b3c	repmgrd: don't explicitly close connections on shutdown	2018-05-01 10:21:10 +09:00
Ian Barwick	16048a879e	repmgrd: notify sibling nodes to follow new primary after pg_ctl timeout If "pg_ctl promote" fails due to a timeout, but the promotion itself succeeds, have repmgrd on the new primary explicitly notify any sibling nodes to follow it. Previously the sibling nodes would wait "primary_notification_timeout" seconds before attempting to discover the new primary. This (and preceding commit `eac80ae`) address GitHub #425.	2018-04-27 11:54:21 +09:00
Ian Barwick	eac80ae9c1	repmgrd: handle pg_ctl timeout It's possible "pg_ctl promote" will time out, causing "repmgr standby follow" to return with an error; however the promotion itself will usually succeed, so detect this case and handle accordingly.	2018-04-26 19:19:42 +09:00
Ian Barwick	7822aa784f	repmgrd: catch corner case in standby connection handle check If repmgrd marks the local node as unavailable, and it was actually restarting but a failover event occurred before the next local node check, failover will continue with the stale connection handle. Add a final local node check just before starting the failover process, so repmgrd can reconnect if it wasn't able to before.	2018-04-24 21:56:57 +09:00
Ian Barwick	4455ded935	repmgrd: prevent standby connection handle from going stale If monitoring history not in use, there's no activity on the standby's connection handle, so if e.g. the standby is restarted, PQstatus() never returns CONNECTION_BAD and repmgrd never notices the connection is stale. Therefore execute a throw-away statement at "monitor_interval_secs".	2018-04-24 21:56:52 +09:00
Ian Barwick	fd0b850f41	Minor doc and log output tweaks	2018-04-24 21:08:05 +09:00
Ian Barwick	85ab2d94b7	repmgrd: tweak event notifications on standby failure The event notification was only being created if there was a valid primary connection; it should be created in any case, so an event notification script can be executed.	2018-04-20 10:15:08 +09:00
Ian Barwick	96811ccc01	repmgrd: tweak log notices when marking a standby as failed Announce what we're going to do (set the node record inactive) before performing the action. Makes reading the log slightly easier.	2018-04-03 14:37:43 +09:00
Ian Barwick	73982859f6	repmgrd: improve log output - emit explicit startup NOTICE - emit NOTICE when falling back to degraded monitoring on a primary node - improve log message and event notification details when monitoring a former primary which has been reconnected as a standby	2018-04-03 14:37:06 +09:00
Ian Barwick	5e4bdb5a1b	repmgrd: handle failover with two nodes in the primary location If two nodes were in the primary location, and at least one node in another location, the non-failed node in the primary location was not recognising itself as a promotion candidate. Addresses GitHub #407.	2018-04-02 20:51:27 +09:00
Ian Barwick	a403da67bc	Consolidate connection closure calls	2018-03-27 16:43:59 +09:00
Ian Barwick	0e55a60660	Add event "repmgrd_failover_aborted"	2018-03-21 13:23:06 +09:00
Ian Barwick	81c69e3677	repmgrd: fix typo	2018-03-21 12:36:15 +09:00
Ian Barwick	2a99dfa15b	repmgrd: fix failover handling in "manual" mode Regression was introduced in commit `c7a585c555`	2018-03-07 19:21:40 +09:00
Ian Barwick	cdb504d700	Add event "repmgrd_shutdown" Implements GitHub #393	2018-03-06 11:00:03 +09:00
Ian Barwick	0af2077bed	repmgrd: add debug log output for "monitor_interval_secs" sleep in all modes	2018-03-06 10:56:21 +09:00
Ian Barwick	bc766a48ed	repmgrd: retry standby connection after cascading standby failover	2018-03-02 11:05:07 +09:00
Ian Barwick	55441f2729	repmgrd: add configuration file parameter "standby_reconnect_timeout" This is used for determining a timeout when reconnecting to the standby after executing the "follow_command". This will normally not need to be set explicitly, but may be useful in cases where the standby's startup phase can last longer than usual.	2018-03-02 11:04:56 +09:00
Ian Barwick	c1356b9e0d	repmgrd: retry standby connection after "follow_command" executed It's possible that the standby is still starting up after the "follow_command" completes, so poll for a while until we get a connection.	2018-03-02 11:04:19 +09:00
Ian Barwick	22b3a74fa0	repmgrd: improve detection of status change from primary to standby If repmgrd is running in degraded mode on a primary which has been stopped, then manually brought back online as a standby (e.g. by creating recovery.conf and starting the server), ensure it not only detects the change but automatically updates the node record so it can resume monitoring the node as a standby. Previously, repmgrd was looping waiting for the record to be updated (as is done transparently when executing "repmgr node rejoin") but if the record was not updated within the timeout period (e.g. by "repmgr standby register") it would fail to resume monitoring as a standby. It seems reasonable to have repmgrd automatically update the node record, as this will restore failover capability as quickly as possible. If this is not desired, then the onus is on the user to shut down repmgrd while making the desired changes.	2018-02-22 15:50:45 +09:00
Ian Barwick	ec068e38a2	Remove --bdr-only configuration option This was required for a specific use case during pre-release development and is no longer needed now the physical streaming replication handling is implemented.	2018-01-25 10:48:09 +09:00
Ian Barwick	e64d965c6a	repmgrd: document standby_[failure|recovery] event notifications Also clean up the relevant code section. Addresses GitHub #359.	2018-01-04 09:33:37 +09:00
Ian Barwick	26a9e848fd	Update copyright notices to 2018	2018-01-02 10:19:46 +09:00
Ian Barwick	8c422d6084	Remove unneeded functions	2017-11-20 15:18:21 +09:00
Ian Barwick	08b443dce0	repmgrd: re-enable monitoring data recording when in archive recovery. The warning emitted gives the impression that monitoring data shouldn't be written if there's no streaming replication, but we can and should do this as long as we have a primary connection. Explicitly document this in the code. Also remove an unused variable warning.	2017-11-16 17:17:17 +09:00
Ian Barwick	9d432546bf	repmgrd: don't fail over unless more than 50% of active nodes are visible.	2017-11-15 13:48:28 +09:00
Ian Barwick	3c557ebd8e	repmgrd: finalize witness failover handling	2017-11-15 13:48:25 +09:00
Ian Barwick	4efeb52cba	repmgrd: synchronise repmgr.nodes table on witness server	2017-11-15 13:48:21 +09:00
Ian Barwick	60422c66f9	repmgrd: handle witness server	2017-11-15 13:48:17 +09:00
Ian Barwick	a31980b590	repmgrd: basic witness node monitoring	2017-11-15 13:48:11 +09:00
Ian Barwick	a6cc4d80f0	Add "witness register" functionality	2017-11-15 13:47:45 +09:00
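
Two of the commits above describe the failover quorum rule: `9d432546bf` ("don't fail over unless more than 50% of active nodes are visible") and `95fe7ea621` (count the local node as a quorum member). The following is a minimal sketch of that rule as stated in the commit messages, not repmgr's actual implementation. The function name `failover_quorum_met` and both parameters are hypothetical, and it assumes that `total_active_nodes` already includes the local node.

```c
#include <stdbool.h>

/*
 * Hypothetical helper illustrating the quorum rule described in the
 * commit messages; this is NOT taken from the repmgr source.
 *
 * visible_siblings:   number of sibling nodes the local node can reach
 * total_active_nodes: number of active nodes, assumed to include the
 *                     local node itself
 */
bool
failover_quorum_met(int visible_siblings, int total_active_nodes)
{
	/* The local node can always "see" itself, so count it as visible. */
	int			visible_nodes = visible_siblings + 1;

	if (total_active_nodes <= 0)
		return false;

	/*
	 * "More than 50%" is a strict majority; integer arithmetic avoids
	 * any floating point rounding at exactly half.
	 */
	return visible_nodes * 2 > total_active_nodes;
}
```

Under these assumptions, a node that can reach one sibling in a three-node set proceeds (2 of 3 visible), while a node that can reach none of its siblings in a two-node set does not (1 of 2 is exactly half, not more).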

91 Commits