Commit 79d1f00 modified repmgrd to automatically set an inactive node
to "active" at startup.
However, we need to avoid doing that in cases where the node role has
changed (e.g. a former primary was recloned as a standby) but the node
record was not updated.
If a PostgreSQL instance was shut down while repmgrd was running, and
repmgrd was subsequently restarted (this chain of events could occur
during e.g. a server reboot), the node record will have been set to
"inactive". Previously, in this case repmgrd would refuse to start up.
However, as we can determine that the node is running, it is normally
safe to set the node record to "active" automatically.
The old behaviour can be restored by setting the new parameter
"repmgrd_exit_on_inactive_node" to "true".
RM19604.
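For reference, the old behaviour can be restored with a repmgr.conf
setting along these lines (value shown is the non-default one):

    # repmgr.conf: exit at startup if the node record is "inactive",
    # rather than setting it to "active" automatically
    repmgrd_exit_on_inactive_node=true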
In a corner-case situation where a standby is unable to attach to
the new primary due to a mismatch in the WAL stream, the connection
used to verify the recovery status of the new primary was not being
closed, leading to a risk of connection exhaustion on the new primary.
Addresses GitHub #682.
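A minimal sketch of the fix pattern (function name and surrounding
logic are illustrative, not repmgr's actual code): the connection
used for the recovery status check is now closed on every exit path.

    #include <string.h>
    #include <stdbool.h>
    #include <libpq-fe.h>

    static bool
    new_primary_is_in_recovery(const char *conninfo)
    {
        PGconn   *conn = PQconnectdb(conninfo);
        PGresult *res;
        bool      in_recovery = true;

        if (PQstatus(conn) == CONNECTION_OK)
        {
            res = PQexec(conn, "SELECT pg_catalog.pg_is_in_recovery()");

            if (PQresultStatus(res) == PGRES_TUPLES_OK)
                in_recovery = (strcmp(PQgetvalue(res, 0, 0), "t") == 0);

            PQclear(res);
        }

        /*
         * Close the connection unconditionally before returning;
         * previously the corner-case exit path returned without
         * PQfinish(), leaking one connection per retry on the
         * new primary.
         */
        PQfinish(conn);

        return in_recovery;
    }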
Per proposal in GitHub #662, this patch attempts to synchronise each
repmgrd's primary reconnection attempts to prevent potential race
conditions. This relies on each node's clock being correctly
synchronised.
Currently this change is experimental and is not enabled by default.
It can be enabled by setting the repmgr.conf parameter
"reconnect_loop_sync".
In theory the local connection should not be affected by the node's
promotion. However, we're handing over control to an external command
which is usually just "repmgr standby promote", but could potentially
be a user-defined script with unknowable side effects. So it's
better to be safe than sorry.
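A minimal sketch of the resulting check, using libpq's PQstatus()
and PQreset() (illustrative only):

    #include <libpq-fe.h>

    /*
     * After the promotion command returns, re-check the local
     * connection in case the external command had side effects,
     * and re-establish it with the original parameters if needed.
     */
    static void
    ensure_local_connection(PGconn *local_conn)
    {
        if (PQstatus(local_conn) != CONNECTION_OK)
            PQreset(local_conn);
    }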
In certain corner cases, it's possible repmgrd may end up monitoring
a standby which was a former primary, but the node record has not
yet been updated.
Previously repmgrd would abort the promotion with a cryptic message
about being unable to find a node record for node_id -1 (the
default value for an unknown node ID).
This commit adds a new configuration option "always_promote", which
determines whether repmgrd should promote the node in this case.
The default is "false", to effectively maintain the existing behaviour.
Logging output has also been improved to make it clearer what has
happened when this situation occurs.
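For example, to allow promotion in this situation, the new parameter
would be set in repmgr.conf as follows:

    # repmgr.conf: promote the node even if its node record has
    # not yet been updated (default: false)
    always_promote=true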
New optional parameters are available for use in the
failover_validation_command. These indicate:
- the number of visible nodes sharing the current upstream
- the number of nodes on the current upstream
- the total number of nodes in the entire repmgr cluster.
This allows the failover_validation_command to be used to perform
more thorough validations, including cross-referencing external
cluster management state (e.g. if managed by Kubernetes).
GitHub #651.
If the primary connection went away, and the upstream is not the
primary, attempt to reconnect if the monitoring update fails.
If the upstream is the primary, the reconnection will happen on
the next connection check.
It's possible the upstream server was intermittently unavailable in
the interval between checks, invalidating the upstream connection.
With check types "ping" and "connection", the connection would not be
restored, so if the availability check was successful, additionally
verify the upstream connection and restore it if necessary.
Addresses GitHub #633.
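A minimal sketch of the additional verification, using libpq's
PQping() for the availability check (names illustrative):

    #include <libpq-fe.h>

    /*
     * Even when the availability check succeeds, the cached
     * connection may have been invalidated by an intermittent
     * outage between checks; verify it and re-establish if needed.
     */
    static PGconn *
    verify_upstream_connection(PGconn *upstream_conn, const char *conninfo)
    {
        if (PQping(conninfo) == PQPING_OK &&
            PQstatus(upstream_conn) != CONNECTION_OK)
        {
            PQfinish(upstream_conn);
            upstream_conn = PQconnectdb(conninfo);
        }

        return upstream_conn;
    }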
clear_node_info_list() will clean up any remaining active connections,
but we need to ensure all failed connections are cleaned up at the point
of failure to prevent leaks.
Per report in GitHub #643.
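A minimal sketch of the cleanup pattern (the node list type here is
hypothetical; repmgr's actual node_info_list differs):

    #include <stddef.h>
    #include <libpq-fe.h>

    typedef struct NodeInfo
    {
        const char      *conninfo;
        PGconn          *conn;
        struct NodeInfo *next;
    } NodeInfo;

    static void
    connect_all_nodes(NodeInfo *head)
    {
        NodeInfo *node;

        for (node = head; node != NULL; node = node->next)
        {
            node->conn = PQconnectdb(node->conninfo);

            if (PQstatus(node->conn) != CONNECTION_OK)
            {
                /*
                 * Even a failed PQconnectdb() returns an allocated
                 * PGconn object which must be freed with PQfinish();
                 * clean it up at the point of failure rather than
                 * relying on clear_node_info_list().
                 */
                PQfinish(node->conn);
                node->conn = NULL;
            }
        }
    }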
Rather than parsing the configuration file into a new structure and
copying changed values from that into the main structure, we now copy
the existing structure before parsing the changed configuration
file directly into the main structure, and revert to the copy
if any issues are encountered.
This is necessary as preparation for further reworking of the
configuration file structure handling. It also makes the reload
idempotent.
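A minimal sketch of the copy-and-revert pattern (types and function
names are hypothetical, not repmgr's actual configuration handling):

    #include <stdbool.h>

    typedef struct t_config
    {
        int reconnect_interval;
        /* ... further options ... */
    } t_config;

    extern t_config config_file_options;
    extern bool parse_config_file(t_config *options);

    static bool
    reload_config(void)
    {
        /* take a copy of the current, known-good configuration */
        t_config prev = config_file_options;

        /* parse the changed file directly into the main structure... */
        if (!parse_config_file(&config_file_options))
        {
            /*
             * ...and revert wholesale if any issues were encountered,
             * which makes the reload idempotent.
             */
            config_file_options = prev;
            return false;
        }

        return true;
    }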
While we're at it, make some general improvements to the reload
handling, particularly:
- improve logging to show "before" and "after" values
- collate change notifications and only display them if no errors
  were found
- remove unnecessary double-logging of errors
- various bugfixes
It's possible a repmgrd instance might still be in the primary check
phase while a primary has already been promoted. Therefore it's
necessary to check for new primary notifications here, so we can
follow a new primary as quickly as possible.
In a few places, replication connections are generated from the
parameters used by existing connections. This has resulted in a
number of similar blocks of code which do more or less the same
thing, but not quite identically. In two cases, the code
failed to set "dbname=replication", which can cause problems
in some contexts.
These code blocks have now been consolidated into standardized
functions.
This also resolves the issue addressed by GitHub #619.
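A minimal sketch of such a standardized function, deriving a
replication connection from an existing connection's parameters and
always setting "dbname=replication" (illustrative only; repmgr's
actual implementation differs):

    #include <string.h>
    #include <libpq-fe.h>

    static PGconn *
    establish_replication_connection(PGconn *conn)
    {
        PQconninfoOption *opts = PQconninfo(conn);
        PQconninfoOption *opt;
        const char *keywords[64];   /* assumes < 62 options, for brevity */
        const char *values[64];
        int         i = 0;
        PGconn     *repl_conn;

        /* copy all set options, deferring "dbname" */
        for (opt = opts; opt->keyword != NULL; opt++)
        {
            if (opt->val == NULL || opt->val[0] == '\0' ||
                strcmp(opt->keyword, "dbname") == 0)
                continue;

            keywords[i] = opt->keyword;
            values[i] = opt->val;
            i++;
        }

        /* always set "dbname=replication" */
        keywords[i] = "dbname";
        values[i] = "replication";
        i++;

        keywords[i] = NULL;
        values[i] = NULL;

        repl_conn = PQconnectdbParams(keywords, values, 0);

        PQconninfoFree(opts);

        return repl_conn;
    }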
repmgrd has a check to see if the upstream node has unexpectedly
changed, e.g. if the repmgrd service is paused and the PostgreSQL
instance has been pointed to another node.
However, this check relied on the node record on the local node
being up-to-date, which may not be the case immediately after a
failover, when the node may still be replaying changes made before
its own node record was updated. In this case repmgrd will
mistakenly assume the node is following the original primary
and attempt to restart monitoring, which will fail as the original
primary is no longer available.
To prevent this, we check against the node's record on the upstream
node.
Addresses issues noted in GitHub #587 and #588.
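A minimal sketch of the upstream-side check (query shape based on
the repmgr.nodes table; names illustrative):

    #include <stdio.h>
    #include <stdlib.h>
    #include <libpq-fe.h>

    /*
     * Ask the *upstream* node which upstream it has recorded for us,
     * rather than trusting the possibly stale local node record.
     * Returns -1 (unknown node ID) if the record cannot be retrieved.
     */
    static int
    get_upstream_node_id_via_upstream(PGconn *upstream_conn, int local_node_id)
    {
        const char *params[1];
        char        node_id_str[12];
        int         upstream_node_id = -1;
        PGresult   *res;

        snprintf(node_id_str, sizeof(node_id_str), "%d", local_node_id);
        params[0] = node_id_str;

        res = PQexecParams(upstream_conn,
                           "SELECT upstream_node_id FROM repmgr.nodes"
                           " WHERE node_id = $1",
                           1, NULL, params, NULL, NULL, 0);

        if (PQresultStatus(res) == PGRES_TUPLES_OK &&
            PQntuples(res) == 1 &&
            !PQgetisnull(res, 0, 0))
            upstream_node_id = atoi(PQgetvalue(res, 0, 0));

        PQclear(res);

        return upstream_node_id;
    }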
"repmgr daemon" can be interpreted to mean the commands affect the local
daemon process only. Rename the commands which affect the entire cluster
to "repmgr service ...".
The "repmgr daemon ..." form of the affected commands is retained for backwards
compatibility.
Previously, if a standby's repmgrd was looping in degraded monitoring
mode looking for a new primary to follow, once a new primary was
detected the follow command would be executed without any prior
logging at non-DEBUG log levels.
As the witness server does not, by definition, ever have an entry in
pg_stat_replication, we need to check its "attached" status by
connecting to the witness server itself and querying the reported
upstream node ID (which should be set by the witness server's
repmgrd). If this matches the current primary node ID, we count
the witness as attached.
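A minimal sketch of the witness check; the
repmgr.get_upstream_node_id() call is an assumption made for
illustration, as the exact mechanism by which the witness's repmgrd
reports its upstream may differ:

    #include <stdbool.h>
    #include <stdlib.h>
    #include <libpq-fe.h>

    static bool
    witness_is_attached(PGconn *witness_conn, int primary_node_id)
    {
        bool      attached = false;
        PGresult *res = PQexec(witness_conn,
                               "SELECT repmgr.get_upstream_node_id()");

        /* count the witness as attached if its reported upstream
         * matches the current primary node ID */
        if (PQresultStatus(res) == PGRES_TUPLES_OK &&
            PQntuples(res) == 1 &&
            !PQgetisnull(res, 0, 0))
            attached = (atoi(PQgetvalue(res, 0, 0)) == primary_node_id);

        PQclear(res);

        return attached;
    }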