repmgr

mirror of https://github.com/EnterpriseDB/repmgr.git synced 2026-03-22 22:56:29 +00:00

Author	SHA1	Message	Date
Ian Barwick	bb56387aaa	repmgrd: consolidate connection closing code PQfinish() should only be called on local PGconn pointers which will not be reused.	2020-05-12 14:48:39 +09:00
Ian Barwick	5d00094936	repmgrd: ensure "close_connection()" always called after connection failure	2020-05-12 14:41:33 +09:00
Ian Barwick	ebdfdc530d	repmgrd: ensure PQfinish() always executed on failed connections in NodeInfoLists clear_node_info_list() will clean up any remaining active connections, but we need to ensure all failed connections are cleaned up at the point of failure to prevent leaks. Per report in GitHub #643.	2020-05-12 14:22:08 +09:00
Ian Barwick	e5d3285d02	repmgrd: remove redundant log message	2020-05-11 16:59:32 +09:00
Ian Barwick	fd52df0fab	repmgrd: include node name in log output in more places Still a few places where only the node ID was reported, but it's always useful to have the node name as well.	2020-05-11 16:55:31 +09:00
Ian Barwick	bcc284cac9	Refactor configuration file reload handling Rather than parse the configuration file into a new structure and copy changed values from that into the main structure, we'll copy the existing structure before parsing the changed configuration file directly into the main structure, and revert using the copy if any issues are encountered. This is necessary as preparation for further reworking of the configuration file structure handling. It also makes the reload idempotent. While we're at it, make some general improvements to the reload handling, particularly: - improve logging to show "before" and "after" values - collate change notifications and only display if no errors were found - remove unnecessary double-logging of errors - various bugfixes	2020-05-05 15:29:07 +09:00
Ian Barwick	3ca642fee1	repmgrd: log receipt of SIGHUP at log level NOTICE PostgreSQL itself logs it at log level LOG, which we don't have, but NOTICE seems reasonable, especially as we log SIGTERM as that.	2020-05-05 13:41:23 +09:00
Ian Barwick	8adcb1348d	repmgrd: improve logging of promote_command failure - log failure before we check if the primary has reappeared - log the error code	2020-04-21 15:02:15 +09:00
Ian Barwick	780453e168	repmgrd: clarify log messages Display the identity of the node in question in the messages fixed in commit 8a27c89; this makes it easier to diagnose log output.	2020-04-03 13:02:49 +09:00
Tom Janson	8a27c89d18	repmgrd: fix inverted log message Warning is emitted when the node in question is in recovery.	2020-04-03 12:39:36 +09:00
Ian Barwick	0a2091d5d3	repmgrd: handle new primary notification during failover check It's possible a repmgrd instance might still be in the primary check phase while a primary has already been promoted. Therefore it's necessary to check for new primary notifications here, so we can follow a new primary as quickly as possible.	2020-04-02 15:45:14 +09:00
Ian Barwick	9de31428f1	Consolidate replication connection code In a few places, replication connections are generated from the parameters used by existing connections. This has resulted in a number of similar blocks of code which do more-or-less the same thing almost but not quite identically. In two cases, the code omitted to set "dbname=replication", which can cause problems in some contexts. These code blocks have now been consolidated into standardized functions. This also resolves the issue addressed by GitHub #619.	2020-03-05 17:21:37 +09:00
Ian Barwick	eaee7145f6	repmgrd: improve logging Note node name and type when logging primary node visibility.	2020-02-24 15:33:03 +09:00
Ian Barwick	e782f2d949	repmgrd: improve logging For easier log analysis, state which node is the current primary.	2020-02-20 13:06:41 +09:00
Ian Barwick	7fdf2f1778	Update copyright notices to 2020	2020-01-13 14:06:20 +09:00
Ian Barwick	2304584679	Fix handling of upstream node change check repmgrd has a check to see if the upstream node has unexpectedly changed, e.g. if the repmgrd service is paused and the PostgreSQL instance has been pointed to another node. However this check was relying on the node record on the local node being up-to-date, which may not be the case immediately after a failover, when the node is still replaying records updated prior to the node's own record being updated. In this case it will mistakenly assume the node is following the original primary and attempt to restart monitoring, which will fail as the original primary is no longer available. To prevent this, we check against the node's record on the upstream node. Addresses issue noted in GitHub #587 and #588.	2019-10-14 12:28:04 +09:00
Ian Barwick	931da14df1	Rename some "repmgr daemon ..." commands to "repmgr service ..." "repmgr daemon" can be interpreted to mean the commands affect the local daemon process only. Rename the commands which affect the entire cluster to "repmgr service ...". The "repmgr daemon ..." form of the affected commands is retained for backwards compatibility.	2019-08-28 14:58:11 +09:00
Ian Barwick	3e812f6e91	repmgrd: always emit NOTICE when attempting to follow a new primary Previously, if a standby's repmgrd was looping in degraded monitoring mode looking for a new primary to follow, once a new primary was detected the follow command would be executed without any prior logging at non-DEBUG log levels.	2019-08-26 16:02:41 +09:00
Ian Barwick	75c0987e79	repmgrd: emit node name when reporting follow target attach error This is consistent with other error messages.	2019-08-13 11:02:52 +09:00
Ian Barwick	d893ce227b	repmgrd: optionally exclude/include witness server from child node checks	2019-06-03 16:04:54 +09:00
Ian Barwick	b5ff2ec120	repmgrd: update log text	2019-05-30 16:08:04 +09:00
Ian Barwick	06a83247c9	repmgrd: note node type when logging child node dis/re-connections	2019-05-30 14:06:54 +09:00
Ian Barwick	a6ea1d0fda	repmgrd: fix witness node disconnection monitoring	2019-05-30 11:51:50 +09:00
Ian Barwick	fa66e72c2f	repmgrd: count witness server as child node for connection monitoring purposes As the witness server does not, by definition, ever have an entry in pg_stat_replication, we need to check its "attached" status by connecting to the witness server itself and querying the reported upstream node ID (which should be set by the witness server repmgrd). If this matches the current primary node ID, we count it as attached.	2019-05-21 15:19:41 +09:00
Ian Barwick	02245a0014	repmgrd: add missing PQfinish() calls	2019-05-02 18:50:21 +09:00
Ian Barwick	52905f1eb3	Standardize on "ID: %i" when logging node IDs Previously there was a mix of "id:", "node id:", "node ID:" and "node_id:".	2019-04-30 17:07:33 +09:00
Ian Barwick	87910a5448	repmgrd: improve logging of sibling node's upstream info If the sibling node has already been promoted (for whatever reason, e.g. "repmgr standby promote" was executed manually) and has exited recovery, the upstream node ID will normally be reported as "-1", which is correct, but looks confusing in the logs. We now only report the upstream node ID if the sibling node is still in recovery, or if it has exited recovery but is still reporting an extant node ID.	2019-04-29 13:51:17 +09:00
Ian Barwick	3231b5034d	Remove temporary debugging log output	2019-04-24 13:17:52 +09:00
Ian Barwick	58b33fb411	Clarify a couple of code comments	2019-04-24 10:55:53 +09:00
Ian Barwick	6cbf436bf8	Don't execute "child_nodes_disconnect_command" when repmgrd paused	2019-04-23 14:08:13 +09:00
Ian Barwick	5a90513878	repmgrd: monitor standbys attached to primary This functionality enables repmgrd (when running on the primary) to monitor connected child nodes. It will log connections and disconnections and generate events. Additionally, repmgrd can execute a custom script if the number of connected child nodes falls below a configurable threshold. This script can be used e.g. to "fence" the primary following a failover situation where a new primary has been promoted and all standbys are now child nodes of that primary.	2019-04-22 16:18:52 +09:00
Ian Barwick	a0c6cb602f	repmgrd: remove duplicate function definition	2019-04-16 10:53:05 +09:00
Ian Barwick	27803f93ff	repmgrd: always unset upstream node ID when monitoring a primary	2019-04-12 12:26:39 +09:00
Ian Barwick	46d17d0933	repmgrd: fix log output	2019-04-11 16:29:08 +09:00
Ian Barwick	6b79e08706	repmgrd: add additional log output in do_election()	2019-04-11 15:46:20 +09:00
Ian Barwick	cd6a55c7cb	repmgrd: improve primary visibility consensus check Exclude sibling nodes which report they're following a different node. This shouldn't happen, but could.	2019-04-11 15:46:14 +09:00
Ian Barwick	008bd00a59	repmgrd: store upstream node ID in shared memory	2019-04-11 15:46:09 +09:00
Ian Barwick	5a8741199f	repmgrd: exclude witness server from followability check	2019-04-11 11:19:12 +09:00
Ian Barwick	9164d3931b	repmgrd: clean up PQExpBuffer handling Unless the PQExpBuffer is required for the duration of the function, ensure it's always a variable local to the relevant code block. This mitigates the risk of accidentally accessing a generically named PQExpBuffer which hasn't been initialised or was previously terminated.	2019-03-26 13:15:25 +09:00
Ian Barwick	801ed2b0c8	repmgrd: don't terminate uninitialized PQExpBuffer	2019-03-26 11:35:45 +09:00
Ian Barwick	539861cb58	repmgrd: during failover, check if a node was already promoted Previously, repmgrd assumed that during a failover, there would not already be another primary node. However it's possible a node was promoted manually. While this is not a desirable situation, it's conceivable this could happen in the wild, so we should check for it and react accordingly. Also sanity-check that the follow target can actually be followed. Addresses issue raised in GitHub #420.	2019-03-22 14:06:41 +09:00
Ian Barwick	7434cc0b8e	repmgrd: improve witness node monitoring Mainly fix a couple of places where "standby" was hard-coded into a log message which can apply either to a witness or a standby.	2019-03-20 11:47:36 +09:00
Ian Barwick	46efe57cd0	Improve database connection failure logging Log the output of PQerrorStatus() in a couple of places where it was missing. Additionally, always log the output of PQerrorStatus() starting with a blank line, otherwise the first line looks like it was emitted by repmgr, and it's harder to scan the error message. Before: [2019-03-20 11:24:15] [DETAIL] could not connect to server: Connection refused Is the server running on host "localhost" (::1) and accepting TCP/IP connections on port 5501? could not connect to server: Connection refused Is the server running on host "localhost" (127.0.0.1) and accepting TCP/IP connections on port 5501? After: [2019-03-20 11:27:21] [DETAIL] could not connect to server: Connection refused Is the server running on host "localhost" (::1) and accepting TCP/IP connections on port 5501? could not connect to server: Connection refused Is the server running on host "localhost" (127.0.0.1) and accepting TCP/IP connections on port 5501?	2019-03-20 11:47:28 +09:00
Ian Barwick	426759ca8e	check_primary_status(): handle case where recovery type unknown	2019-03-18 16:16:54 +09:00
Ian Barwick	8ab51c2ae3	Refactor check_primary_status() Reduce nested if/else branching, and improve documentation.	2019-03-18 15:01:21 +09:00
Ian Barwick	43f28f4097	Clarify calls to check_primary_status() Use a constant rather than a magic number to indicate non-provision of elapsed degraded monitoring time.	2019-03-18 14:21:34 +09:00
Ian Barwick	c2206b007a	repmgrd: optionally check upstream availability through connection attempts	2019-03-14 15:44:53 +09:00
Ian Barwick	19bf4d7434	Count witness and zero-priority nodes in visibility check	2019-03-14 11:17:51 +09:00
Ian Barwick	56d9f5b856	Ensure witness node sets last upstream seen time	2019-03-14 10:53:47 +09:00
Ian Barwick	c3c58df7b9	repmgrd: improve logging output when executing "failover_validate_command"	2019-03-13 21:07:26 +09:00

182 Commits