repmgr

mirror of https://github.com/EnterpriseDB/repmgr.git synced 2026-03-22 22:56:29 +00:00

Author	SHA1	Message	Date
Ian Barwick	0a2091d5d3	repmgrd: handle new primary notification during failover check It's possible a repmgrd instance might still be in the primary check phase while a primary has already been promoted. Therefore it's necessary to check for new primary notifications here, so we can follow a new primary as quickly as possible.	2020-04-02 15:45:14 +09:00
Ian Barwick	9de31428f1	Consolidate replication connection code In a few places, replication connections are generated from the parameters used by existing connections. This has resulted in a number of similar blocks of code which do more-or-less the same thing almost but not quite identically. In two cases, the code omitted to set "dbname=replication", which can cause problems in some contexts. These code blocks have now been consolidated into standardized functions. This also resolves the issue addressed by GitHub #619.	2020-03-05 17:21:37 +09:00
Ian Barwick	eaee7145f6	repmgrd: improve logging Note node name and type when logging primary node visibility.	2020-02-24 15:33:03 +09:00
Ian Barwick	e782f2d949	repmgrd: improve logging For easier log analysis, state which node is the current primary.	2020-02-20 13:06:41 +09:00
Ian Barwick	7fdf2f1778	Update copyright notices to 2020	2020-01-13 14:06:20 +09:00
Ian Barwick	2304584679	Fix handling of upstream node change check repmgrd has a check to see if the upstream node has unexpectedly changed, e.g. if the repmgrd service is paused and the PostgreSQL instance has been pointed to another node. However this check was relying on the node record on the local node being up-to-date, which may not be the case immediately after a failover, when the node is still replaying records updated prior to the node's own record being updated. In this case it will mistakenly assume the node is following the original primary and attempt to restart monitoring, which will fail as the original primary is no longer available. To prevent this, we check against the node's record on the upstream node. Addresses issue noted in GitHub #587 and #588.	2019-10-14 12:28:04 +09:00
Ian Barwick	931da14df1	Rename some "repmgr daemon ..." commands to "repmgr service ..." "repmgr daemon" can be interpreted to mean the commands affect the local daemon process only. Rename the commands which affect the entire cluster to "repmgr service ...". The "repmgr daemon ..." form of the affected commands is retained for backwards compatibility.	2019-08-28 14:58:11 +09:00
Ian Barwick	3e812f6e91	repmgrd: always emit NOTICE when attempting to follow a new primary Previously, if a standby's repmgrd was looping in degraded monitoring mode looking for a new primary to follow, once a new primary was detected the follow command would be executed without any prior logging at non-DEBUG log levels.	2019-08-26 16:02:41 +09:00
Ian Barwick	75c0987e79	repmgrd: emit node name when reporting follow target attach error This is consistent with other error messages.	2019-08-13 11:02:52 +09:00
Ian Barwick	d893ce227b	repmgrd: optionally exclude/include witness server from child node checks	2019-06-03 16:04:54 +09:00
Ian Barwick	b5ff2ec120	repmgrd: update log text	2019-05-30 16:08:04 +09:00
Ian Barwick	06a83247c9	repmgrd: note node type when logging child node dis/re-connections	2019-05-30 14:06:54 +09:00
Ian Barwick	a6ea1d0fda	repmgrd: fix witness node disconnection monitoring	2019-05-30 11:51:50 +09:00
Ian Barwick	fa66e72c2f	repmgrd: count witness server as child node for connection monitoring purposes As the witness server does not, by definition, ever have an entry in pg_stat_replication, we need to check its "attached" status by connecting to the witness server itself and querying the reported upstream node ID (which should be set by the witness server repmgrd). If this matches the current primary node ID, we count it as attached.	2019-05-21 15:19:41 +09:00
Ian Barwick	02245a0014	repmgrd: add missing PQfinish() calls	2019-05-02 18:50:21 +09:00
Ian Barwick	52905f1eb3	Standardize on "ID: %i" when logging node IDs Previously there was a mix of "id:", "node id:", "node ID:" and "node_id:".	2019-04-30 17:07:33 +09:00
Ian Barwick	87910a5448	repmgrd: improve logging of sibling node's upstream info If the sibling node has already been promoted (for whatever reason, e.g. "repmgr standby promote" was executed manually) and has exited recovery, the upstream node ID will normally be reported as "-1", which is correct, but looks confusing in the logs. We now only report the upstream node ID if the sibling node is still in recovery, or if it has exited recovery but is still reporting an extant node ID.	2019-04-29 13:51:17 +09:00
Ian Barwick	3231b5034d	Remove temporary debugging log output	2019-04-24 13:17:52 +09:00
Ian Barwick	58b33fb411	Clarify a couple of code comments	2019-04-24 10:55:53 +09:00
Ian Barwick	6cbf436bf8	Don't execute "child_nodes_disconnect_command" when repmgrd paused	2019-04-23 14:08:13 +09:00
Ian Barwick	5a90513878	repmgrd: monitor standbys attached to primary This functionality enables repmgrd (when running on the primary) to monitor connected child nodes. It will log connections and disconnections and generate events. Additionally, repmgrd can execute a custom script if the number of connected child nodes falls below a configurable threshold. This script can be used e.g. to "fence" the primary following a failover situation where a new primary has been promoted and all standbys are now child nodes of that primary.	2019-04-22 16:18:52 +09:00
Ian Barwick	a0c6cb602f	repmgrd: remove duplicate function definition	2019-04-16 10:53:05 +09:00
Ian Barwick	27803f93ff	repmgrd: always unset upstream node ID when monitoring a primary	2019-04-12 12:26:39 +09:00
Ian Barwick	46d17d0933	repmgrd: fix log output	2019-04-11 16:29:08 +09:00
Ian Barwick	6b79e08706	repmgrd: add additional log output in do_election()	2019-04-11 15:46:20 +09:00
Ian Barwick	cd6a55c7cb	repmgrd: improve primary visibility consensus check Exclude sibling nodes which report they're following a different node. This shouldn't happen, but could.	2019-04-11 15:46:14 +09:00
Ian Barwick	008bd00a59	repmgrd: store upstream node ID in shared memory	2019-04-11 15:46:09 +09:00
Ian Barwick	5a8741199f	repmgrd: exclude witness server from followability check	2019-04-11 11:19:12 +09:00
Ian Barwick	9164d3931b	repmgrd: clean up PQExpBuffer handling Unless the PQExpBuffer is required for the duration of the function, ensure it's always a variable local to the relevant code block. This mitigates the risk of accidentally accessing a generically named PQExpBuffer which hasn't been initialised or was previously terminated.	2019-03-26 13:15:25 +09:00
Ian Barwick	801ed2b0c8	repmgrd: don't terminate uninitialized PQExpBuffer	2019-03-26 11:35:45 +09:00
Ian Barwick	539861cb58	repmgrd: during failover, check if a node was already promoted Previously, repmgrd assumed that during a failover, there would not already be another primary node. However it's possible a node was promoted manually. While this is not a desirable situation, it's conceivable this could happen in the wild, so we should check for it and react accordingly. Also sanity-check that the follow target can actually be followed. Addresses issue raised in GitHub #420.	2019-03-22 14:06:41 +09:00
Ian Barwick	7434cc0b8e	repmgrd: improve witness node monitoring Mainly fix a couple of places where "standby" was hard-coded into a log message which can apply either to a witness or a standby.	2019-03-20 11:47:36 +09:00
Ian Barwick	46efe57cd0	Improve database connection failure logging Log the output of PQerrorMessage() in a couple of places where it was missing. Additionally, always log the output of PQerrorMessage() starting with a blank line, otherwise the first line looks like it was emitted by repmgr, and it's harder to scan the error message. Before: [2019-03-20 11:24:15] [DETAIL] could not connect to server: Connection refused Is the server running on host "localhost" (::1) and accepting TCP/IP connections on port 5501? could not connect to server: Connection refused Is the server running on host "localhost" (127.0.0.1) and accepting TCP/IP connections on port 5501? After: [2019-03-20 11:27:21] [DETAIL] could not connect to server: Connection refused Is the server running on host "localhost" (::1) and accepting TCP/IP connections on port 5501? could not connect to server: Connection refused Is the server running on host "localhost" (127.0.0.1) and accepting TCP/IP connections on port 5501?	2019-03-20 11:47:28 +09:00
Ian Barwick	426759ca8e	check_primary_status(): handle case where recovery type unknown	2019-03-18 16:16:54 +09:00
Ian Barwick	8ab51c2ae3	Refactor check_primary_status() Reduce nested if/else branching, and improve documentation.	2019-03-18 15:01:21 +09:00
Ian Barwick	43f28f4097	Clarify calls to check_primary_status() Use a constant rather than a magic number to indicate non-provision of elapsed degraded monitoring time.	2019-03-18 14:21:34 +09:00
Ian Barwick	c2206b007a	repmgrd: optionally check upstream availability through connection attempts	2019-03-14 15:44:53 +09:00
Ian Barwick	19bf4d7434	Count witness and zero-priority nodes in visibility check	2019-03-14 11:17:51 +09:00
Ian Barwick	56d9f5b856	Ensure witness node sets last upstream seen time	2019-03-14 10:53:47 +09:00
Ian Barwick	c3c58df7b9	repmgrd: improve logging output when executing "failover_validation_command"	2019-03-13 21:07:26 +09:00
Ian Barwick	573d027db6	repmgrd: various minor logging improvements	2019-03-13 11:27:17 +09:00
Ian Barwick	1afb41647b	repmgrd: remove global variable Make the "sibling_nodes" local, and pass by reference where relevant.	2019-03-12 17:12:23 +09:00
Ian Barwick	fc397f25f6	repmgrd: enable election rerun If "failover_validation_command" is set, and the command returns an error, rerun the election. There is a pause between reruns to avoid "churn"; the length of this pause is controlled by the configuration parameter "election_rerun_interval".	2019-03-12 17:12:19 +09:00
Ian Barwick	4ef706c2ca	Execute "failover_validation_command" when only one standby exists	2019-03-08 12:19:37 +09:00
Ian Barwick	db0d71c6a7	Initial implementation of "failover_validation_command"	2019-03-08 08:49:15 +09:00
Ian Barwick	33fefd9f52	Add configuration option "primary_visibility_consensus" This determines whether repmgrd should continue with a failover if one or more nodes report they can still see the primary.	2019-03-07 10:41:42 +09:00
Ian Barwick	a3f90d2bba	Add configuration option "sibling_nodes_disconnect_timeout" This controls the maximum length of time in seconds that repmgrd will wait for other standbys to disconnect their WAL receivers in a failover situation. This setting is only used when "standby_disconnect_on_failover" is set to "true".	2019-03-06 15:56:21 +09:00
Ian Barwick	2ed044c358	Reset "wal_retrieve_retry_interval" for all nodes	2019-03-06 15:55:03 +09:00
Ian Barwick	9823978f41	repmgrd: don't wait for WAL receiver to reconnect during failover If the WAL receiver has been temporarily disabled, we don't want to wait for it to start up as it may not be able to at that point; we do however need to reset "wal_retrieve_retry_interval".	2019-03-06 15:54:56 +09:00
Ian Barwick	f85b4cd98e	Log warning if "standby_disconnect_on_failover" used on pre-9.5 "standby_disconnect_on_failover" requires availability of "wal_retrieve_retry_interval", which is available from PostgreSQL 9.5. 9.4 will fall out of community support this year, so it doesn't seem productive at this point to do anything more than put the onus on the user to read the documentation and heed any warning messages in the logs.	2019-03-06 15:54:15 +09:00
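
Several of the commits above introduce repmgrd configuration options. As a rough illustration of how they might appear together in repmgr.conf, here is a minimal sketch. The option names are taken from the commit messages; the values and both script paths are assumptions chosen for illustration, not recommended settings, and the scripts themselves are hypothetical.

```
# repmgr.conf (illustrative fragment; values and script paths are assumptions)

# Disconnect the WAL receivers of the remaining standbys during a failover.
# Relies on "wal_retrieve_retry_interval", available from PostgreSQL 9.5.
standby_disconnect_on_failover=true

# Maximum time in seconds to wait for other standbys to disconnect their
# WAL receivers; only used when standby_disconnect_on_failover is true.
sibling_nodes_disconnect_timeout=30

# Take into account whether other nodes report they can still see the
# primary before continuing with a failover.
primary_visibility_consensus=true

# Hypothetical validation script; if it returns an error, the election is
# rerun after a pause of election_rerun_interval seconds.
failover_validation_command='/usr/local/bin/validate-failover.sh'
election_rerun_interval=15

# Hypothetical script run by repmgrd on the primary when the number of
# connected child nodes falls below the configured threshold, e.g. to
# fence the primary. Not executed while repmgrd is paused.
child_nodes_disconnect_command='/usr/local/bin/fence-primary.sh'
```

The repmgr documentation for the installed version remains the authority on defaults and on any further parameters these options depend on.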


172 Commits