repmgr

mirror of https://github.com/EnterpriseDB/repmgr.git synced 2026-03-23 15:16:29 +00:00

Author	SHA1	Message	Date
Ian Barwick	59b7453bbf	repmgrd: optionally disconnect WAL receivers during failover This is intended to ensure that all nodes have a constant LSN while making the failover decision. This feature is experimental and needs to be explicitly enabled with the configuration file option "standby_disconnect_on_failover". Note enabling this option will result in a delay in the failover decision until the WAL receiver is disconnected on all nodes.	2019-03-08 15:27:54 +09:00
Ian Barwick	bc6584a90d	*_transaction() functions: log error message text as DETAIL Per behaviour elsewhere.	2019-03-06 13:23:57 +09:00
Ian Barwick	074d79b44f	repmgrd: add option "connection_check_type" This enables selection of the method repmgrd uses to check whether the upstream node is available. Possible values are: - "ping" (default): uses PQping() to check server availability - "connection": executes a query on the connection to check server availability (similar to repmgr3.x).	2019-03-06 13:23:53 +09:00
Ian Barwick	19bcfa7264	Rename "..._primary_last_seen" functions to "..._upstream_last_seen" As that better reflects what they do.	2019-03-06 13:23:33 +09:00
Ian Barwick	9753bcc8c3	repmgrd: during failover, check if other nodes have seen the primary In a situation where only some standbys are cut off from the primary, a failover would result in a split brain/split cluster situation, as it's likely one of the cut-off standbys will promote itself, and other cut-off standbys (but not all standbys) will follow it. To prevent this happening, interrogate the other sibling nodes to check whether they've seen the primary within a reasonably short interval; if this is the case, do not take any failover action. This feature is experimental.	2019-03-06 13:23:22 +09:00
Ian Barwick	39234afcbf	standby clone: check upstream connections after data copy operation With long-running copy operations, it's possible the connection(s) to the primary/source server may go away for some reason, so recheck their availability before attempting to reuse.	2019-02-26 14:37:51 +09:00
Ian Barwick	c30e65b3f2	Add some missing query error logging	2019-02-25 13:02:45 +09:00
Ian Barwick	07097575b1	daemon status: add column "upstream last seen" This displays the interval (in seconds) since the repmgrd instance on each node last confirmed its upstream node is available.	2019-02-23 13:03:16 +09:00
Ian Barwick	71d151ca87	Don't check status of logical replication slots We only want to check the status of physical replication slots to determine whether a streaming replication standby has become detached and there is therefore a risk of uncontrolled WAL buildup on the local node. It's not feasible to second-guess the state of logical replication slots.	2019-02-23 10:09:43 +09:00
Ian Barwick	de70fd42dc	node check: simplify output generation in --is-shutdown-cleanly check	2019-02-22 10:49:06 +09:00
Ian Barwick	85a97c933f	Handle unhandled NodeStatus in switch statement	2019-02-15 19:31:06 +09:00
Ian Barwick	9305953bd2	Fix history file parsing Also add additional debugging output.	2019-02-14 15:52:40 +09:00
Ian Barwick	25019d1cc5	Refactor is_wal_replay_paused() query Make sure it doesn't emit an error if executed on a node not in recovery. The caller should theoretically only execute it on nodes in recovery, but there are sure to be corner cases where the node has come out of recovery.	2019-02-12 10:21:05 +09:00
Ian Barwick	f0a0be0248	Remove pointless default allocation in _get_node_record()	2019-02-07 11:41:08 +09:00
Ian Barwick	c7b325e2a4	Add function resume_wal_replay()	2019-02-07 11:33:02 +09:00
Ian Barwick	b89941f218	Store WAL replay pause status in ReplInfo struct	2019-02-07 10:24:42 +09:00
Ian Barwick	2b3b1faa20	refactor query in function get_replication_info() In particular handle all cases where one of the functions called in the query can return NULL in the query itself.	2019-02-06 15:40:27 +09:00
Ian Barwick	984ce7420b	"daemon status": emit warning if WAL replay is paused Specifically, if WAL replay is paused and WAL is pending replay, this node cannot be promoted until WAL replay is unpaused. In this state it is not a suitable promotion candidate in a failover situation.	2019-02-06 13:32:20 +09:00
Ian Barwick	cd3312496e	Rename functions which return an LSN for clarity	2019-02-06 09:32:53 +09:00
Ian Barwick	f62b3b2868	Fix Pg10+ function names	2019-02-05 13:37:35 +09:00
Ian Barwick	701944c194	"standby promote": add check for WAL replay status if replay is paused If WAL replay is paused but WAL is still pending replay, PostgreSQL will ignore the promote request until WAL replay is unpaused. This may lead to the standby being promoted at an unpredictable point in time outside of repmgr's control. Moreover it may not be obvious that this is happening, or why, and it will appear that an apparently successful promotion attempt has not actually worked. To prevent this from happening, repmgr will now refuse to promote the standby if WAL replay is paused and WAL is still pending replay. GitHub #540.	2019-02-05 13:30:37 +09:00
Ian Barwick	92c73b68a0	Clean up dbutils.c Put functions into the same "section" as noted in the header file.	2019-02-05 09:36:54 +09:00
Ian Barwick	f9a1861ded	Refactor ReplInfo struct handling Eventually we'll want to have this contain the optional replication info contained in the t_node_info struct, which should then contain a pointer to a ReplInfo struct.	2019-02-02 18:39:24 +09:00
Ian Barwick	20b79f998c	Define some previously magic numbers	2019-02-01 19:14:16 +09:00
Ian Barwick	bdb4f66a9d	Add an Assert() to detect attempted array overflow in param_set...() functions Previously the code would do nothing if an attempt was made to add parameters if the array is already full. As the array is designed to contain all valid libpq connection parameters, there's no reason it should ever "overflow" like this. If there is, then it means the caller is attempting to add invalid values. Add an Assert() so we can easily detect this in the unlikely event it ever occurs. Noted after examining the issue raised in GitHub #533, which is nonsensical as it implies we'd be OK with writing beyond the end of the array, however it doesn't hurt to make it a bit clearer what is happening and why.	2019-01-31 14:11:00 +09:00
Ian Barwick	32b81e7d49	"daemon start": initial implementation	2019-01-29 13:01:14 +09:00
Ian Barwick	a48d408e4e	Consistently log strerror output as DETAIL	2019-01-29 12:10:55 +09:00
Ian Barwick	1980deb480	repmgrd: check for a change to the upstream node If the upstream node has changed, for example after "repmgr standby follow" was manually executed, restart monitoring to ensure repmgrd is monitoring the correct node.	2019-01-22 13:33:13 +09:00
Ian Barwick	7dce3ed234	Update copyright notices to 2019	2019-01-21 14:54:35 +09:00
Ian Barwick	d4e993a240	Improve handling of connection URIs when executing remote commands Previously, if connection URIs were in use and "repmgr standby switchover" was executed, repmgr would pass the connection URI as-is to the demotion candidate to execute "repmgr node rejoin". However the presence of unescaped ampersands in the connection URI was causing the rejoin command to be incorrectly executed. Addresses GitHub #525.	2019-01-14 11:11:51 +09:00
Ian Barwick	40408a1734	repmgrd: check binary and extension major versions match repmgr requires that the same "major version" (e.g. 4.3) is present on all nodes, otherwise - particularly in the case of repmgrd - it's highly likely things won't work as expected. Implements part of GitHub #515.	2019-01-07 15:39:40 +09:00
Ian Barwick	313aa3c5d7	Refactor follow verification to reduce need for CHECKPOINT A CHECKPOINT is not always required; hopefully we can narrow it down to one corner case where we need to determine the minimum recovery location. Also get local timeline ID via IDENTIFY_SYSTEM, as fetching it from pg_control risks returning the prior timeline ID if the timeline switch has just taken place and no restart point has yet occurred.	2018-12-04 15:27:22 +09:00
Ian Barwick	c53782cda3	Fix typo in query	2018-11-29 15:24:49 +09:00
Ian Barwick	66b40ffc68	Simplify function create_replication_slot() Following the changes in `793d83b`, it's no longer necessary to pass the server version number.	2018-11-29 14:35:01 +09:00
Ian Barwick	a6a2be2239	Teach witness repmgrd to deal with the absence of a primary Previously it would refuse to start if the primary was not reachable, the thinking being that it's pointless trying to monitor an incomplete cluster. However following an aborted failover situation, repmgrd will restart monitoring and on the witness server, this will lead to it aborting itself due to the continuing absence of a primary. To resolve this, witness repmgrd will now start monitoring in degraded mode if no primary is found in the hope a primary will reappear at some point.	2018-11-29 12:15:41 +09:00
Ian Barwick	bdcc4d9e83	Check correct result status in ...primary_last_seen() functions	2018-11-29 11:08:28 +09:00
Ian Barwick	793d83b22c	Refactor server version detection Most of the time we can simply get the version number directly from the connection handle. Previously it was held in a global variable, which was an icky way of doing things. In a few special cases we also need the actual version string, which is obtained directly from the database.	2018-11-22 21:30:31 +09:00
Ian Barwick	0f4e04e61e	Add function get_current_lsn() This is a somewhat convoluted attempt to retrieve the current LSN of any node, regardless of whether in recovery or not, and if in recovery, independent of whether streaming or recovering from archive.	2018-11-22 19:31:49 +09:00
Ian Barwick	80a280cbf4	Add function get_timeline_history() This will be required for verifying whether one node is able to follow another node.	2018-11-22 15:26:50 +09:00
Ian Barwick	784c9c4793	repmgrd: return predictable default values for get_primary_last_seen() Return 0 if the node is not in recovery. In which case it's probably rather pointless calling this function anyway. Return -1 if the "last_seen" field has never been set (i.e. repmgrd hasn't started yet).	2018-11-21 11:30:32 +09:00
Ian Barwick	0caec90d81	repmgrd: set primary last seen	2018-11-21 11:30:27 +09:00
Ian Barwick	c3bc5585d9	Add sanity check for extension version This should cover the cases where the "repmgr" extension was installed manually but not updated, or an upgrade was not fully completed.	2018-10-31 11:16:36 +09:00
Ian Barwick	c336e384ab	Support "pg_promote()" function (PostgreSQL 12 and later) This is an experimental feature.	2018-10-26 11:02:45 +09:00
Ian Barwick	a459c60145	Avoid defining variable-length arrays As of PostgreSQL commit d9dd406f, variable length arrays are no longer permitted. As they're not actually required anyway, just define appropriate constants. Also noted in GitHub #510.	2018-10-26 10:09:45 +09:00
Ian Barwick	3e38759c02	use appendPQExpBufferStr/-Char() consistently	2018-10-04 08:42:42 +09:00
Ian Barwick	2491b8ae52	Add functionality to "pause" repmgrd In some circumstances, e.g. while performing a switchover, it is essential that repmgrd does not take any kind of failover action, as this will put the cluster into an incorrect state. Previously it was necessary to stop repmgrd on all nodes (or at least those nodes which repmgrd would consider as promotion candidates), however this is a cumbersome and potentially risk-prone operation, particularly if the replication cluster contains more than a couple of servers. To prevent this issue from occurring, this patch introduces the ability to "pause" repmgrd on all nodes with a single command ("repmgr daemon pause") which notifies repmgrd not to take any failover action until the node is "unpaused" ("repmgr daemon unpause"). "repmgr daemon status" provides an overview of each node and whether repmgrd is running, and if so whether it is paused. "repmgr standby switchover" has been modified to automatically pause repmgrd while carrying out the switchover. See documentation for further details.	2018-09-27 16:42:10 +09:00
Ian Barwick	688337dec3	repmgr: add "--node-id" option to "cluster cleanup" Implements GitHub #493.	2018-09-25 15:56:40 +09:00
Ian Barwick	b0a2ee2259	get_all_node_records(): display any error encountered and return success status In many cases we'll want to bail out with an error if the node list can't be retrieved for any reason. This saves some repetitive coding.	2018-09-13 10:14:43 +09:00
Ian Barwick	17e75f6b31	repmgrd: improve reconnection handling Previously, if the server being monitored was not available, repmgrd would always close the existing connection handle and open a new one. However, in some cases, e.g. a brief network outage, the existing connection handle is still good and does not need to be reopened. This could be particularly problematic if monitoring_history is on, as this risks leaving orphan sessions on the primary which (given a sufficiently unstable network) could lead to all available backends being occupied. Instead, during an outage we now use a new connection to verify the server is accessible; if the old connection is still available (e.g. following a short network interruption) we continue using that; if not (e.g. the server was restarted), we use the new one.	2018-08-30 15:46:08 +09:00
Ian Barwick	ceeb6d7130	repmgrd: improve monitoring statistics logging Add more granular logging to help diagnose issues, and also keep track of when the last monitoring statistics update was set and emit that as DETAIL every time we emit a log status update.	2018-08-30 12:36:59 +09:00
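
Commit 9753bcc8c3 in the list above describes a failover guard: before acting, repmgrd asks its sibling nodes whether any of them has seen the primary recently, and takes no failover action if one has. The following is only a minimal sketch of that decision, not repmgr's actual code. The function names, the threshold constant, and the array-of-seconds input are all hypothetical; the only detail taken from the log is that a negative "last seen" value means "never set" (per commit 784c9c4793).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical threshold in seconds; not an actual repmgr setting. */
#define PRIMARY_SEEN_THRESHOLD_SECS 10

/*
 * Returns true if at least one sibling reports having seen the primary
 * within the threshold. Each array element is the number of seconds since
 * that sibling last saw the primary; negative values mean "never set"
 * and are ignored.
 */
bool
sibling_saw_primary_recently(const int *last_seen_secs, size_t n_siblings)
{
    assert(last_seen_secs != NULL || n_siblings == 0);

    for (size_t i = 0; i < n_siblings; i++)
    {
        int secs = last_seen_secs[i];

        if (secs >= 0 && secs <= PRIMARY_SEEN_THRESHOLD_SECS)
            return true;
    }

    return false;
}

/*
 * Failover proceeds only if no sibling has seen the primary recently;
 * otherwise the local node is probably the one that is cut off.
 */
bool
should_proceed_with_failover(const int *last_seen_secs, size_t n_siblings)
{
    return !sibling_saw_primary_recently(last_seen_secs, n_siblings);
}
```

With sibling reports of `{-1, 3, 120}` the sketch declines to fail over, because one sibling saw the primary 3 seconds ago; with `{-1, 45, 120}` it proceeds.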


245 Commits