repmgr

mirror of https://github.com/EnterpriseDB/repmgr.git synced 2026-03-22 22:56:29 +00:00

Author	SHA1	Message	Date
Ian Barwick	c560dfbbce	cluster show: display timeline ID This helps provide a better picture of the state of the cluster, i.e. making it more obvious whether there's been a timeline divergence. This also provides infrastructure for further improvements in cluster status display and diagnosis. Note this is only available in PostgreSQL 9.6 and later as it relies on the SQL functions for interrogating pg_control, which can be executed remotely. As PostgreSQL 9.5 will shortly be the only community-supported version without these functions, it's not worth the effort of trying to duplicate their functionality.	2019-05-27 09:39:19 +09:00
Ian Barwick	c9e85996f5	repmgr: prevent a standby being cloned from a witness server Previously repmgr would happily clone from whatever server it found at the provided source server address. We should ensure that a standby can only be cloned from a node which is part of the main replication cluster. This check fetches a list of nodes from the source server, connects to the first non-witness server it finds, and compares the system identifiers of the source node and the node it has connected to. If there is a mismatch, then the source server is clearly not part of the main replication cluster, and is most likely the witness server.	2019-05-22 16:52:25 +09:00
Ian Barwick	dd78a16006	Change return type of is_downstream_node_attached() from bool to NodeAttached This enables us to better determine whether a node is definitively attached, definitively not attached, or if it was not possible to determine the attached state.	2019-05-14 15:57:20 +09:00
Ian Barwick	89a7261483	Always quote node names in log messages	2019-04-30 15:52:56 +09:00
Ian Barwick	9fe2fa2daf	daemon status: make output more like that of "cluster show" In particular make any issues with unexpected server state more obvious.	2019-04-25 14:45:41 +09:00
Ian Barwick	5a90513878	repmgrd: monitor standbys attached to primary This functionality enables repmgrd (when running on the primary) to monitor connected child nodes. It will log connections and disconnections and generate events. Additionally, repmgrd can execute a custom script if the number of connected child nodes falls below a configurable threshold. This script can be used e.g. to "fence" the primary following a failover situation where a new primary has been promoted and all standbys are now child nodes of that primary.	2019-04-22 16:18:52 +09:00
Ian Barwick	27803f93ff	repmgrd: always unset upstream node ID when monitoring a primary	2019-04-12 12:26:39 +09:00
Ian Barwick	cd6a55c7cb	repmgrd: improve primary visibility consensus check Exclude sibling nodes which report they're following a different node. This shouldn't happen, but could.	2019-04-11 15:46:14 +09:00
Ian Barwick	008bd00a59	repmgrd: store upstream node ID in shared memory	2019-04-11 15:46:09 +09:00
Ian Barwick	dd454a8374	Miscellaneous string handling cleanup This is mainly to prevent effectively spurious truncation warnings in recent GCC versions.	2019-04-10 16:18:56 +09:00
Ian Barwick	a564f365c1	Fix default return value in alter_system_int()	2019-04-01 14:50:19 +09:00
Ian Barwick	799ac6d453	Add is_server_available_quiet() For use in cases where the caller collates node availability information and doesn't want to prematurely emit log output.	2019-04-01 12:27:30 +09:00
Ian Barwick	57c0ccd477	Improve copying of strings from database results Where feasible, specify the maximum string length via sizeof(), and use snprintf() in place of strncpy().	2019-04-01 11:19:58 +09:00
Ian Barwick	ece20f4831	Cast "int" to "long long"	2019-03-28 11:02:25 +09:00
Ian Barwick	ba1f05ece9	Restrict "node_name" to maximum 63 characters In "recovery.conf", the configuration parameter "node_name" is used as the "application_name" value, which will be truncated by PostgreSQL to 63 characters (NAMEDATALEN - 1). repmgr sometimes needs to be able to extract the application name from pg_stat_replication to determine if a node is connected (e.g. when executing "repmgr standby register"), so the comparison will fail if "node_name" exceeds 63 characters.	2019-03-28 10:37:57 +09:00
Ian Barwick	e9ece34aeb	log_db_error(): fix formatted message handling	2019-03-27 11:00:31 +09:00
Ian Barwick	539861cb58	repmgrd: during failover, check if a node was already promoted Previously, repmgrd assumed that during a failover, there would not already be another primary node. However it's possible a node was promoted manually. While this is not a desirable situation, it's conceivable this could happen in the wild, so we should check for it and react accordingly. Also sanity-check that the follow target can actually be followed. Addresses issue raised in GitHub #420.	2019-03-22 14:06:41 +09:00
Ian Barwick	314a1e8f4f	use a constant to denote unknown replication lag	2019-03-20 17:26:04 +09:00
Ian Barwick	b84d98fe81	Explicitly log PQping() failures	2019-03-20 11:47:32 +09:00
Ian Barwick	46efe57cd0	Improve database connection failure logging Log the output of PQerrorStatus() in a couple of places where it was missing. Additionally, always log the output of PQerrorStatus() starting with a blank line, otherwise the first line looks like it was emitted by repmgr, and it's harder to scan the error message. Before: [2019-03-20 11:24:15] [DETAIL] could not connect to server: Connection refused Is the server running on host "localhost" (::1) and accepting TCP/IP connections on port 5501? could not connect to server: Connection refused Is the server running on host "localhost" (127.0.0.1) and accepting TCP/IP connections on port 5501? After: [2019-03-20 11:27:21] [DETAIL] could not connect to server: Connection refused Is the server running on host "localhost" (::1) and accepting TCP/IP connections on port 5501? could not connect to server: Connection refused Is the server running on host "localhost" (127.0.0.1) and accepting TCP/IP connections on port 5501?	2019-03-20 11:47:28 +09:00
Ian Barwick	39df55c39c	Check node recovery type before attempting to write an event record In some corner cases (e.g. immediately after a switchover) where the current primary has not yet been determined, the provided connection might not be writeable. This prevents error messages such as "cannot execute INSERT in a read-only transaction" generating unnecessary noise in the logs.	2019-03-18 15:26:16 +09:00
Ian Barwick	f54ff85cfa	Remove outdated comment This was only relevant for repmgr3 and earlier; in repmgr4 the schema is hard-coded.	2019-03-18 15:19:11 +09:00
Ian Barwick	19bf4d7434	Count witness and zero-priority nodes in visibility check	2019-03-14 11:17:51 +09:00
Ian Barwick	56d9f5b856	Ensure witness node sets last upstream seen time	2019-03-14 10:53:47 +09:00
Ian Barwick	1615353f48	repmgrd: optionally disconnect WAL receivers during failover This is intended to ensure that all nodes have a constant LSN while making the failover decision. This feature is experimental and needs to be explicitly enabled with the configuration file option "standby_disconnect_on_failover". Note enabling this option will result in a delay in the failover decision until the WAL receiver is disconnected on all nodes.	2019-03-06 15:53:57 +09:00
Ian Barwick	b4dcda37a1	*_transaction() functions: log error message text as DETAIL Per behaviour elsewhere.	2019-03-06 12:12:47 +09:00
Ian Barwick	63f7ad546e	repmgrd: add option "connection_check_type" This enables selection of the method repmgrd uses to check whether the upstream node is available. Possible values are: - "ping" (default): uses PQping() to check server availability - "connection": executes a query on the connection to check server availability (similar to repmgr3.x).	2019-03-06 12:09:54 +09:00
Ian Barwick	4b89cbd98d	Rename "..._primary_last_seen" functions to "..._upstream_last_seen" As that better reflects what they do.	2019-02-28 15:36:55 +09:00
Ian Barwick	0578053875	standby clone: check upstream connections after data copy operation With long-running copy operations, it's possible the connection(s) to the primary/source server may go away for some reason, so recheck their availability before attempting to reuse.	2019-02-26 14:37:05 +09:00
Ian Barwick	ea36609159	Add some missing query error logging	2019-02-23 16:54:07 +09:00
Ian Barwick	b72c894db4	repmgrd: during failover, check if other nodes have seen the primary In a situation where only some standbys are cut off from the primary, a failover would result in a split brain/split cluster situation, as it's likely one of the cut-off standbys will promote itself, and other cut-off standbys (but not all standbys) will follow it. To prevent this happening, interrogate the other sibling nodes to check whether they've seen the primary within a reasonably short interval; if this is the case, do not take any failover action. This feature is experimental.	2019-02-23 13:03:22 +09:00
Ian Barwick	07097575b1	daemon status: add column "upstream last seen" This displays the interval (in seconds) since the repmgrd instance on each node last confirmed its upstream node is available.	2019-02-23 13:03:16 +09:00
Ian Barwick	71d151ca87	Don't check status of logical replication slots We only want to check the status of physical replication slots to determine whether a streaming replication standby has become detached and there is therefore a risk of uncontrolled WAL buildup on the local node. It's not feasible to second-guess the state of logical replication slots.	2019-02-23 10:09:43 +09:00
Ian Barwick	de70fd42dc	node check: simplify output generation in --is-shutdown-cleanly check	2019-02-22 10:49:06 +09:00
Ian Barwick	85a97c933f	Handle unhandled NodeStatus in switch statement	2019-02-15 19:31:06 +09:00
Ian Barwick	9305953bd2	Fix history file parsing Also add additional debugging output.	2019-02-14 15:52:40 +09:00
Ian Barwick	25019d1cc5	Refactor is_wal_replay_paused() query Make sure it doesn't emit an error if executed on a node not in recovery. The caller should theoretically only execute it on nodes in recovery, but there are sure to be corner cases where the node has come out of recovery.	2019-02-12 10:21:05 +09:00
Ian Barwick	f0a0be0248	Remove pointless default allocation in _get_node_record()	2019-02-07 11:41:08 +09:00
Ian Barwick	c7b325e2a4	Add function resume_wal_replay()	2019-02-07 11:33:02 +09:00
Ian Barwick	b89941f218	Store WAL replay pause status in ReplInfo struct	2019-02-07 10:24:42 +09:00
Ian Barwick	2b3b1faa20	refactor query in function get_replication_info() In particular handle all cases where one of the functions called in the query can return NULL in the query itself.	2019-02-06 15:40:27 +09:00
Ian Barwick	984ce7420b	"daemon status": emit warning if WAL replay is paused Specifically, if WAL replay is paused and WAL is pending replay, this node cannot be promoted until WAL replay is unpaused. In this state it is not a suitable promotion candidate in a failover situation.	2019-02-06 13:32:20 +09:00
Ian Barwick	cd3312496e	Rename functions which return an LSN for clarity	2019-02-06 09:32:53 +09:00
Ian Barwick	f62b3b2868	Fix Pg10+ function names	2019-02-05 13:37:35 +09:00
Ian Barwick	701944c194	"standby promote": add check for WAL replay status if replay is paused If WAL replay is paused but WAL is still pending replay, PostgreSQL will ignore the promote request until WAL replay is unpaused. This may lead to the standby being promoted at an unpredictable point in time outside of repmgr's control. Moreover it may not be obvious that this is happening, or why, and it will appear that an apparently successful promotion attempt has not actually worked. To prevent this from happening, repmgr will now refuse to promote the standby if WAL replay is paused and WAL is still pending replay. GitHub #540.	2019-02-05 13:30:37 +09:00
Ian Barwick	92c73b68a0	Clean up dbutils.c Put functions into the same "section" as noted in the header file.	2019-02-05 09:36:54 +09:00
Ian Barwick	f9a1861ded	Refactor ReplInfo struct handling Eventually we'll want to have this contain the optional replication info contained in the t_node_info struct, which should then contain a pointer to a ReplInfo struct.	2019-02-02 18:39:24 +09:00
Ian Barwick	20b79f998c	Define some previously magic numbers	2019-02-01 19:14:16 +09:00
Ian Barwick	bdb4f66a9d	Add an Assert() to detect attempted array overflow in param_set...() functions Previously the code would do nothing if an attempt was made to add parameters if the array is already full. As the array is designed to contain all valid libpq connection parameters, there's no reason it should ever "overflow" like this. If there is, then it means the caller is attempting to add invalid values. Add an Assert() so we can easily detect this in the unlikely event it ever occurs. Noted after examining the issue raised in GitHub #533, which is nonsensical as it implies we'd be OK with writing beyond the end of the array, however it doesn't hurt to make it a bit clearer what is happening and why.	2019-01-31 14:11:00 +09:00
Ian Barwick	32b81e7d49	"daemon start": initial implementation	2019-01-29 13:01:14 +09:00
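
Commits 57c0ccd477 and ba1f05ece9 above describe two related string-handling rules: copy database result strings with snprintf() bounded by sizeof() of the destination, and reject any "node_name" longer than 63 characters (NAMEDATALEN - 1). The following is a minimal standalone sketch of both ideas, not repmgr's actual code. The names node_record, node_name_is_valid() and copy_node_name() are hypothetical, and NAMEDATALEN is redefined locally with PostgreSQL's default value of 64 so the example compiles without the PostgreSQL headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* PostgreSQL's default NAMEDATALEN; redefined here so the sketch is standalone. */
#define NAMEDATALEN 64
#define MAX_NODE_NAME_LEN (NAMEDATALEN - 1)

/* Hypothetical record type for illustration; not repmgr's real node struct. */
typedef struct
{
	char		node_name[NAMEDATALEN];
} node_record;

/*
 * Return true if "name" fits in an application_name without being
 * truncated by PostgreSQL, i.e. it is at most NAMEDATALEN - 1 bytes long.
 */
static bool
node_name_is_valid(const char *name)
{
	assert(name != NULL);
	return strlen(name) <= MAX_NODE_NAME_LEN;
}

/*
 * Copy "value" into the record using snprintf() bounded by sizeof() of the
 * destination. Unlike strncpy(), snprintf() always NUL-terminates, and its
 * return value tells us whether the source had to be truncated.
 * Returns true only if the whole string was copied.
 */
static bool
copy_node_name(node_record *rec, const char *value)
{
	int			written;

	assert(rec != NULL && value != NULL);

	written = snprintf(rec->node_name, sizeof(rec->node_name), "%s", value);

	return written >= 0 && (size_t) written < sizeof(rec->node_name);
}
```

A caller would typically run node_name_is_valid() when reading the configuration file and treat a false return from copy_node_name() as a sign that a later comparison against pg_stat_replication could fail.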

269 Commits
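
Commit dd78a16006 above replaces a bool return value with a three-way result, so callers can tell "definitely attached", "definitely not attached" and "could not determine" apart. The sketch below illustrates that pattern under stated assumptions. The enum AttachState, its members, and the functions check_attached() and attach_state_label() are hypothetical names chosen for this example and are not taken from the repmgr source. The array of names stands in for the application_name values a real implementation would read from pg_stat_replication, with NULL representing a failed query.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical tri-state result; repmgr's actual NodeAttached type may differ. */
typedef enum
{
	ATTACH_UNKNOWN = -1,	/* the check itself could not be performed */
	ATTACH_NO,				/* check succeeded, node not found */
	ATTACH_YES				/* check succeeded, node found */
} AttachState;

/*
 * "names" stands in for the application_name values reported by the
 * upstream node; NULL means the lookup failed (e.g. no connection).
 */
static AttachState
check_attached(const char *const *names, size_t count, const char *node_name)
{
	assert(node_name != NULL);

	/* With a plain bool, this case would be indistinguishable from "not attached". */
	if (names == NULL)
		return ATTACH_UNKNOWN;

	for (size_t i = 0; i < count; i++)
	{
		if (strcmp(names[i], node_name) == 0)
			return ATTACH_YES;
	}

	return ATTACH_NO;
}

/* Map each state to a label suitable for log or status output. */
static const char *
attach_state_label(AttachState state)
{
	switch (state)
	{
		case ATTACH_YES:
			return "attached";
		case ATTACH_NO:
			return "not attached";
		case ATTACH_UNKNOWN:
			break;
	}

	return "unknown";
}
```

The design point is that a caller making a failover or registration decision can retry or warn on the unknown state instead of acting as if the node were detached.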