Log the output of PQerrorMessage() in a couple of places where it was missing.
Additionally, always prefix the PQerrorMessage() output with a newline;
otherwise the first line looks like it was emitted by repmgr, and the rest
of the error message is harder to scan.
Before:

    [2019-03-20 11:24:15] [DETAIL] could not connect to server: Connection refused
            Is the server running on host "localhost" (::1) and accepting
            TCP/IP connections on port 5501?
    could not connect to server: Connection refused
            Is the server running on host "localhost" (127.0.0.1) and accepting
            TCP/IP connections on port 5501?

After:

    [2019-03-20 11:27:21] [DETAIL]
    could not connect to server: Connection refused
            Is the server running on host "localhost" (::1) and accepting
            TCP/IP connections on port 5501?
    could not connect to server: Connection refused
            Is the server running on host "localhost" (127.0.0.1) and accepting
            TCP/IP connections on port 5501?
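As a rough illustration (not the literal repmgr code; the helper name is made up),
the change amounts to starting the libpq error text on its own line:

    #include <stdio.h>
    #include <libpq-fe.h>

    /*
     * Sketch: emit the (possibly multi-line) libpq error text after the
     * log prefix, starting on a fresh line so none of it appears to have
     * been generated by repmgr itself.
     */
    static void
    log_connection_detail(PGconn *conn)
    {
        /* PQerrorMessage() output already ends with a newline */
        fprintf(stderr, "[DETAIL]\n%s", PQerrorMessage(conn));
    }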
With long-running copy operations, it's possible the connection(s) to
the primary/source server may go away for some reason, so recheck
their availability before attempting to reuse them.
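Schematically, the recheck looks something like this (libpq sketch only;
the helper name is illustrative):

    #include <libpq-fe.h>

    /*
     * Sketch: before reusing a connection cached across a long-running
     * copy operation, verify it is still usable and try to re-establish
     * it if it has gone away.
     */
    static PGconn *
    check_source_connection(PGconn *conn)
    {
        if (PQstatus(conn) != CONNECTION_OK)
            PQreset(conn);

        return (PQstatus(conn) == CONNECTION_OK) ? conn : NULL;
    }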
If replay is paused, we can't rule out that more WAL will be received
between the check and the promote operation, which would risk the promote
operation not taking place during the switchover (it would only happen
once WAL replay is resumed and the pending WAL has been replayed).
Therefore we simply quit with an informative set of messages and
leave the user to sort it out.
GitHub #540.
If WAL replay is paused but WAL is still pending replay, PostgreSQL will ignore
the promote request until WAL replay is unpaused. This may lead to the standby
being promoted at an unpredictable point in time outside of repmgr's
control. Moreover it may not be obvious that this is happening, or why,
and an apparently successful promotion attempt will appear not to have
actually worked.
To prevent this from happening, repmgr will now refuse to promote the
standby if WAL replay is paused *and* WAL is still pending replay.
GitHub #540.
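The check itself boils down to something like the following (libpq sketch
using PostgreSQL 10+ function names; not the literal repmgr code):

    #include <stdbool.h>
    #include <string.h>
    #include <libpq-fe.h>

    /*
     * Sketch: return true if WAL replay is paused while received WAL is
     * still pending replay, i.e. the situation in which a promote request
     * would be silently deferred.
     */
    static bool
    replay_paused_with_pending_wal(PGconn *conn)
    {
        PGresult *res;
        bool      blocked = false;

        res = PQexec(conn,
                     "SELECT pg_is_wal_replay_paused() "
                     "   AND pg_last_wal_replay_lsn() < pg_last_wal_receive_lsn()");

        if (PQresultStatus(res) == PGRES_TUPLES_OK)
            blocked = (strcmp(PQgetvalue(res, 0, 0), "t") == 0);

        PQclear(res);
        return blocked;
    }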
Eventually we'll want this to contain the optional replication info
currently held in the t_node_info struct, which should then contain a
pointer to a ReplInfo struct.
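A hypothetical shape for that direction (struct names and fields are
illustrative only, not the actual definitions):

    /* Hypothetical sketch of the intended layout */
    typedef struct ReplInfo
    {
        int   timeline_id;
        char  last_wal_receive_lsn[64];
        char  last_wal_replay_lsn[64];
    } ReplInfo;

    typedef struct t_node_info
    {
        int       node_id;
        char      node_name[64];
        ReplInfo *replication_info;   /* optional; NULL if not collected */
    } t_node_info;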
Ensure repmgr checks that the standby (promotion candidate) is currently
attached to the primary (demotion candidate).
Addresses issue reported in GitHub #519.
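Conceptually the check is along these lines (libpq sketch; assumes the
standby's application_name matches its repmgr node name):

    #include <stdbool.h>
    #include <stdlib.h>
    #include <libpq-fe.h>

    /*
     * Sketch: on the primary (demotion candidate), confirm the promotion
     * candidate appears in pg_stat_replication, i.e. is currently attached.
     */
    static bool
    standby_is_attached(PGconn *primary_conn, const char *standby_name)
    {
        const char *params[1] = { standby_name };
        PGresult   *res;
        bool        attached = false;

        res = PQexecParams(primary_conn,
                           "SELECT count(*) FROM pg_stat_replication "
                           " WHERE application_name = $1",
                           1, NULL, params, NULL, NULL, 0);

        if (PQresultStatus(res) == PGRES_TUPLES_OK)
            attached = (atoi(PQgetvalue(res, 0, 0)) > 0);

        PQclear(res);
        return attached;
    }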
Immediately after the demotion candidate (primary) has shut down, we can't
be absolutely sure that the walreceiver has flushed all WAL to disk, so
the value reported by pg_last_wal_receive_lsn() at that point might not
reflect the actual last available WAL location.
To handle this, we'll loop for a while (timeout controlled by configuration
parameter "wal_receive_check_timeout") before finally deciding whether
the standby is still behind the shut-down primary.
Addresses issue raised in GitHub #518.
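Roughly, the wait loop looks like this (sketch only; helper name and
behaviour simplified):

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <libpq-fe.h>

    /*
     * Sketch: poll pg_last_wal_receive_lsn() on the standby until the value
     * stops advancing, or "wal_receive_check_timeout" seconds have elapsed,
     * giving the walreceiver time to flush any remaining WAL to disk.
     */
    static void
    wait_for_wal_receive_to_settle(PGconn *standby_conn, int timeout_seconds)
    {
        char last_lsn[64] = "";
        int  i;

        for (i = 0; i < timeout_seconds; i++)
        {
            PGresult *res = PQexec(standby_conn,
                                   "SELECT pg_last_wal_receive_lsn()");

            if (PQresultStatus(res) == PGRES_TUPLES_OK)
            {
                const char *lsn = PQgetvalue(res, 0, 0);

                if (strcmp(lsn, last_lsn) == 0)
                {
                    PQclear(res);
                    break;      /* receive position no longer advancing */
                }
                snprintf(last_lsn, sizeof(last_lsn), "%s", lsn);
            }

            PQclear(res);
            sleep(1);
        }
    }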
The switchover will fail if the data_directory parameter in repmgr.conf
on the remote node (demotion candidate) is incorrectly configured.
We use the previously added "repmgr node check --data-directory-config"
option to verify this, and abort early if an issue is discovered.
Implements GitHub #523.
Previously, if connection URIs were in use and "repmgr standby switchover"
was executed, repmgr would pass the connection URI as-is to the demotion
candidate to execute "repmgr node rejoin". However the presence of
unescaped ampersands in the connection URI was causing the rejoin command
to be incorrectly executed.
Addresses GitHub #525.
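Illustratively (sketch; option spelling and quoting simplified), the
connection string now needs to be shell-quoted when the remote command is
assembled:

    #include <stdio.h>

    /*
     * Sketch: single-quote the connection string when building the remote
     * "repmgr node rejoin" invocation, so characters such as "&" in a
     * connection URI are not interpreted by the remote shell.  (Naive
     * quoting; assumes the string contains no single quotes itself.)
     */
    static void
    build_rejoin_command(char *buf, size_t buflen, const char *conninfo)
    {
        snprintf(buf, buflen, "repmgr node rejoin -d '%s'", conninfo);
    }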
A CHECKPOINT is not always required; hopefully we can narrow it down
to one corner case where we need to determine the minimum recovery
location.
Also get local timeline ID via IDENTIFY_SYSTEM, as fetching it from
pg_control risks returning the prior timeline ID if the timeline
switch has just taken place and no restart point has yet occurred.
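For reference, fetching the timeline over a replication connection looks
roughly like this (sketch; assumes a conninfo containing "replication=1"):

    #include <stdlib.h>
    #include <libpq-fe.h>

    /*
     * Sketch: obtain the current timeline ID via IDENTIFY_SYSTEM rather
     * than from pg_control, which may still report the previous timeline
     * immediately after a timeline switch.
     */
    static int
    get_timeline_via_identify_system(const char *replication_conninfo)
    {
        PGconn *conn = PQconnectdb(replication_conninfo);
        int     tli = -1;

        if (PQstatus(conn) == CONNECTION_OK)
        {
            PGresult *res = PQexec(conn, "IDENTIFY_SYSTEM");

            if (PQresultStatus(res) == PGRES_TUPLES_OK && PQntuples(res) == 1)
                tli = atoi(PQgetvalue(res, 0, 1));  /* second column: timeline */

            PQclear(res);
        }

        PQfinish(conn);
        return tli;
    }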
This commit adds infrastructure for repmgr to be able to check
whether one standby can attach to another node, regardless of whether
it is a standby or a primary.
This is intended to prevent a node from attempting to follow a
node whose timeline has diverged. The --dry-run option makes
it possible to test a follow operation before it is carried out.
As a useful side-effect this makes it possible for a standby to
follow another standby.
This is an initial implementation; documentation and possibly
further changes to follow.
Most of the time we can simply get the version number directly from
the connection handle. Previously it was held in a global variable,
which was an icky way of doing things.
In a few special cases we also need the actual version string, which
is obtained directly from the database.
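In practice this is the difference between PQserverVersion() and an
explicit query, e.g. (sketch):

    #include <stdio.h>
    #include <libpq-fe.h>

    /*
     * Sketch: the numeric version is available directly from the connection
     * handle; only the human-readable string requires a round trip.
     */
    static void
    show_server_version(PGconn *conn)
    {
        PGresult *res;

        printf("version number: %d\n", PQserverVersion(conn));

        res = PQexec(conn, "SHOW server_version");
        if (PQresultStatus(res) == PGRES_TUPLES_OK)
            printf("version string: %s\n", PQgetvalue(res, 0, 0));
        PQclear(res);
    }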
The connections used to check repmgrd status on all nodes were not being
closed if repmgrd was not running. Normally this wouldn't be a huge
problem, as they will go away when repmgr terminates or the PostgreSQL
server is restarted. However, if the shutdown mode is "smart", the open
connection on the demotion candidate will cause the shutdown operation
to fail until repmgr times out.
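The fix is simply to close those connections explicitly once the status
check is complete, along these lines (sketch):

    #include <libpq-fe.h>

    /*
     * Sketch: close the per-node status-check connections so a "smart"
     * shutdown of the demotion candidate is not blocked by idle sessions.
     */
    static void
    close_node_connections(PGconn **conns, int node_count)
    {
        int i;

        for (i = 0; i < node_count; i++)
        {
            if (conns[i] != NULL)
            {
                PQfinish(conns[i]);
                conns[i] = NULL;
            }
        }
    }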
In some circumstances, e.g. while performing a switchover, it is essential
that repmgrd does not take any kind of failover action, as this will put
the cluster into an incorrect state.
Previously it was necessary to stop repmgrd on all nodes (or at least
those nodes which repmgrd would consider as promotion candidates); however,
this is a cumbersome and potentially risk-prone operation, particularly if the
replication cluster contains more than a couple of servers.
To prevent this issue from occurring, this patch introduces the ability
to "pause" repmgrd on all nodes wth a single command ("repmgr daemon pause")
which notifies repmgrd not to take any failover action until the node
is "unpaused" ("repmgr daemon unpause").
"repmgr daemon status" provides an overview of each node and whether repmgrd
is running, and if so whether it is paused.
"repmgr standby switchover" has been modified to automatically pause repmgrd
while carrying out the switchover.
See documentation for further details.
Previously, "repmgr standby switchover" used the configuration file parameters
"reconnect_interval" and "reconnect_attempts" to define a timeout to determine
whether the current primary (demotion candidate) has shut down.
However, these parameters are intended for primary failure detection and are
generally set to low values, whereas a controlled shutdown may take longer,
resulting in the switchover being aborted because repmgr did not wait long enough.
To prevent this happening, parameter "shutdown_check_timeout" has been added.
This complements the existing "standby_reconnect_timeout" parameter used
by "repmgr standby switchover".
Implements GitHub #504.
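For illustration, a repmgr.conf fragment might now distinguish the two
cases like this (values are illustrative, not defaults):

    # failure detection used by repmgrd
    reconnect_attempts=6
    reconnect_interval=10

    # "repmgr standby switchover" timeouts
    shutdown_check_timeout=60         # wait for the demotion candidate to shut down
    standby_reconnect_timeout=60      # wait for the standby to reconnect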
On the rejoined node, if a replication slot for the new upstream exists
(which is typically the case after a failover), delete that slot.
Also emit a warning about any inactive replication slots which may need
to be cleaned up manually.
GitHub #499.
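The slot removal amounts to something like the following (libpq sketch;
the slot name is passed in by the caller):

    #include <stdbool.h>
    #include <libpq-fe.h>

    /*
     * Sketch: drop a leftover, inactive replication slot for the new
     * upstream on the rejoined node.
     */
    static bool
    drop_stale_replication_slot(PGconn *conn, const char *slot_name)
    {
        const char *params[1] = { slot_name };
        PGresult   *res;
        bool        ok;

        res = PQexecParams(conn,
                           "SELECT pg_drop_replication_slot(slot_name) "
                           "  FROM pg_replication_slots "
                           " WHERE slot_name = $1 AND active IS FALSE",
                           1, NULL, params, NULL, NULL, 0);

        ok = (PQresultStatus(res) == PGRES_TUPLES_OK);
        PQclear(res);
        return ok;
    }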
Previously repmgr would first check that a replication connection can be made
from the demotion candidate to the promotion candidate; however, it's
preferable to sanity-check the number of available walsenders first,
to provide a more useful error message.
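The walsender sanity check is roughly as follows (sketch; attached
walsenders are counted via pg_stat_replication):

    #include <stdlib.h>
    #include <libpq-fe.h>

    /*
     * Sketch: number of free walsender slots on the demotion candidate,
     * i.e. max_wal_senders minus the walsenders currently in use.
     */
    static int
    free_walsenders(PGconn *conn)
    {
        PGresult *res;
        int       available = -1;

        res = PQexec(conn,
                     "SELECT current_setting('max_wal_senders')::int "
                     "     - (SELECT count(*) FROM pg_stat_replication)");

        if (PQresultStatus(res) == PGRES_TUPLES_OK)
            available = atoi(PQgetvalue(res, 0, 0));

        PQclear(res);
        return available;
    }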
Avoid copying files during a --dry-run as it may introduce unexpected changes
on the target node. During an actual clone operation, any problems with
copying files will be detected early and the operation aborted before
the actual database cloning commences.
GitHub #491.
If cloning from a node other than the intended upstream, and
replication slots are in use, once the cloning process is complete,
repmgr will attempt to connect to the intended upstream to create
the replication slot.
Previously it would abort with a connection error, but as this issue
is not fatal to the cloning process itself, and in some situations may
be intentional, it's better to log a warning and continue.
We should probably collate this (and any similar items needing
attention after the cloning operation) into a list to be output at the end,
otherwise the warning may get overlooked.