Compare commits

...

65 Commits

Author SHA1 Message Date
Ian Barwick
3ec43eda36 doc: remove references to "primary_visibility_consensus"
Feature remains experimental.
2019-03-18 17:43:16 +09:00
Ian Barwick
ce8e1cccc4 Remove outdated comment
This was only relevant for repmgr3 and earlier; in repmgr4 the schema
is hard-coded.
2019-03-18 15:19:25 +09:00
Ian Barwick
70bfa4c8e1 Clarify calls to check_primary_status()
Use a constant rather than a magic number to indicate non-provision
of elapsed degraded monitoring time.
2019-03-18 14:21:41 +09:00
Ian Barwick
f0d5ad503d doc: clarify "cluster show" error codes 2019-03-18 10:50:05 +09:00
John Naylor
b9ee57ee0f Fix assorted Makefile bugs
1. The target additional-maintainer-clean was misspelled as
maintainer-additional-clean.

2. Add missing clean targets, in particular sysutils.o, config.h,
repmgr_version.h, and Makefile.global. While at it, use a wildcard
for obj files.

3. Don't delete configure.

4. Remove generated file doc/version.sgml from the repo.

5. Have maintainer-clean recurse to the doc directory.
2019-03-15 16:30:27 +09:00
Ian Barwick
d5d6ed4be7 Bump version
4.3rc1
2019-03-15 14:41:41 +09:00
Ian Barwick
f4655074ae doc: miscellaneous cleanup 2019-03-15 14:39:55 +09:00
Ian Barwick
67d26ab7e2 doc: tweak wording in event notification documentation 2019-03-15 14:08:18 +09:00
Ian Barwick
70a7b45a03 doc: add explanation of the configuration file format 2019-03-15 14:07:19 +09:00
Ian Barwick
4251590833 doc: update "connection_check_type" descriptions 2019-03-15 14:07:13 +09:00
Ian Barwick
9347d34ce0 repmgrd: optionally check upstream availability through connection attempts 2019-03-15 14:07:08 +09:00
John Naylor
feb90ee50c Correct some doc typos 2019-03-15 14:07:05 +09:00
Ian Barwick
0a6486bb7f doc: expand "standby_disconnect_on_failover" documentation 2019-03-15 14:07:01 +09:00
Ian Barwick
39443bbcee Count witness and zero-priority nodes in visibility check 2019-03-15 14:06:58 +09:00
Ian Barwick
fc636b1bd2 Ensure witness node sets last upstream seen time 2019-03-15 14:06:55 +09:00
Ian Barwick
048bad1c88 doc: fix option name typo 2019-03-15 14:06:51 +09:00
Ian Barwick
4528eb1796 doc: expand "failover_validate_command" documentation 2019-03-15 14:06:37 +09:00
Ian Barwick
169c9ccd32 repmgrd: improve logging output when executing "failover_validate_command" 2019-03-15 14:06:34 +09:00
Ian Barwick
5f92fbddf2 doc: various updates 2019-03-15 14:06:30 +09:00
Ian Barwick
617e466f72 doc: merge repmgrd witness server description into failover section 2019-03-13 16:19:41 +09:00
Ian Barwick
435fac297b doc: merge repmgrd split network handling description into failover section 2019-03-13 16:19:37 +09:00
Ian Barwick
4bc12b4c94 doc: merge repmgrd monitoring description into operating section 2019-03-13 16:19:33 +09:00
Ian Barwick
91234994e2 doc: merge repmgrd degraded monitoring description into operation section 2019-03-13 16:19:30 +09:00
Ian Barwick
ee9da30f20 doc: merge repmgrd notes into operation documentation 2019-03-13 16:19:27 +09:00
Ian Barwick
2e67bc1341 doc: merge repmgrd pause documentation into overview 2019-03-13 16:19:24 +09:00
Ian Barwick
18ab5cab4e doc: initial repmgrd doc refactoring 2019-03-13 16:19:20 +09:00
Ian Barwick
60bb4e9fc8 doc: update repmgrd configuration documentation 2019-03-13 16:19:17 +09:00
Ian Barwick
52bee6b98d repmgrd: various minor logging improvements 2019-03-13 16:19:13 +09:00
Ian Barwick
ecb1f379f5 repmgrd: remove global variable
Make the "sibling_nodes" local, and pass by reference where relevant.
2019-03-13 16:19:10 +09:00
Ian Barwick
e1cd2c22d4 repmgrd: enable election rerun
If "failover_validation_command" is set, and the command returns an error,
rerun the election.

There is a pause between reruns to avoid "churn"; the length of this pause
is controlled by the configuration parameter "election_rerun_interval".
2019-03-13 16:19:03 +09:00
Ian Barwick
1dea6b76d9 Remove redundant struct allocation 2019-03-13 16:19:00 +09:00
Ian Barwick
702f90fc9d doc: update list of reloadable repmgrd configuration options 2019-03-13 16:18:56 +09:00
Ian Barwick
c4d1eec6f3 doc: document "failover_validation_command" 2019-03-13 16:18:53 +09:00
Ian Barwick
b241c606c0 doc: expand repmgrd configuration section 2019-03-13 16:18:50 +09:00
Ian Barwick
45c896d716 Execute "failover_validation_command" when only one standby exists 2019-03-08 15:29:17 +09:00
Ian Barwick
514595ea10 Make "failover_validation_command" reloadable 2019-03-08 15:29:12 +09:00
Ian Barwick
531194fa27 Initial implementation of "failover_validation_command" 2019-03-08 15:29:06 +09:00
Ian Barwick
2aa67c992c Make recently added configuration options reloadable 2019-03-08 15:28:59 +09:00
Ian Barwick
37892afcfc Add configuration option "primary_visibility_consensus"
This determines whether repmgrd should continue with a failover if
one or more nodes report they can still see the primary.
2019-03-08 15:28:53 +09:00
Ian Barwick
e4e5e35552 Add configuration option "sibling_nodes_disconnect_timeout"
This controls the maximum length of time in seconds that repmgrd will
wait for other standbys to disconnect their WAL receivers in a failover
situation.

This setting is only used when "standby_disconnect_on_failover" is set to "true".
2019-03-08 15:28:48 +09:00
Ian Barwick
b320c1f0ae Reset "wal_retrieve_retry_interval" for all nodes 2019-03-08 15:28:42 +09:00
Ian Barwick
280654bed6 repmgrd: don't wait for WAL receiver to reconnect during failover
If the WAL receiver has been temporarily disabled, we don't want to
wait for it to start up as it may not be able to at that point; we do
however need to reset "wal_retrieve_retry_interval".
2019-03-08 15:28:27 +09:00
Ian Barwick
ae675059c0 Improve logging/sanity checking for "node control" options 2019-03-08 15:28:22 +09:00
Ian Barwick
454ebabe89 Improve logging when disabling/enabling WAL receiver
Also check that the action is being run on a node which is in recovery.
2019-03-08 15:28:17 +09:00
Ian Barwick
d1d6ef8d12 Check for WAL receiver start up 2019-03-08 15:28:11 +09:00
Ian Barwick
5d6eab74f6 Log warning if "standby_disconnect_on_failover" used on pre-9.5
"standby_disconnect_on_failover" requires "wal_retrieve_retry_interval",
which is only available from PostgreSQL 9.5 and later.

9.4 will fall out of community support this year, so it doesn't seem
productive at this point to do anything more than put the onus on the user
to read the documentation and heed any warning messages in the logs.
2019-03-08 15:28:01 +09:00
Ian Barwick
59b7453bbf repmgrd: optionally disconnect WAL receivers during failover
This is intended to ensure that all nodes have a constant LSN while
making the failover decision.

This feature is experimental and needs to be explicitly enabled with the
configuration file option "standby_disconnect_on_failover".

Note enabling this option will result in a delay in the failover decision
until the WAL receiver is disconnected on all nodes.
2019-03-08 15:27:54 +09:00
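For reference, the failover-related options introduced across these commits could appear in repmgr.conf along these lines (values are illustrative examples, not recommendations; defaults may differ):

```ini
standby_disconnect_on_failover = true    # experimental; disconnect WAL receivers during the failover decision
sibling_nodes_disconnect_timeout = 30    # max seconds to wait for other standbys' WAL receivers to disconnect
primary_visibility_consensus = true      # experimental; cancel failover if other nodes still see the primary
failover_validation_command = '/path/to/validate-failover.sh'
election_rerun_interval = 15             # pause in seconds before rerunning a failed election
connection_check_type = 'ping'           # 'ping' (PQping) or 'connection'
```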
Ian Barwick
bde8c7e29c repmgrd: handle reconnect to restarted server when using "connection" checks 2019-03-08 15:27:49 +09:00
Ian Barwick
bc6584a90d *_transaction() functions: log error message text as DETAIL
Per behaviour elsewhere.
2019-03-06 13:23:57 +09:00
Ian Barwick
074d79b44f repmgrd: add option "connection_check_type"
This enables selection of the method repmgrd uses to check whether the upstream
node is available. Possible values are:

 - "ping" (default): uses PQping() to check server availability
 - "connection":  executes a query on the connection to check server
   availability (similar to repmgr3.x).
2019-03-06 13:23:53 +09:00
Ian Barwick
2eeb288573 repmgrd: ignore invalid "upstream_last_seen" value 2019-03-06 13:23:47 +09:00
Ian Barwick
48a2274b11 Use appendPQExpBufferStr where appropriate 2019-03-06 13:23:38 +09:00
Ian Barwick
19bcfa7264 Rename "..._primary_last_seen" functions to "..._upstream_last_seen"
As that better reflects what they do.
2019-03-06 13:23:33 +09:00
Ian Barwick
486877c3d5 repmgrd: log details of nodes which can see primary
If a failover is cancelled because other nodes can still see the primary,
log the identities of those nodes.
2019-03-06 13:23:27 +09:00
Ian Barwick
9753bcc8c3 repmgrd: during failover, check if other nodes have seen the primary
In a situation where only some standbys are cut off from the primary,
a failover would result in a split brain/split cluster situation,
as it's likely one of the cut-off standbys will promote itself, and
other cut-off standbys (but not all standbys) will follow it.

To prevent this happening, interrogate the other sibling nodes to
check whether they've seen the primary within a reasonably short interval;
if this is the case, do not take any failover action.

This feature is experimental.
2019-03-06 13:23:22 +09:00
Ian Barwick
bd35b450da daemon status: with csv output, show repmgrd status as unknown where appropriate
Previously, if PostgreSQL was not running on the node, repmgrd and
pause status were shown as "0", implying their status was known.

This brings the csv output in line with the human-readable output,
which displays "n/a" in this case.
2019-02-28 12:28:04 +09:00
Ian Barwick
1f256d4d73 doc: update release notes 2019-02-28 10:02:05 +09:00
Ian Barwick
1524e2449f Split command execution functions into separate library
These may need to be executed by repmgrd.
2019-02-27 14:41:38 +09:00
Ian Barwick
0cd2bd2e91 repmgrd: add additional logging during a failover operation 2019-02-27 11:45:34 +09:00
Ian Barwick
98b78df16c Remove unneeded debugging output 2019-02-26 21:17:17 +09:00
Ian Barwick
b946dce2f0 doc: update introductory blurb 2019-02-26 15:19:41 +09:00
Ian Barwick
39234afcbf standby clone: check upstream connections after data copy operation
With long-running copy operations, it's possible the connection(s) to
the primary/source server may go away for some reason, so recheck
their availability before attempting to reuse.
2019-02-26 14:37:51 +09:00
John Naylor
23569a19b1 Doc fix: PostgreSQL 9.4 is no longer considered recent 2019-02-25 13:02:56 +09:00
John Naylor
c650fd3412 Fix typo 2019-02-25 13:02:51 +09:00
Ian Barwick
c30e65b3f2 Add some missing query error logging 2019-02-25 13:02:45 +09:00
48 changed files with 2786 additions and 1082 deletions

HISTORY

@@ -12,27 +12,15 @@
     data directory on the demotion candidate; GitHub #523 (Ian)
     repmgr: ensure "standby switchover" verifies replication connection
       exists; GitHub #519 (Ian)
-    repmgr: ensure "primary unregister" behaves correctly when executed
-      on a witness server; GitHub #548 (Ian)
-    repmgr: when executing "standby follow" and "node rejoin", check that
-      it will actually be possible to stream from the target node (Ian)
-    repmgr: "standby switchover": improve handling of connection URIs when
-      executing "node rejoin" on the demotion candidate; GitHub #525 (Ian)
-    repmgr: fix long node ID display in "cluster show" (Ian)
-    repmgr: check for primary server before executing "witness register";
-      GitHub #538 (Ian)
-    repmgr: show "upstream last seen" interval in "daemon status" output (Ian)
-    repmgr: "node check" will only consider physical replication slots (Ian)
+    repmgr: add sanity check for correct extension version (Ian)
+    repmgr: ensure "witness register --dry-run" does not attempt to read node
+      tables if repmgr extension not installed; GitHub #513 (Ian)
     repmgrd: check binary and extension major versions match; GitHub #515 (Ian)
     repmgrd: on a cascaded standby, don't fail over if "failover=manual";
       GitHub #531 (Ian)
     repmgrd: don't consider nodes where repmgrd is not running as promotion
       candidates (Ian)
+    repmgrd: add option "connection_check_type" (Ian)
-4.2.1 2018-??-??
-    repmgr: add sanity check for correct extension version (Ian)
-    repmgr: ensure "witness register --dry-run" does not attempt to read node
-      tables if repmgr extension not installed; GitHub #513 (Ian)
     repmgrd: improve witness monitoring when primary node not available (Ian)
 4.2 2018-10-24

@@ -50,8 +50,8 @@ $(info Building against PostgreSQL $(MAJORVERSION))
 REPMGR_CLIENT_OBJS = repmgr-client.o \
 	repmgr-action-primary.o repmgr-action-standby.o repmgr-action-witness.o \
 	repmgr-action-bdr.o repmgr-action-cluster.o repmgr-action-node.o repmgr-action-daemon.o \
-	configfile.o log.o strutil.o controldata.o dirutil.o compat.o dbutils.o
+	configfile.o log.o strutil.o controldata.o dirutil.o compat.o dbutils.o sysutils.o
-REPMGRD_OBJS = repmgrd.o repmgrd-physical.o repmgrd-bdr.o configfile.o log.o dbutils.o strutil.o controldata.o compat.o
+REPMGRD_OBJS = repmgrd.o repmgrd-physical.o repmgrd-bdr.o configfile.o log.o dbutils.o strutil.o controldata.o compat.o sysutils.o

 DATE=$(shell date "+%Y-%m-%d")

 repmgr_version.h: repmgr_version.h.in

@@ -86,29 +86,15 @@ clean: additional-clean
 maintainer-clean: additional-maintainer-clean

 additional-clean:
-	rm -f repmgr-client.o
-	rm -f repmgr-action-primary.o
-	rm -f repmgr-action-standby.o
-	rm -f repmgr-action-witness.o
-	rm -f repmgr-action-bdr.o
-	rm -f repmgr-action-node.o
-	rm -f repmgr-action-cluster.o
-	rm -f repmgr-action-daemon.o
-	rm -f repmgrd.o
-	rm -f repmgrd-physical.o
-	rm -f repmgrd-bdr.o
-	rm -f compat.o
-	rm -f configfile.o
-	rm -f controldata.o
-	rm -f dbutils.o
-	rm -f dirutil.o
-	rm -f log.o
-	rm -f strutil.o
+	rm -f *.o

-maintainer-additional-clean: clean
-	rm -f configure
+additional-maintainer-clean: clean
+	$(MAKE) -C doc maintainer-clean
 	rm -f config.status config.log
+	rm -f config.h
+	rm -f repmgr_version.h
 	rm -f Makefile
+	rm -f Makefile.global
 	@rm -rf autom4te.cache/

 ifeq ($(MAJORVERSION),$(filter $(MAJORVERSION),9.3 9.4))

@@ -358,6 +358,12 @@ _parse_config(t_configuration_options *options, ItemList *error_list, ItemList *
 	options->primary_notification_timeout = DEFAULT_PRIMARY_NOTIFICATION_TIMEOUT;
 	options->repmgrd_standby_startup_timeout = -1;	/* defaults to "standby_reconnect_timeout" if not set */
 	memset(options->repmgrd_pid_file, 0, sizeof(options->repmgrd_pid_file));
+	options->standby_disconnect_on_failover = false;
+	options->sibling_nodes_disconnect_timeout = DEFAULT_SIBLING_NODES_DISCONNECT_TIMEOUT;
+	options->connection_check_type = CHECK_PING;
+	options->primary_visibility_consensus = false;
+	memset(options->failover_validation_command, 0, sizeof(options->failover_validation_command));
+	options->election_rerun_interval = DEFAULT_ELECTION_RERUN_INTERVAL;

 	/*-------------
 	 * witness settings

@@ -618,6 +624,36 @@ _parse_config(t_configuration_options *options, ItemList *error_list, ItemList *
 		options->repmgrd_standby_startup_timeout = repmgr_atoi(value, name, error_list, 0);
 	else if (strcmp(name, "repmgrd_pid_file") == 0)
 		strncpy(options->repmgrd_pid_file, value, MAXPGPATH);
+	else if (strcmp(name, "standby_disconnect_on_failover") == 0)
+		options->standby_disconnect_on_failover = parse_bool(value, name, error_list);
+	else if (strcmp(name, "sibling_nodes_disconnect_timeout") == 0)
+		options->sibling_nodes_disconnect_timeout = repmgr_atoi(value, name, error_list, 0);
+	else if (strcmp(name, "connection_check_type") == 0)
+	{
+		if (strcasecmp(value, "ping") == 0)
+		{
+			options->connection_check_type = CHECK_PING;
+		}
+		else if (strcasecmp(value, "connection") == 0)
+		{
+			options->connection_check_type = CHECK_CONNECTION;
+		}
+		else if (strcasecmp(value, "query") == 0)
+		{
+			options->connection_check_type = CHECK_QUERY;
+		}
+		else
+		{
+			item_list_append(error_list,
+							 _("value for \"connection_check_type\" must be \"ping\" or \"connection\"\n"));
+		}
+	}
+	else if (strcmp(name, "primary_visibility_consensus") == 0)
+		options->primary_visibility_consensus = parse_bool(value, name, error_list);
+	else if (strcmp(name, "failover_validation_command") == 0)
+		strncpy(options->failover_validation_command, value, sizeof(options->failover_validation_command));
+	else if (strcmp(name, "election_rerun_interval") == 0)
+		options->election_rerun_interval = repmgr_atoi(value, name, error_list, 0);

 	/* witness settings */
 	else if (strcmp(name, "witness_sync_interval") == 0)
@@ -1049,15 +1085,19 @@ parse_time_unit_parameter(const char *name, const char *value, char *dest, ItemL
  * loop is started up; it therefore only needs to reload options required
  * by repmgrd, which are as follows:
  *
- * changeable options:
+ * changeable options (keep the list in "doc/repmgrd-configuration.sgml" in sync
+ * with these):
+ *
  * - async_query_timeout
  * - bdr_local_monitoring_only
  * - bdr_recovery_timeout
+ * - connection_check_type
  * - conninfo
  * - degraded_monitoring_timeout
  * - event_notification_command
  * - event_notifications
  * - failover
+ * - failover_validation_command
  * - follow_command
  * - log_facility
  * - log_file

@@ -1065,12 +1105,19 @@ parse_time_unit_parameter(const char *name, const char *value, char *dest, ItemL
  * - log_status_interval
  * - monitor_interval_secs
  * - monitoring_history
+ * - primary_notification_timeout
+ * - primary_visibility_consensus
  * - promote_command
- * - promote_delay
  * - reconnect_attempts
  * - reconnect_interval
  * - repmgrd_standby_startup_timeout
  * - retry_promote_interval_secs
+ * - sibling_nodes_disconnect_timeout
+ * - standby_disconnect_on_failover
+ *
+ *
+ * Not publicly documented:
+ * - promote_delay
  *
  * non-changeable options (repmgrd references these from the "repmgr.nodes"
  * table, not the configuration file)

@@ -1155,7 +1202,6 @@ reload_config(t_configuration_options *orig_options, t_server_type server_type)
 		return false;
 	}
-
 	/*
 	 * No configuration problems detected - copy any changed values
 	 *

@@ -1205,8 +1251,8 @@ reload_config(t_configuration_options *orig_options, t_server_type server_type)
 	{
 		strncpy(orig_options->conninfo, new_options.conninfo, MAXLEN);
 		log_info(_("\"conninfo\" is now \"%s\""), new_options.conninfo);
 	}

 	PQfinish(conn);
 }

@@ -1284,7 +1330,6 @@ reload_config(t_configuration_options *orig_options, t_server_type server_type)
 		config_changed = true;
 	}
-
 	/* promote_command */
 	if (strncmp(orig_options->promote_command, new_options.promote_command, MAXLEN) != 0)
 	{

@@ -1330,6 +1375,51 @@ reload_config(t_configuration_options *orig_options, t_server_type server_type)
 		config_changed = true;
 	}

+	/* standby_disconnect_on_failover */
+	if (orig_options->standby_disconnect_on_failover != new_options.standby_disconnect_on_failover)
+	{
+		orig_options->standby_disconnect_on_failover = new_options.standby_disconnect_on_failover;
+		log_info(_("\"standby_disconnect_on_failover\" is now \"%s\""),
+				 new_options.standby_disconnect_on_failover == true ? "TRUE" : "FALSE");
+		config_changed = true;
+	}
+
+	/* sibling_nodes_disconnect_timeout */
+	if (orig_options->sibling_nodes_disconnect_timeout != new_options.sibling_nodes_disconnect_timeout)
+	{
+		orig_options->sibling_nodes_disconnect_timeout = new_options.sibling_nodes_disconnect_timeout;
+		log_info(_("\"sibling_nodes_disconnect_timeout\" is now \"%i\""),
+				 new_options.sibling_nodes_disconnect_timeout);
+		config_changed = true;
+	}
+
+	/* connection_check_type */
+	if (orig_options->connection_check_type != new_options.connection_check_type)
+	{
+		orig_options->connection_check_type = new_options.connection_check_type;
+		log_info(_("\"connection_check_type\" is now \"%s\""),
+				 new_options.connection_check_type == CHECK_PING ? "ping" : "connection");
+		config_changed = true;
+	}
+
+	/* primary_visibility_consensus */
+	if (orig_options->primary_visibility_consensus != new_options.primary_visibility_consensus)
+	{
+		orig_options->primary_visibility_consensus = new_options.primary_visibility_consensus;
+		log_info(_("\"primary_visibility_consensus\" is now \"%s\""),
+				 new_options.primary_visibility_consensus == true ? "TRUE" : "FALSE");
+		config_changed = true;
+	}
+
+	/* failover_validation_command */
+	if (strncmp(orig_options->failover_validation_command, new_options.failover_validation_command, MAXPGPATH) != 0)
+	{
+		strncpy(orig_options->failover_validation_command, new_options.failover_validation_command, MAXPGPATH);
+		log_info(_("\"failover_validation_command\" is now \"%s\""), new_options.failover_validation_command);
+		config_changed = true;
+	}
+
 	/*
 	 * Handle changes to logging configuration
 	 */

@@ -37,6 +37,13 @@ typedef enum
 	FAILOVER_AUTOMATIC
 } failover_mode_opt;

+typedef enum
+{
+	CHECK_PING,
+	CHECK_QUERY,
+	CHECK_CONNECTION
+} ConnectionCheckType;
+
 typedef struct EventNotificationListCell
 {
 	struct EventNotificationListCell *next;

@@ -135,6 +142,12 @@ typedef struct
 	int			primary_notification_timeout;
 	int			repmgrd_standby_startup_timeout;
 	char		repmgrd_pid_file[MAXPGPATH];
+	bool		standby_disconnect_on_failover;
+	int			sibling_nodes_disconnect_timeout;
+	ConnectionCheckType connection_check_type;
+	bool		primary_visibility_consensus;
+	char		failover_validation_command[MAXPGPATH];
+	int			election_rerun_interval;

 	/* BDR settings */
 	bool		bdr_local_monitoring_only;

@@ -206,7 +219,8 @@ typedef struct
 	false, -1, \
 	DEFAULT_ASYNC_QUERY_TIMEOUT, \
 	DEFAULT_PRIMARY_NOTIFICATION_TIMEOUT, \
-	-1, "", \
+	-1, "", false, DEFAULT_SIBLING_NODES_DISCONNECT_TIMEOUT, \
+	CHECK_PING, true, "", DEFAULT_ELECTION_RERUN_INTERVAL, \
 	/* BDR settings */ \
 	false, DEFAULT_BDR_RECOVERY_TIMEOUT, \
 	/* service settings */ \

dbutils.c

@@ -821,8 +821,8 @@ begin_transaction(PGconn *conn)
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
 	{
-		log_error(_("unable to begin transaction:\n %s"),
-				  PQerrorMessage(conn));
+		log_error(_("unable to begin transaction"));
+		log_detail("%s", PQerrorMessage(conn));

 		PQclear(res);
 		return false;

@@ -845,8 +845,8 @@ commit_transaction(PGconn *conn)
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
 	{
-		log_error(_("unable to commit transaction:\n %s"),
-				  PQerrorMessage(conn));
+		log_error(_("unable to commit transaction"));
+		log_detail("%s", PQerrorMessage(conn));

 		PQclear(res);
 		return false;

@@ -869,8 +869,8 @@ rollback_transaction(PGconn *conn)
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
 	{
-		log_error(_("unable to rollback transaction:\n %s"),
-				  PQerrorMessage(conn));
+		log_error(_("unable to rollback transaction"));
+		log_detail("%s", PQerrorMessage(conn));

 		PQclear(res);
 		return false;

@@ -1079,7 +1079,7 @@ get_pg_setting(PGconn *conn, const char *setting, char *output)
 		}
 		else
 		{
-			/* XXX highly unlikely this would ever happen */
+			/* highly unlikely this would ever happen */
 			log_error(_("get_pg_setting(): unknown parameter \"%s\""), PQgetvalue(res, i, 0));
 		}
 	}

@@ -1096,6 +1096,56 @@ get_pg_setting(PGconn *conn, const char *setting, char *output)
 }

+bool
+alter_system_int(PGconn *conn, const char *name, int value)
+{
+	PQExpBufferData query;
+	PGresult   *res = NULL;
+	bool		success = false;
+
+	initPQExpBuffer(&query);
+	appendPQExpBuffer(&query,
+					  "ALTER SYSTEM SET %s = %i",
+					  name, value);
+
+	res = PQexec(conn, query.data);
+
+	if (PQresultStatus(res) != PGRES_COMMAND_OK)
+	{
+		log_db_error(conn, query.data, _("alter_system_int() - unable to execute query"));
+		success = false;
+	}
+
+	termPQExpBuffer(&query);
+	PQclear(res);
+
+	return success;
+}
+
+bool
+pg_reload_conf(PGconn *conn)
+{
+	PGresult   *res = NULL;
+	bool		success = false;
+
+	res = PQexec(conn, "SELECT pg_catalog.pg_reload_conf()");
+
+	if (PQresultStatus(res) != PGRES_TUPLES_OK)
+	{
+		log_db_error(conn, NULL, _("pg_reload_conf() - unable to execute query"));
+		success = false;
+	}
+
+	PQclear(res);
+
+	return success;
+}
+
 /* ============================ */
 /* Server information functions */
 /* ============================ */
@@ -1503,6 +1553,8 @@ identify_system(PGconn *repl_conn, t_system_identification *identification)
 	if (PQresultStatus(res) != PGRES_TUPLES_OK || !PQntuples(res))
 	{
+		log_db_error(repl_conn, NULL, _("unable to execute IDENTIFY_SYSTEM"));
+
 		PQclear(res);
 		return false;
 	}

@@ -1621,6 +1673,7 @@ repmgrd_set_local_node_id(PGconn *conn, int local_node_id)
 {
 	PQExpBufferData query;
 	PGresult   *res = NULL;
+	bool		success = true;

 	initPQExpBuffer(&query);

@@ -1629,16 +1682,18 @@ repmgrd_set_local_node_id(PGconn *conn, int local_node_id)
 					  local_node_id);

 	res = PQexec(conn, query.data);
-	termPQExpBuffer(&query);

 	if (PQresultStatus(res) != PGRES_TUPLES_OK)
 	{
-		PQclear(res);
-		return false;
+		log_db_error(conn, query.data, _("repmgrd_set_local_node_id(): unable to execute query"));
+		success = false;
 	}

+	termPQExpBuffer(&query);
 	PQclear(res);

-	return true;
+	return success;
 }

@@ -1854,6 +1909,29 @@ repmgrd_pause(PGconn *conn, bool pause)
 	return success;
 }

+pid_t
+get_wal_receiver_pid(PGconn *conn)
+{
+	PGresult   *res = NULL;
+	pid_t		wal_receiver_pid = UNKNOWN_PID;
+
+	res = PQexec(conn, "SELECT repmgr.get_wal_receiver_pid()");
+
+	if (PQresultStatus(res) != PGRES_TUPLES_OK)
+	{
+		log_error(_("unable to execute \"SELECT repmgr.get_wal_receiver_pid()\""));
+		log_detail("%s", PQerrorMessage(conn));
+	}
+	else if (!PQgetisnull(res, 0, 0))
+	{
+		wal_receiver_pid = atoi(PQgetvalue(res, 0, 0));
+	}
+
+	PQclear(res);
+
+	return wal_receiver_pid;
+}
+
 /* ================ */
 /* result functions */
 /* ================ */

@@ -2082,6 +2160,8 @@ _get_node_record(PGconn *conn, char *sqlquery, t_node_info *node_info, bool init
 	if (PQresultStatus(res) != PGRES_TUPLES_OK)
 	{
+		log_db_error(conn, sqlquery, _("_get_node_record(): unable to execute query"));
+
 		PQclear(res);
 		return RECORD_ERROR;
 	}

@@ -2991,13 +3071,15 @@ update_node_record_conn_priority(PGconn *conn, t_configuration_options *options)
 					  options->node_id);

 	res = PQexec(conn, query.data);
-	termPQExpBuffer(&query);

 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
 	{
+		log_db_error(conn, query.data, _("update_node_record_conn_priority(): unable to execute query"));
 		success = false;
 	}

+	termPQExpBuffer(&query);
 	PQclear(res);

 	return success;

@@ -3464,10 +3546,6 @@ _create_event(PGconn *conn, t_configuration_options *options, int node_id, char
 	/*
 	 * Only attempt to write a record if a connection handle was provided.
-	 * Also check that the repmgr schema has been properly initialised - if
-	 * not it means no configuration file was provided, which can happen with
-	 * e.g. `repmgr standby clone`, and we won't know which schema to write
-	 * to.
 	 */

 	if (conn != NULL && PQstatus(conn) == CONNECTION_OK)
 	{

@@ -4123,7 +4201,8 @@ cancel_query(PGconn *conn, int timeout)
 	 */
 	if (PQcancel(pgcancel, errbuf, ERRBUFF_SIZE) == 0)
 	{
-		log_warning(_("unable to stop current query:\n %s"), errbuf);
+		log_warning(_("unable to cancel current query"));
+		log_detail("%s", errbuf);
 		PQfreeCancel(pgcancel);
 		return false;
 	}

@@ -4141,7 +4220,7 @@ cancel_query(PGconn *conn, int timeout)
  * Returns 1 for success; 0 if any error ocurred; -1 if timeout reached.
  */
 int
-wait_connection_availability(PGconn *conn, long long timeout)
+wait_connection_availability(PGconn *conn, int timeout)
 {
 	PGresult   *res = NULL;
 	fd_set		read_set;

@@ -4150,16 +4229,17 @@ wait_connection_availability(PGconn *conn, long long timeout)
 				before,
 				after;
 	struct timezone tz;
+	long long	timeout_ms;

-	/* recalc to microseconds */
-	timeout *= 1000000;
+	/* calculate timeout in microseconds */
+	timeout_ms = timeout * 1000000;

-	while (timeout > 0)
+	while (timeout_ms > 0)
 	{
 		if (PQconsumeInput(conn) == 0)
 		{
-			log_warning(_("wait_connection_availability(): could not receive data from connection:\n %s"),
-						PQerrorMessage(conn));
+			log_warning(_("wait_connection_availability(): unable to receive data from connection"));
+			log_detail("%s", PQerrorMessage(conn));
 			return 0;
 		}

@@ -4190,17 +4270,17 @@ wait_connection_availability(PGconn *conn, long long timeout)
 		gettimeofday(&after, &tz);

-		timeout -= (after.tv_sec * 1000000 + after.tv_usec) -
+		timeout_ms -= (after.tv_sec * 1000000 + after.tv_usec) -
 			(before.tv_sec * 1000000 + before.tv_usec);
 	}

-	if (timeout >= 0)
+	if (timeout_ms >= 0)
 	{
 		return 1;
 	}

-	log_warning(_("wait_connection_availability(): timeout reached"));
+	log_warning(_("wait_connection_availability(): timeout (%i secs) reached"), timeout);

 	return -1;
 }
@@ -4263,6 +4343,25 @@ connection_ping(PGconn *conn)
 }

+ExecStatusType
+connection_ping_reconnect(PGconn *conn)
+{
+	ExecStatusType ping_result = connection_ping(conn);
+
+	if (PQstatus(conn) != CONNECTION_OK)
+	{
+		log_warning(_("connection error, attempting to reset"));
+		log_detail("%s", PQerrorMessage(conn));
+		PQreset(conn);
+
+		ping_result = connection_ping(conn);
+	}
+
+	log_verbose(LOG_DEBUG, "connection_ping_reconnect(): result is %s", PQresStatus(ping_result));
+
+	return ping_result;
+}

 /* ==================== */
 /* monitoring functions */
@@ -4647,6 +4746,11 @@ get_primary_current_lsn(PGconn *conn)
 	{
 		ptr = parse_lsn(PQgetvalue(res, 0, 0));
 	}
+	else
+	{
+		log_db_error(conn, NULL, _("unable to execute get_primary_current_lsn()"));
+	}

 	PQclear(res);

@@ -4673,6 +4777,10 @@ get_last_wal_receive_location(PGconn *conn)
 	{
 		ptr = parse_lsn(PQgetvalue(res, 0, 0));
 	}
+	else
+	{
+		log_db_error(conn, NULL, _("unable to execute get_last_wal_receive_location()"));
+	}

 	PQclear(res);
@@ -4781,11 +4889,12 @@ init_replication_info(ReplInfo *replication_info)
 	replication_info->replication_lag_time = 0;
 	replication_info->receiving_streamed_wal = true;
 	replication_info->wal_replay_paused = false;
+	replication_info->upstream_last_seen = -1;
 }

 bool
-get_replication_info(PGconn *conn, ReplInfo *replication_info)
+get_replication_info(PGconn *conn, t_server_type node_type, ReplInfo *replication_info)
 {
 	PQExpBufferData query;
 	PGresult   *res = NULL;
@@ -4807,7 +4916,8 @@ get_replication_info(PGconn *conn, ReplInfo *replication_info)
 						 "          END "
 						 "        END AS replication_lag_time, "
 						 "       last_wal_receive_lsn >= last_wal_replay_lsn AS receiving_streamed_wal, "
-						 "       wal_replay_paused "
+						 "       wal_replay_paused, "
+						 "       upstream_last_seen "
 						 "  FROM ( "
 						 " SELECT CURRENT_TIMESTAMP AS ts, "
 						 "        pg_catalog.pg_last_xact_replay_timestamp() AS last_xact_replay_timestamp, ");
@@ -4821,7 +4931,7 @@ get_replication_info(PGconn *conn, ReplInfo *replication_info)
 							 "        CASE WHEN pg_catalog.pg_is_in_recovery() IS FALSE "
 							 "          THEN FALSE "
 							 "          ELSE pg_catalog.pg_is_wal_replay_paused() "
-							 "        END AS wal_replay_paused ");
+							 "        END AS wal_replay_paused, ");
 	}
 	else
 	{
@@ -4843,7 +4953,21 @@ get_replication_info(PGconn *conn, ReplInfo *replication_info)
 							 "        CASE WHEN pg_catalog.pg_is_in_recovery() IS FALSE "
 							 "          THEN FALSE "
 							 "          ELSE pg_catalog.pg_is_xlog_replay_paused() "
-							 "        END AS wal_replay_paused ");
+							 "        END AS wal_replay_paused, ");
 	}
+
+	if (node_type == WITNESS)
+	{
+		appendPQExpBufferStr(&query,
+							 "        repmgr.get_upstream_last_seen() AS upstream_last_seen");
+	}
+	else
+	{
+		appendPQExpBufferStr(&query,
+							 "        CASE WHEN pg_catalog.pg_is_in_recovery() IS FALSE "
+							 "          THEN -1 "
+							 "          ELSE repmgr.get_upstream_last_seen() "
+							 "        END AS upstream_last_seen ");
+	}

 	appendPQExpBufferStr(&query,
@@ -4868,6 +4992,7 @@ get_replication_info(PGconn *conn, ReplInfo *replication_info)
 		replication_info->replication_lag_time = atoi(PQgetvalue(res, 0, 4));
 		replication_info->receiving_streamed_wal = atobool(PQgetvalue(res, 0, 5));
 		replication_info->wal_replay_paused = atobool(PQgetvalue(res, 0, 6));
+		replication_info->upstream_last_seen = atoi(PQgetvalue(res, 0, 7));
 	}

 	termPQExpBuffer(&query);
@@ -5053,7 +5178,7 @@ is_downstream_node_attached(PGconn *conn, char *node_name)

 void
-set_primary_last_seen(PGconn *conn)
+set_upstream_last_seen(PGconn *conn)
 {
 	PQExpBufferData query;
 	PGresult   *res = NULL;
@@ -5061,51 +5186,58 @@ set_upstream_last_seen(PGconn *conn)
 	initPQExpBuffer(&query);

 	appendPQExpBufferStr(&query,
-						 "SELECT repmgr.set_primary_last_seen()");
+						 "SELECT repmgr.set_upstream_last_seen()");

 	res = PQexec(conn, query.data);

 	if (PQresultStatus(res) != PGRES_TUPLES_OK)
 	{
-		log_db_error(conn, query.data, _("unable to execute repmgr.set_primary_last_seen()"));
+		log_db_error(conn, query.data, _("unable to execute repmgr.set_upstream_last_seen()"));
 	}

 	termPQExpBuffer(&query);
 	PQclear(res);
 }

 int
-get_primary_last_seen(PGconn *conn)
+get_upstream_last_seen(PGconn *conn, t_server_type node_type)
 {
 	PQExpBufferData query;
 	PGresult   *res = NULL;
-	int			primary_last_seen = -1;
+	int			upstream_last_seen = -1;

 	initPQExpBuffer(&query);

-	appendPQExpBufferStr(&query,
-						 "SELECT CASE WHEN pg_catalog.pg_is_in_recovery() IS FALSE "
-						 "         THEN -1 "
-						 "         ELSE repmgr.get_primary_last_seen() "
-						 "       END AS primary_last_seen ");
+	if (node_type == WITNESS)
+	{
+		appendPQExpBufferStr(&query,
+							 "SELECT repmgr.get_upstream_last_seen()");
+	}
+	else
+	{
+		appendPQExpBufferStr(&query,
+							 "SELECT CASE WHEN pg_catalog.pg_is_in_recovery() IS FALSE "
+							 "         THEN -1 "
+							 "         ELSE repmgr.get_upstream_last_seen() "
+							 "       END AS upstream_last_seen ");
+	}

 	res = PQexec(conn, query.data);

 	if (PQresultStatus(res) != PGRES_TUPLES_OK)
 	{
-		log_db_error(conn, query.data, _("unable to execute repmgr.get_primary_last_seen()"));
+		log_db_error(conn, query.data, _("unable to execute repmgr.get_upstream_last_seen()"));
 	}
 	else
 	{
-		primary_last_seen = atoi(PQgetvalue(res, 0, 0));
+		upstream_last_seen = atoi(PQgetvalue(res, 0, 0));
 	}

 	termPQExpBuffer(&query);
 	PQclear(res);

-	return primary_last_seen;
+	return upstream_last_seen;
 }


@@ -308,6 +308,7 @@ typedef struct
 	int			replication_lag_time;
 	bool		receiving_streamed_wal;
 	bool		wal_replay_paused;
+	int			upstream_last_seen;
 } ReplInfo;

 typedef struct
@@ -414,6 +415,8 @@ bool		set_config_bool(PGconn *conn, const char *config_param, bool state);
 int			guc_set(PGconn *conn, const char *parameter, const char *op, const char *value);
 int			guc_set_typed(PGconn *conn, const char *parameter, const char *op, const char *value, const char *datatype);
 bool		get_pg_setting(PGconn *conn, const char *setting, char *output);
+bool		alter_system_int(PGconn *conn, const char *name, int value);
+bool		pg_reload_conf(PGconn *conn);

 /* server information functions */
 bool		get_cluster_size(PGconn *conn, char *size);
@@ -435,6 +438,7 @@ pid_t		repmgrd_get_pid(PGconn *conn);
 bool		repmgrd_is_running(PGconn *conn);
 bool		repmgrd_is_paused(PGconn *conn);
 bool		repmgrd_pause(PGconn *conn, bool pause);
+pid_t		get_wal_receiver_pid(PGconn *conn);

 /* extension functions */
 ExtensionStatus get_repmgr_extension_status(PGconn *conn, t_extension_versions *extversions);
@@ -509,12 +513,13 @@ bool		get_tablespace_name_by_location(PGconn *conn, const char *location, char *

 /* asynchronous query functions */
 bool		cancel_query(PGconn *conn, int timeout);
-int			wait_connection_availability(PGconn *conn, long long timeout);
+int			wait_connection_availability(PGconn *conn, int timeout);

 /* node availability functions */
 bool		is_server_available(const char *conninfo);
 bool		is_server_available_params(t_conninfo_param_list *param_list);
 ExecStatusType connection_ping(PGconn *conn);
+ExecStatusType connection_ping_reconnect(PGconn *conn);

 /* monitoring functions */
 void
@@ -549,12 +554,12 @@ XLogRecPtr	get_primary_current_lsn(PGconn *conn);
 XLogRecPtr	get_node_current_lsn(PGconn *conn);
 XLogRecPtr	get_last_wal_receive_location(PGconn *conn);
 void		init_replication_info(ReplInfo *replication_info);
-bool		get_replication_info(PGconn *conn, ReplInfo *replication_info);
+bool		get_replication_info(PGconn *conn, t_server_type node_type, ReplInfo *replication_info);
 int			get_replication_lag_seconds(PGconn *conn);
 void		get_node_replication_stats(PGconn *conn, t_node_info *node_info);
 bool		is_downstream_node_attached(PGconn *conn, char *node_name);
-void		set_primary_last_seen(PGconn *conn);
-int			get_primary_last_seen(PGconn *conn);
+void		set_upstream_last_seen(PGconn *conn);
+int			get_upstream_last_seen(PGconn *conn, t_server_type node_type);
 bool		is_wal_replay_paused(PGconn *conn, bool check_pending_wal);

 /* BDR functions */


@@ -61,7 +61,7 @@ clean:
 maintainer-clean:
 	rm -rf html
-	rm -rf Makefile
+	rm -f Makefile

 zip: html
 	cp -r html repmgr-docs-$(REPMGR_VERSION)


@@ -100,8 +100,7 @@
     and recloning standbys from this.
   </para>
   <para>
-    To minimize downtime during major upgrades, for more recent PostgreSQL
-    versions (PostgreSQL 9.4 and later),
+    To minimize downtime during major upgrades from PostgreSQL 9.4 and later,
     <ulink url="https://www.2ndquadrant.com/en/resources/pglogical/">pglogical</ulink>
     can be used to set up a parallel cluster using the newer PostgreSQL version,
     which can be kept in sync with the existing production cluster until the


@@ -32,7 +32,7 @@
 REPMGRD_OPTS="--daemonize=false"</programlisting>
       </para>
       <para>
-        For further details, see <link linkend="repmgrd-configuration-debian-ubuntu">repmgrd daemon configuration on Debian/Ubuntu</link>.
+        For further details, see <link linkend="repmgrd-configuration-debian-ubuntu">repmgrd configuration on Debian/Ubuntu</link>.
       </para>
     </important>
@@ -72,9 +72,9 @@ REPMGRD_OPTS="--daemonize=false"</programlisting>
         </para>
         <note>
           <para>
-            For these commands to work reliably, the configuration file settings
+            These commands require the configuration file settings
             <varname>repmgrd_service_start_command</varname> and <varname>repmgrd_service_stop_command</varname>
-            should be set in <filename>repmgr.conf</filename>.
+            in <filename>repmgr.conf</filename> to be set.
           </para>
         </note>
       </listitem>
@@ -82,8 +82,8 @@ REPMGRD_OPTS="--daemonize=false"</programlisting>
       <listitem>
         <para>
           <link linkend="repmgr-daemon-status"><command>repmgr daemon status</command></link>
-          displays the interval (in seconds) since the <application>repmgrd</application> instance
-          last verified its upstream node was available.
+          additionally displays the node priority and the interval (in seconds) since the
+          <application>repmgrd</application> instance last verified its upstream node was available.
         </para>
       </listitem>
@@ -132,7 +132,7 @@ REPMGRD_OPTS="--daemonize=false"</programlisting>
       <listitem>
         <para>
-          Add check <link linkend="repmgr-standby-switchover"><command>repmgr standby switchover</command></link>
+          Add check to <link linkend="repmgr-standby-switchover"><command>repmgr standby switchover</command></link>
          to ensure the data directory on the demotion candidate is configured correctly in <filename>repmgr.conf</filename>.
          This is to ensure that &repmgr;, when remotely executed on the demotion candidate, can correctly verify
          that PostgreSQL on the demotion candidate was shut down cleanly. GitHub #523.
@@ -161,6 +161,33 @@ REPMGRD_OPTS="--daemonize=false"</programlisting>
        </para>
      </listitem>
+     <listitem>
+       <para>
+         Add option <option>connection_check_type</option> to enable selection of the method
+         <application>repmgrd</application> uses to determine whether the upstream node is available.
+       </para>
+       <para>
+         Possible values are <literal>ping</literal> (default; uses <command>PQping()</command> to
+         determine server availability), <literal>connection</literal> (attempts to make a new connection to
+         the upstream node), and <literal>query</literal> (determines server availability
+         by executing an SQL statement on the node via the existing connection).
+       </para>
+     </listitem>
+     <listitem>
+       <para>
+         New configuration option <link linkend="repmgrd-failover-validation"><option>failover_validation_command</option></link>
+         to allow an external mechanism to validate the failover decision made by <application>repmgrd</application>.
+       </para>
+     </listitem>
+     <listitem>
+       <para>
+         New configuration option <link linkend="repmgrd-standby-disconnection-on-failover"><option>standby_disconnect_on_failover</option></link>
+         to force standbys to disconnect their WAL receivers before making a failover decision.
+       </para>
+     </listitem>
    </itemizedlist>
  </para>
 </sect2>
@@ -185,6 +212,14 @@ REPMGRD_OPTS="--daemonize=false"</programlisting>
        </para>
      </listitem>
+     <listitem>
+       <para>
+         &repmgr;: when executing <link linkend="repmgr-standby-clone"><command>repmgr standby clone</command></link>,
+         recheck primary/upstream connection(s) after the data copy operation is complete, as these may
+         have gone away.
+       </para>
+     </listitem>
      <listitem>
        <para>
          &repmgr;: when executing <link linkend="repmgr-standby-switchover"><command>repmgr standby switchover</command></link>,


@@ -1,15 +1,15 @@
-<sect1 id="configuration-file" xreflabel="configuration file location">
+<sect1 id="configuration-file" xreflabel="configuration file">
  <indexterm>
    <primary>repmgr.conf</primary>
-   <secondary>location</secondary>
  </indexterm>

  <indexterm>
    <primary>configuration</primary>
-   <secondary>repmgr.conf location</secondary>
+   <secondary>repmgr.conf</secondary>
  </indexterm>

- <title>Configuration file location</title>
+ <title>Configuration file</title>

 <para>
  <application>repmgr</application> and <application>repmgrd</application>
  use a common configuration file, by default called
@@ -21,6 +21,55 @@
  for more details.
 </para>

+ <sect2 id="configuration-file-format" xreflabel="configuration file format">
+  <indexterm>
+    <primary>repmgr.conf</primary>
+    <secondary>format</secondary>
+  </indexterm>
+
+  <title>Configuration file format</title>
+
+  <para>
+    <filename>repmgr.conf</filename> is a plain text file with one parameter/value
+    combination per line.
+  </para>
+  <para>
+    Whitespace is insignificant (except within a quoted parameter value) and blank lines are ignored.
+    Hash marks (<literal>#</literal>) designate the remainder of the line as a comment.
+    Parameter values that are not simple identifiers or numbers should be single-quoted.
+    Note that a single quote cannot be embedded in a parameter value.
+  </para>
+  <important>
+    <para>
+      &repmgr; will interpret double-quotes as being part of a string value; only use single quotes
+      to quote parameter values.
+    </para>
+  </important>
+  <para>
+    Example of a valid <filename>repmgr.conf</filename> file:
+    <programlisting>
+# repmgr.conf
+
+node_id=1
+node_name= node1
+conninfo ='host=node1 dbname=repmgr user=repmgr connect_timeout=2'
+data_directory = /var/lib/pgsql/11/data</programlisting>
+  </para>
+ </sect2>
+ <sect2 id="configuration-file-location" xreflabel="configuration file location">
+  <indexterm>
+    <primary>repmgr.conf</primary>
+    <secondary>location</secondary>
+  </indexterm>
+
+  <title>Configuration file location</title>
+
 <para>
  The configuration file will be searched for in the following locations:
  <itemizedlist spacing="compact" mark="bullet">
@@ -50,7 +99,7 @@
  Note that if a file is explicitly specified with <literal>-f/--config-file</literal>,
  an error will be raised if it is not found or not readable, and no attempt will be made to
  check default locations; this is to prevent <application>repmgr</application> unexpectedly
- reading the wrong configuraton file.
+ reading the wrong configuration file.
 </para>

 <note>
@@ -66,4 +115,6 @@
    <filename>/path/to/repmgr.conf</filename>).
   </para>
  </note>
+
+ </sect2>
 </sect1>


@@ -1,7 +1,6 @@
 <chapter id="using-witness-server">
  <indexterm>
    <primary>witness server</primary>
-   <seealso>Using a witness server with repmgrd</seealso>
  </indexterm>
@@ -9,8 +8,9 @@
  <para>
    A <xref linkend="witness-server"> is a normal PostgreSQL instance which
    is not part of the streaming replication cluster; its purpose is, if a
-   failover situation occurs, to provide proof that the primary server
-   itself is unavailable.
+   failover situation occurs, to provide proof that it is the primary server
+   itself which is unavailable, rather than e.g. a network split between
+   different physical locations.
  </para>
  <para>
@@ -20,7 +20,7 @@
    if the primary becomes unavailable it's possible for the standby to decide whether
    it can promote itself without risking a "split brain" scenario: if it can't see either the
    witness or the primary server, it's likely there's a network-level interruption
-   and it should not promote itself. If it can seen the witness but not the primary,
+   and it should not promote itself. If it can see the witness but not the primary,
    this proves there is no network interruption and the primary itself is unavailable,
    and it can therefore promote itself (and ideally take action to fence the
    former primary).
@@ -53,7 +53,7 @@
    in the same physical location as the cluster's primary server.
  </para>
  <para>
-   This instance should *not* be on the same physical host as the primary server,
+   This instance should <emphasis>not</emphasis> be on the same physical host as the primary server,
    as otherwise if the primary server fails due to hardware issues, the witness
    server will be lost too.
  </para>


@@ -88,7 +88,7 @@
    <para>
      The values provided for <literal>%t</literal> and <literal>%d</literal>
-     will probably contain spaces, so should be quoted in the provided command
+     may contain spaces, so should be quoted in the provided command
      configuration, e.g.:
      <programlisting>
event_notification_command='/path/to/some/script %n %e %s "%t" "%d"'


@@ -50,16 +50,10 @@
 <!ENTITY event-notifications SYSTEM "event-notifications.sgml">
 <!ENTITY upgrading-repmgr SYSTEM "upgrading-repmgr.sgml">
-<!ENTITY repmgrd-overview SYSTEM "repmgrd-overview.sgml">
 <!ENTITY repmgrd-automatic-failover SYSTEM "repmgrd-automatic-failover.sgml">
 <!ENTITY repmgrd-configuration SYSTEM "repmgrd-configuration.sgml">
-<!ENTITY repmgrd-demonstration SYSTEM "repmgrd-demonstration.sgml">
-<!ENTITY repmgrd-monitoring SYSTEM "repmgrd-monitoring.sgml">
-<!ENTITY repmgrd-degraded-monitoring SYSTEM "repmgrd-degraded-monitoring.sgml">
-<!ENTITY repmgrd-cascading-replication SYSTEM "repmgrd-cascading-replication.sgml">
-<!ENTITY repmgrd-network-split SYSTEM "repmgrd-network-split.sgml">
-<!ENTITY repmgrd-witness-server SYSTEM "repmgrd-witness-server.sgml">
-<!ENTITY repmgrd-pausing SYSTEM "repmgrd-pausing.sgml">
-<!ENTITY repmgrd-notes SYSTEM "repmgrd-notes.sgml">
+<!ENTITY repmgrd-operation SYSTEM "repmgrd-operation.sgml">
 <!ENTITY repmgrd-bdr SYSTEM "repmgrd-bdr.sgml">
 <!ENTITY repmgr-primary-register SYSTEM "repmgr-primary-register.sgml">


@@ -196,11 +196,31 @@
       </listitem>
     </varlistentry>

+    <varlistentry>
+      <term><option>ERR_BAD_CONFIG (1)</option></term>
+      <listitem>
+        <para>
+          An issue was encountered while attempting to retrieve
+          &repmgr; metadata.
+        </para>
+      </listitem>
+    </varlistentry>
+
+    <varlistentry>
+      <term><option>ERR_DB_CONN (6)</option></term>
+      <listitem>
+        <para>
+          &repmgr; was unable to connect to the local PostgreSQL instance.
+        </para>
+      </listitem>
+    </varlistentry>
+
     <varlistentry>
       <term><option>ERR_NODE_STATUS (25)</option></term>
       <listitem>
         <para>
-          One or more issues were detected.
+          One or more issues were detected with the replication configuration,
+          e.g. a node was not in its expected state.
         </para>
       </listitem>
     </varlistentry>


@@ -33,7 +33,10 @@
     <command>repmgr daemon status</command> can be executed on any active node in the
     replication cluster. A valid <filename>repmgr.conf</filename> file is required.
   </para>
+  <para>
+    If PostgreSQL is not running on a node, &repmgr; will not be able to determine the
+    status of that node's <application>repmgrd</application> instance.
+  </para>
   <note>
     <para>
       After restarting PostgreSQL on any node, the <application>repmgrd</application> instance
@@ -126,19 +129,19 @@
     <listitem>
       <simpara>
-        <application>repmgrd</application> running (1 = running, 0 = not running)
+        <application>repmgrd</application> running (1 = running, 0 = not running, -1 = unknown)
       </simpara>
     </listitem>
     <listitem>
       <simpara>
-        <application>repmgrd</application> PID (-1 if not running)
+        <application>repmgrd</application> PID (-1 if not running or status unknown)
       </simpara>
     </listitem>
     <listitem>
       <simpara>
-        <application>repmgrd</application> paused (1 = paused, 0 = not paused)
+        <application>repmgrd</application> paused (1 = paused, 0 = not paused, -1 = unknown)
       </simpara>
     </listitem>
@@ -150,7 +153,7 @@
     <listitem>
       <simpara>
-        interval in seconds since the node's upstream was last seen
+        interval in seconds since the node's upstream was last seen (this will be -1 if the value could not be retrieved, or if the node is a primary)
       </simpara>
     </listitem>


@@ -99,7 +99,7 @@
     </indexterm>
     <simpara>
       <literal>promote_check_interval</literal>:
-      interval (in seconds, default: 1 seconds) to wait between each check
+      interval (in seconds, default: 1 second) to wait between each check
       to determine whether the standby has been promoted.
     </simpara>
   </listitem>


@@ -29,21 +29,21 @@
 </para>

 <para>
- &repmgr; was developed by
+ &repmgr; is developed by
  <ulink url="https://2ndquadrant.com">2ndQuadrant</ulink>
  along with contributions from other individuals and companies.
  Contributions from the community are appreciated and welcome - get
- in touch via <ulink url="https://github.com/2ndQuadrant/repmgr">github</>
- or <ulink url="https://groups.google.com/group/repmgr">the mailing list/forum</>.
+ in touch via <ulink url="https://github.com/2ndQuadrant/repmgr">github</ulink>
+ or <ulink url="https://groups.google.com/group/repmgr">the mailing list/forum</ulink>.
  Multiple 2ndQuadrant customers contribute funding
  to make repmgr development possible.
 </para>

 <para>
- 2ndQuadrant, a Platinum sponsor of the PostgreSQL project,
- continues to develop repmgr to meet internal needs and those of customers.
- Other companies as well as individual developers
- are welcome to participate in the efforts.
+ &repmgr; is fully supported by 2ndQuadrant's
+ <ulink url="https://www.2ndquadrant.com/en/support/support-postgresql/">24/7 Production Support</ulink>.
+ 2ndQuadrant, a Major Sponsor of the PostgreSQL project, continues to develop and maintain &repmgr;.
+ Other companies as well as individual developers are welcome to participate in the efforts.
 </para>
 </abstract>
@@ -80,16 +80,10 @@
 <part id="using-repmgrd">
  <title>Using repmgrd</title>

- &repmgrd-overview;
  &repmgrd-automatic-failover;
  &repmgrd-configuration;
- &repmgrd-demonstration;
- &repmgrd-cascading-replication;
- &repmgrd-network-split;
- &repmgrd-witness-server;
- &repmgrd-pausing;
- &repmgrd-degraded-monitoring;
- &repmgrd-monitoring;
- &repmgrd-notes;
+ &repmgrd-operation;
  &repmgrd-bdr;
 </part>


@@ -13,5 +13,230 @@
providing monitoring information about the state of each standby. providing monitoring information about the state of each standby.
</para> </para>
<sect1 id="repmgrd-witness-server" xreflabel="Using a witness server with repmgrd">
<indexterm>
<primary>repmgrd</primary>
<secondary>witness server</secondary>
</indexterm>
<indexterm>
<primary>witness server</primary>
<secondary>repmgrd</secondary>
</indexterm>
<title>Using a witness server with repmgrd</title>
<para>
In a situation caused e.g. by a network interruption between two
data centres, it's important to avoid a &quot;split-brain&quot; situation where
both sides of the network assume they are the active segment and the
side without an active primary unilaterally promotes one of its standbys.
</para>
<para>
To prevent this situation happening, it's essential to ensure that one
network segment has a &quot;voting majority&quot;, so other segments will know
they're in the minority and not attempt to promote a new primary. Where
an odd number of servers exists, this is not an issue. However, if each
network has an even number of nodes, it's necessary to provide some way
of ensuring a majority, which is where the witness server becomes useful.
</para>
<para>
This is not a fully-fledged standby node and is not integrated into
replication, but it effectively represents the &quot;casting vote&quot; when
deciding which network segment has a majority. A witness server can
be set up using <link linkend="repmgr-witness-register"><command>repmgr witness register</command></link>;
see also section <link linkend="using-witness-server">Using a witness server</link>.
</para>
<note>
<para>
It only
makes sense to create a witness server in conjunction with running
<application>repmgrd</application>; the witness server will require its own
<application>repmgrd</application> instance.
</para>
</note>
</sect1>
<sect1 id="repmgrd-network-split" xreflabel="Handling network splits with repmgrd">
<indexterm>
<primary>repmgrd</primary>
<secondary>network splits</secondary>
</indexterm>
<indexterm>
<primary>network splits</primary>
</indexterm>
<title>Handling network splits with repmgrd</title>
<para>
A common pattern for replication cluster setups is to spread servers over
more than one datacentre. This can provide benefits such as geographically-
distributed read replicas and DR (disaster recovery capability). However
this also means there is a risk of disconnection at network level between
datacentre locations, which would result in a split-brain scenario if
servers in a secondary data centre were no longer able to see the primary
in the main data centre and promoted a standby among themselves.
</para>
<para>
&repmgr; enables provision of &quot;<xref linkend="witness-server">&quot; to
artificially create a quorum of servers in a particular location, ensuring
that nodes in another location will not elect a new primary if they
are unable to see the majority of nodes. However this approach does not
scale well, particularly with more complex replication setups, e.g.
where the majority of nodes are located outside of the primary datacentre.
It also means the <literal>witness</literal> node needs to be managed as an
extra PostgreSQL instance outside of the main replication cluster, which
adds administrative and programming complexity.
</para>
<para>
<literal>repmgr4</literal> introduces the concept of <literal>location</literal>:
each node is associated with an arbitrary location string (default is
<literal>default</literal>); this is set in <filename>repmgr.conf</filename>, e.g.:
<programlisting>
node_id=1
node_name=node1
conninfo='host=node1 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/postgresql/data'
location='dc1'</programlisting>
</para>
<para>
In a failover situation, <application>repmgrd</application> will check if any servers in the
same location as the current primary node are visible. If not, <application>repmgrd</application>
will assume a network interruption and not promote any node in any
other location (it will however enter <link linkend="repmgrd-degraded-monitoring">degraded monitoring</link>
mode until a primary becomes visible).
</para>
</sect1>
<sect1 id="repmgrd-standby-disconnection-on-failover" xreflabel="Standby disconnection on failover">
<indexterm>
<primary>repmgrd</primary>
<secondary>standby disconnection on failover</secondary>
</indexterm>
<indexterm>
<primary>standby disconnection on failover</primary>
</indexterm>
<title>Standby disconnection on failover</title>
<para>
If <option>standby_disconnect_on_failover</option> is set to <literal>true</literal> in
<filename>repmgr.conf</filename>, in a failover situation <application>repmgrd</application> will forcibly disconnect
the local node's WAL receiver before making a failover decision.
</para>
<note>
<para>
<option>standby_disconnect_on_failover</option> is available from PostgreSQL 9.5 and later.
Additionally this requires that the <literal>repmgr</literal> database user is a superuser.
</para>
</note>
<para>
By doing this, it's possible to ensure that, at the point the failover decision is made, no nodes
are receiving data from the primary and their LSN location will be static.
</para>
<important>
<para>
<option>standby_disconnect_on_failover</option> <emphasis>must</emphasis> be set to the same value on
all nodes.
</para>
</important>
<para>
Note that when using <option>standby_disconnect_on_failover</option> there will be a delay of 5 seconds
plus however many seconds it takes to confirm the WAL receiver is disconnected before
<application>repmgrd</application> proceeds with the failover decision.
</para>
<para>
Following the failover operation, no matter what the outcome, each node will reconnect its WAL receiver.
</para>
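The behaviour described above can be sketched in repmgr.conf as follows (values illustrative; both settings must be identical on all nodes):

```ini
# repmgr.conf - must be set identically on all nodes
standby_disconnect_on_failover=true
# maximum time (in seconds) to wait for other standbys to confirm
# they have disconnected their WAL receivers
sibling_nodes_disconnect_timeout=30
```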
</sect1>
<sect1 id="repmgrd-failover-validation" xreflabel="Failover validation">
<indexterm>
<primary>repmgrd</primary>
<secondary>failover validation</secondary>
</indexterm>
<indexterm>
<primary>failover validation</primary>
</indexterm>
<title>Failover validation</title>
<para>
From <link linkend="release-4.3">repmgr 4.3</link>, &repmgr; makes it possible to provide a script
to <application>repmgrd</application> which, in a failover situation,
will be executed by the promotion candidate (the node which has been selected
to be the new primary) to confirm whether the node should actually be promoted.
</para>
<para>
    To use this, set <option>failover_validation_command</option> in <filename>repmgr.conf</filename>
    to a script executable by the <literal>postgres</literal> system user, e.g.:
<programlisting>
failover_validation_command=/path/to/script.sh %n %a</programlisting>
</para>
<para>
The <literal>%n</literal> parameter will be replaced with the node ID, and the
<literal>%a</literal> parameter will be replaced by the node name when the script is executed.
</para>
<para>
This script must return an exit code of <literal>0</literal> to indicate the node should promote itself.
Any other value will result in the promotion being aborted and the election rerun.
There is a pause of <option>election_rerun_interval</option> seconds before the election is rerun.
</para>
<para>
Sample <application>repmgrd</application> log file output during which the failover validation
script rejects the proposed promotion candidate:
<programlisting>
[2019-03-13 21:01:30] [INFO] visible nodes: 2; total nodes: 2; no nodes have seen the primary within the last 4 seconds
[2019-03-13 21:01:30] [NOTICE] promotion candidate is "node2" (ID: 2)
[2019-03-13 21:01:30] [NOTICE] executing "failover_validation_command"
[2019-03-13 21:01:30] [DETAIL] /usr/local/bin/failover-validation.sh 2
[2019-03-13 21:01:30] [INFO] output returned by failover validation command:
Node ID: 2
[2019-03-13 21:01:30] [NOTICE] failover validation command returned a non-zero value: "1"
[2019-03-13 21:01:30] [NOTICE] promotion candidate election will be rerun
[2019-03-13 21:01:30] [INFO] 1 followers to notify
[2019-03-13 21:01:30] [NOTICE] notifying node "node3" (node ID: 3) to rerun promotion candidate selection
INFO: node 3 received notification to rerun promotion candidate election
[2019-03-13 21:01:30] [NOTICE] rerunning election after 15 seconds ("election_rerun_interval")</programlisting>
</para>
</sect1>
<sect1 id="cascading-replication" xreflabel="Cascading replication">
<indexterm>
<primary>repmgrd</primary>
<secondary>cascading replication</secondary>
</indexterm>
<indexterm>
<primary>cascading replication</primary>
<secondary>repmgrd</secondary>
</indexterm>
<title>repmgrd and cascading replication</title>
<para>
Cascading replication - where a standby can connect to an upstream node and not
the primary server itself - was introduced in PostgreSQL 9.2. &repmgr; and
<application>repmgrd</application> support cascading replication by keeping track of the relationship
between standby servers - each node record is stored with the node id of its
upstream ("parent") server (except of course the primary server).
</para>
<para>
In a failover situation where the primary node fails and a top-level standby
is promoted, a standby connected to another standby will not be affected
and continue working as normal (even if the upstream standby it's connected
to becomes the primary node). If however the node's direct upstream fails,
the &quot;cascaded standby&quot; will attempt to reconnect to that node's parent
(unless <varname>failover</varname> is set to <literal>manual</literal> in
<filename>repmgr.conf</filename>).
</para>
</sect1>
</chapter> </chapter>

View File

@@ -1,24 +0,0 @@
<chapter id="repmgrd-cascading-replication">
<indexterm>
<primary>repmgrd</primary>
<secondary>cascading replication</secondary>
</indexterm>
<title>repmgrd and cascading replication</title>
<para>
Cascading replication - where a standby can connect to an upstream node and not
the primary server itself - was introduced in PostgreSQL 9.2. &repmgr; and
<application>repmgrd</application> support cascading replication by keeping track of the relationship
between standby servers - each node record is stored with the node id of its
upstream ("parent") server (except of course the primary server).
</para>
<para>
In a failover situation where the primary node fails and a top-level standby
is promoted, a standby connected to another standby will not be affected
and continue working as normal (even if the upstream standby it's connected
to becomes the primary node). If however the node's direct upstream fails,
the &quot;cascaded standby&quot; will attempt to reconnect to that node's parent
(unless <varname>failover</varname> is set to <literal>manual</literal> in
<filename>repmgr.conf</filename>).
</para>
</chapter>

View File

@@ -5,7 +5,7 @@
<secondary>configuration</secondary> <secondary>configuration</secondary>
</indexterm> </indexterm>
<title>repmgrd configuration</title> <title>repmgrd setup and configuration</title>
<para> <para>
<application>repmgrd</application> is a daemon which runs on each PostgreSQL node, <application>repmgrd</application> is a daemon which runs on each PostgreSQL node,
@@ -20,7 +20,7 @@
</para> </para>
<sect1 id="repmgrd-basic-configuration"> <sect1 id="repmgrd-basic-configuration">
<title>repmgrd basic configuration</title> <title>repmgrd configuration</title>
<para> <para>
To use <application>repmgrd</application>, its associated function library <emphasis>must</emphasis> be To use <application>repmgrd</application>, its associated function library <emphasis>must</emphasis> be
@@ -34,21 +34,206 @@
the <ulink url="https://www.postgresql.org/docs/current/runtime-config-client.html#GUC-SHARED-PRELOAD-LIBRARIES">PostgreSQL documentation</ulink>. the <ulink url="https://www.postgresql.org/docs/current/runtime-config-client.html#GUC-SHARED-PRELOAD-LIBRARIES">PostgreSQL documentation</ulink>.
</para> </para>
<para>
The following configuration options apply to <application>repmgrd</application> in all circumstances:
</para>
<variablelist>
<varlistentry>
<indexterm>
<primary>monitor_interval_secs</primary>
</indexterm>
<term><option>monitor_interval_secs</option></term>
<listitem>
<para>
          The interval (in seconds, default: <literal>2</literal>) at which to check the availability of the upstream node.
</para>
</listitem>
</varlistentry>
<varlistentry>
<indexterm>
<primary>connection_check_type</primary>
</indexterm>
<term><option>connection_check_type</option></term>
<listitem>
<para>
The option <option>connection_check_type</option> is used to select the method
<application>repmgrd</application> uses to determine whether the upstream node is available.
</para>
<para>
Possible values are:
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<simpara>
<literal>ping</literal> (default) - uses <command>PQping()</command> to
determine server availability
</simpara>
</listitem>
<listitem>
<simpara>
              <literal>connection</literal> - determines server availability
              by attempting to make a new connection to the upstream node
</simpara>
</listitem>
<listitem>
<simpara>
<literal>query</literal> - determines server availability
by executing an SQL statement on the node via the existing connection
</simpara>
</listitem>
</itemizedlist>
</para>
</listitem>
</varlistentry>
<varlistentry>
<indexterm>
<primary>reconnect_attempts</primary>
</indexterm>
<term><option>reconnect_attempts</option></term>
<listitem>
<para>
          The number of attempts (default: <literal>6</literal>) that will be made to reconnect to an unreachable
          upstream node before initiating a failover.
</para>
<para>
There will be an interval of <option>reconnect_interval</option> seconds between each reconnection
attempt.
</para>
</listitem>
</varlistentry>
<varlistentry>
<indexterm>
<primary>reconnect_interval</primary>
</indexterm>
<term><option>reconnect_interval</option></term>
<listitem>
<para>
Interval (in seconds, default: <literal>10</literal>) between attempts to reconnect to an unreachable
upstream node.
</para>
<para>
The number of reconnection attempts is defined by the parameter <option>reconnect_attempts</option>.
</para>
</listitem>
</varlistentry>
<varlistentry>
<indexterm>
<primary>degraded_monitoring_timeout</primary>
</indexterm>
<term><option>degraded_monitoring_timeout</option></term>
<listitem>
<para>
Interval (in seconds) after which <application>repmgrd</application> will terminate if
          either of the servers (local node and/or upstream node) being monitored is no longer available
(<link linkend="repmgrd-degraded-monitoring">degraded monitoring mode</link>).
</para>
<para>
<literal>-1</literal> (default) disables this timeout completely.
</para>
</listitem>
</varlistentry>
</variablelist>
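Taken together, these settings determine how quickly an upstream failure is acted upon. An illustrative repmgr.conf sketch (values are examples, not recommendations):

```ini
# repmgr.conf (illustrative values)
monitor_interval_secs=2        # check upstream availability every 2 seconds
connection_check_type=ping     # use PQping() for the availability check
reconnect_attempts=6           # after a failure, retry 6 times...
reconnect_interval=10          # ...at 10-second intervals
# => failover handling begins roughly 6 x 10 = 60 seconds after
#    the upstream first becomes unreachable
degraded_monitoring_timeout=-1 # remain in degraded monitoring indefinitely
```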
<para>
See also <filename><ulink url="https://raw.githubusercontent.com/2ndQuadrant/repmgr/master/repmgr.conf.sample">repmgr.conf.sample</ulink></filename> for an annotated sample configuration file.
</para>
<sect2 id="repmgrd-automatic-failover-configuration"> <sect2 id="repmgrd-automatic-failover-configuration">
<title>Automatic failover configuration</title> <title>Required configuration for automatic failover</title>
<para> <para>
If using automatic failover, the following <application>repmgrd</application> options *must* be set in The following <application>repmgrd</application> options <emphasis>must</emphasis> be set in
<filename>repmgr.conf</filename>: <filename>repmgr.conf</filename>:
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<simpara><option>failover</option></simpara>
</listitem>
<listitem>
<simpara><option>promote_command</option></simpara>
</listitem>
<listitem>
<simpara><option>follow_command</option></simpara>
</listitem>
</itemizedlist>
</para>
<para>
Example:
<programlisting> <programlisting>
failover=automatic failover=automatic
promote_command='/usr/bin/repmgr standby promote -f /etc/repmgr.conf --log-to-file' promote_command='/usr/bin/repmgr standby promote -f /etc/repmgr.conf --log-to-file'
follow_command='/usr/bin/repmgr standby follow -f /etc/repmgr.conf --log-to-file --upstream-node-id=%n'</programlisting> follow_command='/usr/bin/repmgr standby follow -f /etc/repmgr.conf --log-to-file --upstream-node-id=%n'</programlisting>
</para> </para>
<para> <para>
Adjust file paths as appropriate; always specify the full path to the &repmgr; binary.	Details of each option are as follows:
</para>
<variablelist>
<varlistentry>
<indexterm>
<primary>failover</primary>
</indexterm>
<term><option>failover</option></term>
<listitem>
<para>
<option>failover</option> can be one of <literal>automatic</literal> or <literal>manual</literal>.
</para>
<note>
<para>
If <option>failover</option> is set to <literal>manual</literal>, <application>repmgrd</application>
will not take any action if a failover situation is detected, and the node may need to
be modified manually (e.g. by executing <command><link linkend="repmgr-standby-follow">repmgr standby follow</link></command>).
</para>
</note>
</listitem>
</varlistentry>
<varlistentry>
<indexterm>
<primary>promote_command</primary>
</indexterm>
<term><option>promote_command</option></term>
<listitem>
<para>
The program or script defined in <option>promote_command</option> will be executed
in a failover situation when <application>repmgrd</application> determines that
the current node is to become the new primary node.
</para>
<para>
Normally <option>promote_command</option> is set as &repmgr;'s
<command><link linkend="repmgr-standby-promote">repmgr standby promote</link></command> command.
</para>
<para>
            It is also possible to provide a shell script, e.g. to perform user-defined tasks
            before promoting the current node. In this case the script <emphasis>must</emphasis>
at some point execute <command><link linkend="repmgr-standby-promote">repmgr standby promote</link></command>
to promote the node; if this is not done, &repmgr; metadata will not be updated and
&repmgr; will no longer function reliably.
</para>
<para>
Example:
<programlisting>
promote_command='/usr/bin/repmgr standby promote -f /etc/repmgr.conf --log-to-file'</programlisting>
</para> </para>
<para>
Note that the <literal>--log-to-file</literal> option will cause
output generated by the &repmgr; command, when executed by <application>repmgrd</application>,
to be logged to the same destination configured to receive log output for <application>repmgrd</application>.
</para>
<note> <note>
<para> <para>
&repmgr; will not apply <option>pg_bindir</option> when executing <option>promote_command</option> &repmgr; will not apply <option>pg_bindir</option> when executing <option>promote_command</option>
@@ -56,51 +241,204 @@
specified with the full path. specified with the full path.
</para> </para>
</note> </note>
</listitem>
</varlistentry>
<varlistentry>
<indexterm>
<primary>follow_command</primary>
</indexterm>
<term><option>follow_command</option></term>
<listitem>
<para>
The program or script defined in <option>follow_command</option> will be executed
in a failover situation when <application>repmgrd</application> determines that
the current node is to follow the new primary node.
</para>
<para>
Normally <option>follow_command</option> is set as &repmgr;'s
            <command><link linkend="repmgr-standby-follow">repmgr standby follow</link></command> command.
</para>
<para>
The <option>follow_command</option> parameter
should provide the <literal>--upstream-node-id=%n</literal>
option to <command>repmgr standby follow</command>; the <literal>%n</literal> will be replaced by
<application>repmgrd</application> with the ID of the new primary node. If this is not provided,
<command>repmgr standby follow</command> will attempt to determine the new primary by itself, but if the
original primary comes back online after the new primary is promoted, there is a risk that
<command>repmgr standby follow</command> will result in the node continuing to follow
the original primary.
</para>
<para>
            It is also possible to provide a shell script, e.g. to perform user-defined tasks
            before attaching the current node to the new primary. In this case the script <emphasis>must</emphasis>
            at some point execute <command><link linkend="repmgr-standby-follow">repmgr standby follow</link></command>
            to attach the node to the new primary; if this is not done, &repmgr; metadata will not be updated and
&repmgr; will no longer function reliably.
</para>
<para>
Example:
<programlisting>
follow_command='/usr/bin/repmgr standby follow -f /etc/repmgr.conf --log-to-file --upstream-node-id=%n'</programlisting>
</para>
<para> <para>
Note that the <literal>--log-to-file</literal> option will cause Note that the <literal>--log-to-file</literal> option will cause
output generated by the &repmgr; command, when executed by <application>repmgrd</application>, output generated by the &repmgr; command, when executed by <application>repmgrd</application>,
to be logged to the same destination configured to receive log output for <application>repmgrd</application>. to be logged to the same destination configured to receive log output for <application>repmgrd</application>.
See <filename><ulink url="https://raw.githubusercontent.com/2ndQuadrant/repmgr/master/repmgr.conf.sample">repmgr.conf.sample</ulink></filename>
for further <application>repmgrd</application>-specific settings.
</para> </para>
<para>
When <varname>failover</varname> is set to <literal>automatic</literal>, upon detecting failure
of the current primary, <application>repmgrd</application> will execute one of:
</para>
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<simpara>
<varname>promote_command</varname> (if the current server is to become the new primary)
</simpara>
</listitem>
<listitem>
<simpara>
<varname>follow_command</varname> (if the current server needs to follow another server which has
become the new primary)
</simpara>
</listitem>
</itemizedlist>
<note> <note>
<para> <para>
These commands can be any valid shell script which results in one of these &repmgr; will not apply <option>pg_bindir</option> when executing <option>promote_command</option>
two actions happening, but if &repmgr;'s <command>standby follow</command> or or <option>follow_command</option>; these can be user-defined scripts so must always be
<command>standby promote</command> specified with the full path.
commands are not executed (either directly as shown here, or from a script which </para>
performs other actions), the &repmgr; metadata will not be updated and </note>
&repmgr; will no longer function reliably. </listitem>
</varlistentry>
</variablelist>
</sect2>
<sect2 id="repmgrd-automatic-failover-configuration-optional">
<title>Optional configuration for automatic failover</title>
<para>
      The following configuration options can be used to fine-tune automatic failover:
</para>
<variablelist>
<varlistentry>
<indexterm>
<primary>priority</primary>
</indexterm>
<term><option>priority</option></term>
<listitem>
<para>
Indicates a preferred priority (default: <literal>100</literal>) for promoting nodes;
            a value of zero prevents the node from being promoted to primary.
</para>
<para>
Note that the priority setting is only applied if two or more nodes are
determined as promotion candidates; in that case the node with the
higher priority is selected.
</para>
</listitem>
</varlistentry>
<varlistentry>
<indexterm>
<primary>failover_validation_command</primary>
</indexterm>
<term><option>failover_validation_command</option></term>
<listitem>
<para>
User-defined script to execute for an external mechanism to validate the failover
decision made by <application>repmgrd</application>.
</para>
<note>
<para>
This option <emphasis>must</emphasis> be identically configured
on all nodes.
</para>
</note>
<para>
One or both of the following parameter placeholders
should be provided, which will be replaced by repmgrd with the appropriate
value:
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<simpara><literal>%n</literal>: node ID</simpara>
</listitem>
<listitem>
<simpara><literal>%a</literal>: node name</simpara>
</listitem>
</itemizedlist>
</para>
<para>
See also: <link linkend="repmgrd-failover-validation">Failover validation</link>.
</para>
</listitem>
</varlistentry>
<varlistentry>
<indexterm>
<primary>standby_disconnect_on_failover</primary>
</indexterm>
<term><option>standby_disconnect_on_failover</option></term>
<listitem>
<para>
In a failover situation, disconnect the local node's WAL receiver.
</para>
<para>
This option is available from PostgreSQL 9.5 and later.
</para>
<note>
<para>
This option <emphasis>must</emphasis> be identically configured
on all nodes.
</para>
<para>
Additionally the &repmgr; user <emphasis>must</emphasis> be a superuser
for this option.
</para>
<para>
<application>repmgrd</application> will refuse to start if this option is set
but either of these prerequisites is not met.
</para> </para>
</note> </note>
<para> <para>
The <varname>follow_command</varname> should provide the <literal>--upstream-node-id=%n</literal> See also: <link linkend="repmgrd-standby-disconnection-on-failover">Standby disconnection on failover</link>.
option to <command>repmgr standby follow</command>; the <literal>%n</literal> will be replaced by
<application>repmgrd</application> with the ID of the new primary node. If this is not provided, &repmgr;
will attempt to determine the new primary by itself, but if the
original primary comes back online after the new primary is promoted, there is a risk that
<command>repmgr standby follow</command> will result in the node continuing to follow
the original primary.
</para> </para>
</listitem>
</varlistentry>
</variablelist>
<para>
The following options can be used to further fine-tune failover behaviour.
In practice it's unlikely these will need to be changed from their default
values, but are available as configuration options should the need arise.
</para>
<variablelist>
<varlistentry>
<indexterm>
<primary>election_rerun_interval</primary>
</indexterm>
<term><option>election_rerun_interval</option></term>
<listitem>
<para>
If <option>failover_validation_command</option> is set, and the command returns
            an error, pause for the specified number of seconds (default: <literal>15</literal>) before rerunning the election.
</para>
</listitem>
</varlistentry>
<varlistentry>
<indexterm>
<primary>sibling_nodes_disconnect_timeout</primary>
</indexterm>
<term><option>sibling_nodes_disconnect_timeout</option></term>
<listitem>
<para>
If <option>standby_disconnect_on_failover</option> is <literal>true</literal>, the
maximum length of time (in seconds, default: <literal>30</literal>)
to wait for other standbys to confirm they have disconnected their
WAL receivers.
</para>
</listitem>
</varlistentry>
</variablelist>
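An illustrative repmgr.conf fragment combining the optional failover settings above (values are examples only; the validation script path is hypothetical):

```ini
# repmgr.conf (illustrative values)
priority=60                            # nodes with higher priority are preferred
failover_validation_command=/path/to/script.sh %n %a
standby_disconnect_on_failover=true    # must be identical on all nodes
election_rerun_interval=15             # pause before rerunning the election
sibling_nodes_disconnect_timeout=30    # wait for siblings' WAL receiver disconnect
```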
</sect2> </sect2>
<sect2 id="postgresql-service-configuration"> <sect2 id="postgresql-service-configuration">
@@ -175,10 +513,8 @@ repmgrd_service_stop_command='sudo systemctl repmgr11 stop'
in <filename>repmgr.conf</filename>. in <filename>repmgr.conf</filename>.
</para> </para>
<para> <para>
The default monitoring interval is 2 seconds; this value can be explicitly set using: Monitoring data is written at the interval defined by
<programlisting> the option <option>monitor_interval_secs</option> (see above).
monitor_interval_secs=&lt;seconds&gt;</programlisting>
in <filename>repmgr.conf</filename>.
</para> </para>
<para> <para>
For more details on monitoring, see <xref linkend="repmgrd-monitoring">. For more details on monitoring, see <xref linkend="repmgrd-monitoring">.
@@ -228,6 +564,13 @@ repmgrd_service_stop_command='sudo systemctl repmgr11 stop'
</simpara> </simpara>
</listitem> </listitem>
<listitem>
<simpara>
<varname>connection_check_type</varname>
</simpara>
</listitem>
<listitem> <listitem>
<simpara> <simpara>
<varname>conninfo</varname> <varname>conninfo</varname>
@@ -252,6 +595,12 @@ repmgrd_service_stop_command='sudo systemctl repmgr11 stop'
</simpara> </simpara>
</listitem> </listitem>
<listitem>
<simpara>
<varname>failover_validation_command</varname>
</simpara>
</listitem>
<listitem> <listitem>
<simpara> <simpara>
<varname>failover</varname> <varname>failover</varname>
@@ -324,12 +673,30 @@ repmgrd_service_stop_command='sudo systemctl repmgr11 stop'
</simpara> </simpara>
</listitem> </listitem>
<listitem>
<simpara>
<varname>retry_promote_interval_secs</varname>
</simpara>
</listitem>
<listitem> <listitem>
<simpara> <simpara>
<varname>repmgrd_standby_startup_timeout</varname> <varname>repmgrd_standby_startup_timeout</varname>
</simpara> </simpara>
</listitem> </listitem>
<listitem>
<simpara>
<varname>sibling_nodes_disconnect_timeout</varname>
</simpara>
</listitem>
<listitem>
<simpara>
<varname>standby_disconnect_on_failover</varname>
</simpara>
</listitem>
</itemizedlist> </itemizedlist>
<para> <para>

View File

@@ -1,83 +0,0 @@
<chapter id="repmgrd-degraded-monitoring" xreflabel="repmgrd degraded monitoring">
<indexterm>
<primary>repmgrd</primary>
<secondary>degraded monitoring</secondary>
</indexterm>
<title>"degraded monitoring" mode</title>
<para>
In certain circumstances, <application>repmgrd</application> is not able to fulfill its primary mission
of monitoring the node's upstream server. In these cases it enters &quot;degraded monitoring&quot;
mode, where <application>repmgrd</application> remains active but is waiting for the situation
to be resolved.
</para>
<para>
Situations where this happens are:
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<simpara>a failover situation has occurred, no nodes in the primary node's location are visible</simpara>
</listitem>
<listitem>
<simpara>a failover situation has occurred, but no promotion candidate is available</simpara>
</listitem>
<listitem>
<simpara>a failover situation has occurred, but the promotion candidate could not be promoted</simpara>
</listitem>
<listitem>
<simpara>a failover situation has occurred, but the node was unable to follow the new primary</simpara>
</listitem>
<listitem>
<simpara>a failover situation has occurred, but no primary has become available</simpara>
</listitem>
<listitem>
<simpara>a failover situation has occurred, but automatic failover is not enabled for the node</simpara>
</listitem>
<listitem>
<simpara>repmgrd is monitoring the primary node, but it is not available (and no other node has been promoted as primary)</simpara>
</listitem>
</itemizedlist>
</para>
<para>
Example output in a situation where there is only one standby with <literal>failover=manual</literal>,
and the primary node is unavailable (but is later restarted):
<programlisting>
[2017-08-29 10:59:19] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state (automatic failover disabled)
[2017-08-29 10:59:33] [WARNING] unable to connect to upstream node "node1" (node ID: 1)
[2017-08-29 10:59:33] [INFO] checking state of node 1, 1 of 5 attempts
[2017-08-29 10:59:33] [INFO] sleeping 1 seconds until next reconnection attempt
(...)
[2017-08-29 10:59:37] [INFO] checking state of node 1, 5 of 5 attempts
[2017-08-29 10:59:37] [WARNING] unable to reconnect to node 1 after 5 attempts
[2017-08-29 10:59:37] [NOTICE] this node is not configured for automatic failover so will not be considered as promotion candidate
[2017-08-29 10:59:37] [NOTICE] no other nodes are available as promotion candidate
[2017-08-29 10:59:37] [HINT] use "repmgr standby promote" to manually promote this node
[2017-08-29 10:59:37] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in degraded state (automatic failover disabled)
[2017-08-29 10:59:53] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in degraded state (automatic failover disabled)
[2017-08-29 11:00:45] [NOTICE] reconnected to upstream node 1 after 68 seconds, resuming monitoring
[2017-08-29 11:00:57] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state (automatic failover disabled)</programlisting>
</para>
<para>
By default, <literal>repmgrd</literal> will continue in degraded monitoring mode indefinitely.
However a timeout (in seconds) can be set with <varname>degraded_monitoring_timeout</varname>,
after which <application>repmgrd</application> will terminate.
</para>
<note>
<para>
      If <application>repmgrd</application> is monitoring a primary node which has been stopped
and manually restarted as a standby attached to a new primary, it will automatically detect
the status change and update the node record to reflect the node's new status
as an active standby. It will then resume monitoring the node as a standby.
</para>
</note>
</chapter>

View File

@@ -1,80 +0,0 @@
<chapter id="repmgrd-monitoring" xreflabel="Monitoring with repmgrd">
<indexterm>
<primary>repmgrd</primary>
<secondary>monitoring</secondary>
</indexterm>
<indexterm>
<primary>monitoring</primary>
<secondary>with repmgrd</secondary>
</indexterm>
<title>Monitoring with repmgrd</title>
<para>
When <application>repmgrd</application> is running with the option <literal>monitoring_history=true</literal>,
it will constantly write standby node status information to the
<varname>monitoring_history</varname> table, providing a near-real time
overview of replication status on all nodes
in the cluster.
</para>
<para>
The view <literal>replication_status</literal> shows the most recent state
for each node, e.g.:
<programlisting>
repmgr=# select * from repmgr.replication_status;
-[ RECORD 1 ]-------------+------------------------------
primary_node_id | 1
standby_node_id | 2
standby_name | node2
node_type | standby
active | t
last_monitor_time | 2017-08-24 16:28:41.260478+09
last_wal_primary_location | 0/6D57A00
last_wal_standby_location | 0/5000000
replication_lag | 29 MB
replication_time_lag | 00:00:11.736163
apply_lag | 15 MB
communication_time_lag | 00:00:01.365643</programlisting>
</para>
<para>
The interval in which monitoring history is written is controlled by the
configuration parameter <varname>monitor_interval_secs</varname>;
default is 2.
</para>
<para>
As this can generate a large amount of monitoring data in the table
    <literal>repmgr.monitoring_history</literal>, it's advisable to regularly
purge historical data using the <xref linkend="repmgr-cluster-cleanup">
command; use the <literal>-k/--keep-history</literal> option to
    specify how many days' worth of data should be retained.
</para>
<para>
It's possible to use <application>repmgrd</application> to run in monitoring
mode only (without automatic failover capability) for some or all
nodes by setting <literal>failover=manual</literal> in the node's
<filename>repmgr.conf</filename> file. In the event of the node's upstream failing,
no failover action will be taken and the node will require manual intervention to
be reattached to replication. If this occurs, an
<link linkend="event-notifications">event notification</link>
<varname>standby_disconnect_manual</varname> will be created.
</para>
<para>
Note that when a standby node is not streaming directly from its upstream
node, e.g. recovering WAL from an archive, <varname>apply_lag</varname> will always appear as
<literal>0 bytes</literal>.
</para>
<tip>
<para>
If monitoring history is enabled, the contents of the <literal>repmgr.monitoring_history</literal>
table will be replicated to attached standbys. This means there will be a small but
constant stream of replication activity which may not be desirable. To prevent
this, convert the table to an <literal>UNLOGGED</literal> one with:
<programlisting>
ALTER TABLE repmgr.monitoring_history SET UNLOGGED;</programlisting>
</para>
<para>
This will however mean that monitoring history will not be available on
another node following a failover, and the view <literal>repmgr.replication_status</literal>
will not work on standbys.
</para>
</tip>
</chapter>


@@ -1,48 +0,0 @@
<chapter id="repmgrd-network-split" xreflabel="Handling network splits with repmgrd">
<indexterm>
<primary>repmgrd</primary>
<secondary>network splits</secondary>
</indexterm>
<title>Handling network splits with repmgrd</title>
<para>
A common pattern for replication cluster setups is to spread servers over
more than one datacentre. This can provide benefits such as geographically
distributed read replicas and disaster recovery (DR) capability. However,
this also means there is a risk of disconnection at network level between
datacentre locations, which would result in a split-brain scenario if
servers in a secondary datacentre were no longer able to see the primary
in the main datacentre and promoted a standby among themselves.
</para>
<para>
&repmgr; enables provision of &quot;<xref linkend="witness-server">&quot; to
artificially create a quorum of servers in a particular location, ensuring
that nodes in another location will not elect a new primary if they
are unable to see the majority of nodes. However, this approach does not
scale well, particularly with more complex replication setups, e.g.
where the majority of nodes are located outside of the primary datacentre.
It also means the <literal>witness</literal> node needs to be managed as an
extra PostgreSQL instance outside of the main replication cluster, which
adds administrative and programming complexity.
</para>
<para>
<literal>repmgr4</literal> introduces the concept of <literal>location</literal>:
each node is associated with an arbitrary location string (default is
<literal>default</literal>); this is set in <filename>repmgr.conf</filename>, e.g.:
<programlisting>
node_id=1
node_name=node1
conninfo='host=node1 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/postgresql/data'
location='dc1'</programlisting>
</para>
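<para>
A node in a second datacentre would then set a different location string;
for example (hypothetical node name and host):
<programlisting>
node_id=4
node_name=node4
conninfo='host=node4 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/postgresql/data'
location='dc2'</programlisting>
</para>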
<para>
In a failover situation, <application>repmgrd</application> will check if any servers in the
same location as the current primary node are visible. If not, <application>repmgrd</application>
will assume a network interruption and not promote any node in any
other location (it will however enter <link linkend="repmgrd-degraded-monitoring">degraded monitoring</link>
mode until a primary becomes visible).
</para>
</chapter>


@@ -1,38 +0,0 @@
<chapter id="repmgrd-notes" xreflabel="repmgrd notes">
<indexterm>
<primary>repmgrd</primary>
<secondary>notes</secondary>
</indexterm>
<title>repmgrd notes</title>
<sect1 id="repmgrd-wal-replay-pause">
<indexterm>
<primary>repmgrd</primary>
<secondary>paused WAL replay</secondary>
</indexterm>
<title>repmgrd and paused WAL replay</title>
<para>
If WAL replay has been paused (using <command>pg_wal_replay_pause()</command>,
on PostgreSQL 9.6 and earlier <command>pg_xlog_replay_pause()</command>),
in a failover situation <application>repmgrd</application> will
automatically resume WAL replay.
</para>
<para>
This is because if WAL replay is paused, but WAL is pending replay,
PostgreSQL cannot be promoted until WAL replay is resumed.
</para>
<note>
<para>
<command><link linkend="repmgr-standby-promote">repmgr standby promote</link></command>
will refuse to promote a node in this state, as the PostgreSQL
<command>promote</command> command will not be acted on until
WAL replay is resumed, leaving the cluster in a potentially
unstable state. In this case it is up to the user to
decide whether to resume WAL replay.
</para>
</note>
</sect1>
</chapter>

doc/repmgrd-operation.sgml (386 lines) Normal file

@@ -0,0 +1,386 @@
<chapter id="repmgrd-operation" xreflabel="repmgrd operation">
<indexterm>
<primary>repmgrd</primary>
<secondary>operation</secondary>
</indexterm>
<title>repmgrd operation</title>
<sect1 id="repmgrd-pausing">
<indexterm>
<primary>repmgrd</primary>
<secondary>pausing</secondary>
</indexterm>
<indexterm>
<primary>pausing repmgrd</primary>
</indexterm>
<title>Pausing repmgrd</title>
<para>
In normal operation, <application>repmgrd</application> monitors the state of the
PostgreSQL node it is running on, and will take appropriate action if problems
are detected, e.g. (if so configured) promote the node to primary, if the existing
primary has been determined as failed.
</para>
<para>
However, <application>repmgrd</application> is unable to distinguish between
planned outages (such as performing a <link linkend="performing-switchover">switchover</link>
or installing PostgreSQL maintenance releases) and an actual server outage. In versions prior to
&repmgr; 4.2 it was necessary to stop <application>repmgrd</application> on all nodes (or at least
on all nodes where <application>repmgrd</application> is
<link linkend="repmgrd-automatic-failover">configured for automatic failover</link>)
to prevent <application>repmgrd</application> from making unintentional changes to the
replication cluster.
</para>
<para>
From <link linkend="release-4.2">&repmgr; 4.2</link>, <application>repmgrd</application>
can be &quot;paused&quot;, i.e. instructed not to take any action such as performing a failover.
This can be done from any node in the cluster, removing the need to stop/restart
each <application>repmgrd</application> individually.
</para>
<note>
<para>
For major PostgreSQL upgrades, e.g. from PostgreSQL 10 to PostgreSQL 11,
<application>repmgrd</application> should be shut down completely and only started up
once the &repmgr; packages for the new PostgreSQL major version have been installed.
</para>
</note>
<sect2 id="repmgrd-pausing-prerequisites">
<title>Prerequisites for pausing <application>repmgrd</application></title>
<para>
In order to be able to pause/unpause <application>repmgrd</application>, the following
prerequisites must be met:
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<simpara><link linkend="release-4.2">&repmgr; 4.2</link> or later must be installed on all nodes.</simpara>
</listitem>
<listitem>
<simpara>The same major &repmgr; version (e.g. 4.2) must be installed on all nodes (and preferably the same minor version).</simpara>
</listitem>
<listitem>
<simpara>
PostgreSQL on all nodes must be accessible from the node where the
<literal>pause</literal>/<literal>unpause</literal> operation is executed, using the
<varname>conninfo</varname> string shown by <link linkend="repmgr-cluster-show"><command>repmgr cluster show</command></link>.
</simpara>
</listitem>
</itemizedlist>
</para>
<note>
<para>
These conditions are required for normal &repmgr; operation in any case.
</para>
</note>
</sect2>
<sect2 id="repmgrd-pausing-execution">
<title>Pausing/unpausing <application>repmgrd</application></title>
<para>
To pause <application>repmgrd</application>, execute <link linkend="repmgr-daemon-pause"><command>repmgr daemon pause</command></link>, e.g.:
<programlisting>
$ repmgr -f /etc/repmgr.conf daemon pause
NOTICE: node 1 (node1) paused
NOTICE: node 2 (node2) paused
NOTICE: node 3 (node3) paused</programlisting>
</para>
<para>
The state of <application>repmgrd</application> on each node can be checked with
<link linkend="repmgr-daemon-status"><command>repmgr daemon status</command></link>, e.g.:
<programlisting>$ repmgr -f /etc/repmgr.conf daemon status
ID | Name | Role | Status | repmgrd | PID | Paused?
----+-------+---------+---------+---------+------+---------
1 | node1 | primary | running | running | 7851 | yes
2 | node2 | standby | running | running | 7889 | yes
3 | node3 | standby | running | running | 7918 | yes</programlisting>
</para>
<note>
<para>
If executing a switchover with <link linkend="repmgr-standby-switchover"><command>repmgr standby switchover</command></link>,
&repmgr; will automatically pause/unpause <application>repmgrd</application> as part of the switchover process.
</para>
</note>
<para>
If the primary (in this example, <literal>node1</literal>) is stopped, <application>repmgrd</application>
running on one of the standbys (here: <literal>node2</literal>) will react like this:
<programlisting>
[2018-09-20 12:22:21] [WARNING] unable to connect to upstream node "node1" (node ID: 1)
[2018-09-20 12:22:21] [INFO] checking state of node 1, 1 of 5 attempts
[2018-09-20 12:22:21] [INFO] sleeping 1 seconds until next reconnection attempt
...
[2018-09-20 12:22:24] [INFO] sleeping 1 seconds until next reconnection attempt
[2018-09-20 12:22:25] [INFO] checking state of node 1, 5 of 5 attempts
[2018-09-20 12:22:25] [WARNING] unable to reconnect to node 1 after 5 attempts
[2018-09-20 12:22:25] [NOTICE] node is paused
[2018-09-20 12:22:33] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in degraded state
[2018-09-20 12:22:33] [DETAIL] repmgrd paused by administrator
[2018-09-20 12:22:33] [HINT] execute "repmgr daemon unpause" to resume normal failover mode</programlisting>
</para>
<para>
If the primary becomes available again (e.g. following a software upgrade), <application>repmgrd</application>
will automatically reconnect, e.g.:
<programlisting>
[2018-09-20 13:12:41] [NOTICE] reconnected to upstream node 1 after 8 seconds, resuming monitoring</programlisting>
</para>
<para>
To unpause <application>repmgrd</application>, execute <link linkend="repmgr-daemon-unpause"><command>repmgr daemon unpause</command></link>, e.g.:
<programlisting>
$ repmgr -f /etc/repmgr.conf daemon unpause
NOTICE: node 1 (node1) unpaused
NOTICE: node 2 (node2) unpaused
NOTICE: node 3 (node3) unpaused</programlisting>
</para>
<note>
<para>
If the previous primary is no longer accessible when <application>repmgrd</application>
is unpaused, no failover action will be taken. Instead, a new primary must be manually promoted using
<link linkend="repmgr-standby-promote"><command>repmgr standby promote</command></link>,
and any standbys attached to the new primary with
<link linkend="repmgr-standby-follow"><command>repmgr standby follow</command></link>.
</para>
<para>
This is to prevent <link linkend="repmgr-daemon-unpause"><command>repmgr daemon unpause</command></link>
resulting in the automatic promotion of a new primary, which may be a problem particularly
in larger clusters, where <application>repmgrd</application> could select a different promotion
candidate to the one intended by the administrator.
</para>
</note>
</sect2>
<sect2 id="repmgrd-pausing-details">
<title>Details on the <application>repmgrd</application> pausing mechanism</title>
<para>
The pause state of each node is preserved across a PostgreSQL restart.
</para>
<para>
<link linkend="repmgr-daemon-pause"><command>repmgr daemon pause</command></link> and
<link linkend="repmgr-daemon-unpause"><command>repmgr daemon unpause</command></link> can be
executed even if <application>repmgrd</application> is not running; in this case,
<application>repmgrd</application> will start up in whichever pause state has been set.
</para>
<note>
<para>
<link linkend="repmgr-daemon-pause"><command>repmgr daemon pause</command></link> and
<link linkend="repmgr-daemon-unpause"><command>repmgr daemon unpause</command></link>
<emphasis>do not</emphasis> stop/start <application>repmgrd</application>.
</para>
</note>
</sect2>
</sect1>
<sect1 id="repmgrd-wal-replay-pause">
<indexterm>
<primary>repmgrd</primary>
<secondary>paused WAL replay</secondary>
</indexterm>
<title>repmgrd and paused WAL replay</title>
<para>
If WAL replay has been paused (using <command>pg_wal_replay_pause()</command>,
on PostgreSQL 9.6 and earlier <command>pg_xlog_replay_pause()</command>),
in a failover situation <application>repmgrd</application> will
automatically resume WAL replay.
</para>
<para>
This is because if WAL replay is paused, but WAL is pending replay,
PostgreSQL cannot be promoted until WAL replay is resumed.
</para>
<note>
<para>
<command><link linkend="repmgr-standby-promote">repmgr standby promote</link></command>
will refuse to promote a node in this state, as the PostgreSQL
<command>promote</command> command will not be acted on until
WAL replay is resumed, leaving the cluster in a potentially
unstable state. In this case it is up to the user to
decide whether to resume WAL replay.
</para>
</note>
</sect1>
<sect1 id="repmgrd-degraded-monitoring" xreflabel="repmgrd degraded monitoring">
<indexterm>
<primary>repmgrd</primary>
<secondary>degraded monitoring</secondary>
</indexterm>
<indexterm>
<primary>degraded monitoring</primary>
</indexterm>
<title>"degraded monitoring" mode</title>
<para>
In certain circumstances, <application>repmgrd</application> is not able to fulfill its primary mission
of monitoring the node's upstream server. In these cases it enters &quot;degraded monitoring&quot;
mode, where <application>repmgrd</application> remains active but is waiting for the situation
to be resolved.
</para>
<para>
Situations where this happens are:
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<simpara>a failover situation has occurred, but no nodes in the primary node's location are visible</simpara>
</listitem>
<listitem>
<simpara>a failover situation has occurred, but no promotion candidate is available</simpara>
</listitem>
<listitem>
<simpara>a failover situation has occurred, but the promotion candidate could not be promoted</simpara>
</listitem>
<listitem>
<simpara>a failover situation has occurred, but the node was unable to follow the new primary</simpara>
</listitem>
<listitem>
<simpara>a failover situation has occurred, but no primary has become available</simpara>
</listitem>
<listitem>
<simpara>a failover situation has occurred, but automatic failover is not enabled for the node</simpara>
</listitem>
<listitem>
<simpara>repmgrd is monitoring the primary node, but it is not available (and no other node has been promoted as primary)</simpara>
</listitem>
</itemizedlist>
</para>
<para>
Example output in a situation where there is only one standby with <literal>failover=manual</literal>,
and the primary node is unavailable (but is later restarted):
<programlisting>
[2017-08-29 10:59:19] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state (automatic failover disabled)
[2017-08-29 10:59:33] [WARNING] unable to connect to upstream node "node1" (node ID: 1)
[2017-08-29 10:59:33] [INFO] checking state of node 1, 1 of 5 attempts
[2017-08-29 10:59:33] [INFO] sleeping 1 seconds until next reconnection attempt
(...)
[2017-08-29 10:59:37] [INFO] checking state of node 1, 5 of 5 attempts
[2017-08-29 10:59:37] [WARNING] unable to reconnect to node 1 after 5 attempts
[2017-08-29 10:59:37] [NOTICE] this node is not configured for automatic failover so will not be considered as promotion candidate
[2017-08-29 10:59:37] [NOTICE] no other nodes are available as promotion candidate
[2017-08-29 10:59:37] [HINT] use "repmgr standby promote" to manually promote this node
[2017-08-29 10:59:37] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in degraded state (automatic failover disabled)
[2017-08-29 10:59:53] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in degraded state (automatic failover disabled)
[2017-08-29 11:00:45] [NOTICE] reconnected to upstream node 1 after 68 seconds, resuming monitoring
[2017-08-29 11:00:57] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state (automatic failover disabled)</programlisting>
</para>
<para>
By default, <application>repmgrd</application> will continue in degraded monitoring mode indefinitely.
However, a timeout (in seconds) can be set with <varname>degraded_monitoring_timeout</varname>,
after which <application>repmgrd</application> will terminate.
</para>
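<para>
For example, to have <application>repmgrd</application> give up after one hour
of degraded monitoring, set in <filename>repmgr.conf</filename> (value illustrative):
<programlisting>
# terminate repmgrd after 3600 seconds of degraded monitoring
degraded_monitoring_timeout=3600</programlisting>
</para>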
<note>
<para>
If <application>repmgrd</application> is monitoring a primary node which has been stopped
and manually restarted as a standby attached to a new primary, it will automatically detect
the status change and update the node record to reflect the node's new status
as an active standby. It will then resume monitoring the node as a standby.
</para>
</note>
</sect1>
<sect1 id="repmgrd-monitoring" xreflabel="Storing monitoring data">
<indexterm>
<primary>repmgrd</primary>
<secondary>monitoring</secondary>
</indexterm>
<indexterm>
<primary>monitoring</primary>
<secondary>with repmgrd</secondary>
</indexterm>
<title>Storing monitoring data</title>
<para>
When <application>repmgrd</application> is running with the option <literal>monitoring_history=true</literal>,
it will constantly write standby node status information to the
<varname>monitoring_history</varname> table, providing a near-real-time
overview of replication status on all nodes
in the cluster.
</para>
<para>
The view <literal>replication_status</literal> shows the most recent state
for each node, e.g.:
<programlisting>
repmgr=# select * from repmgr.replication_status;
-[ RECORD 1 ]-------------+------------------------------
primary_node_id | 1
standby_node_id | 2
standby_name | node2
node_type | standby
active | t
last_monitor_time | 2017-08-24 16:28:41.260478+09
last_wal_primary_location | 0/6D57A00
last_wal_standby_location | 0/5000000
replication_lag | 29 MB
replication_time_lag | 00:00:11.736163
apply_lag | 15 MB
communication_time_lag | 00:00:01.365643</programlisting>
</para>
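The byte-based lag columns are derived from the <literal>pg_lsn</literal> values above.
As a sketch using the example's values (the high 32-bit word of both LSNs is 0 here, so
simple hex subtraction suffices; this is an illustration, not repmgr's actual implementation):

```shell
# replication_lag ~ last_wal_primary_location - last_wal_standby_location
# LSNs are "high/low" hex pairs; the high word is 0 in both example values.
primary_lsn=$(( 0x6D57A00 ))    # 0/6D57A00
standby_lsn=$(( 0x5000000 ))    # 0/5000000
lag_bytes=$(( primary_lsn - standby_lsn ))
echo "$(( lag_bytes / 1024 / 1024 )) MB"
```

This prints <literal>29 MB</literal>, matching the <varname>replication_lag</varname>
column in the example record.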
<para>
The interval in which monitoring history is written is controlled by the
configuration parameter <varname>monitor_interval_secs</varname>;
the default is 2 seconds.
</para>
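<para>
A minimal <filename>repmgr.conf</filename> fragment enabling monitoring history
(the interval shown is illustrative):
<programlisting>
monitoring_history=true    # write status to repmgr.monitoring_history
monitor_interval_secs=5    # write interval in seconds (default: 2)</programlisting>
</para>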
<para>
As this can generate a large amount of monitoring data in the table
<literal>repmgr.monitoring_history</literal>, it's advisable to regularly
purge historical data using the <xref linkend="repmgr-cluster-cleanup">
command; use the <literal>-k/--keep-history</literal> option to
specify how many days' worth of data should be retained.
</para>
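<para>
Retention can be automated with a cron entry on the primary node; for example
(path, user and retention period are illustrative):
<programlisting>
# purge monitoring history older than 30 days, daily at 02:30
30 2 * * * postgres repmgr -f /etc/repmgr.conf cluster cleanup --keep-history=30</programlisting>
</para>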
<para>
It's possible to run <application>repmgrd</application> in monitoring
mode only (without automatic failover capability) for some or all
nodes by setting <literal>failover=manual</literal> in the node's
<filename>repmgr.conf</filename> file. In the event of the node's upstream failing,
no failover action will be taken and the node will require manual intervention to
be reattached to replication. If this occurs, an
<link linkend="event-notifications">event notification</link>
<varname>standby_disconnect_manual</varname> will be created.
</para>
<para>
Note that when a standby node is not streaming directly from its upstream
node, e.g. recovering WAL from an archive, <varname>apply_lag</varname> will always appear as
<literal>0 bytes</literal>.
</para>
<tip>
<para>
If monitoring history is enabled, the contents of the <literal>repmgr.monitoring_history</literal>
table will be replicated to attached standbys. This means there will be a small but
constant stream of replication activity which may not be desirable. To prevent
this, convert the table to an <literal>UNLOGGED</literal> one with:
<programlisting>
ALTER TABLE repmgr.monitoring_history SET UNLOGGED;</programlisting>
</para>
<para>
This will however mean that monitoring history will not be available on
another node following a failover, and the view <literal>repmgr.replication_status</literal>
will not work on standbys.
</para>
</tip>
</sect1>
</chapter>


@@ -1,4 +1,21 @@
<chapter id="repmgrd-demonstration"> <chapter id="repmgrd-overview" xreflabel="repmgrd overview">
<indexterm>
<primary>repmgrd</primary>
<secondary>overview</secondary>
</indexterm>
<title>repmgrd overview</title>
<para>
<application>repmgrd</application> (&quot;<literal>replication manager daemon</literal>&quot;)
is a management and monitoring daemon which runs
on each node in a replication cluster. It can automate actions such as
failover and updating standbys to follow the new primary, as well as
providing monitoring information about the state of each standby.
</para>
<sect1 id="repmgrd-demonstration">
<title>repmgrd demonstration</title> <title>repmgrd demonstration</title>
<para> <para>
To demonstrate automatic failover, set up a 3-node replication cluster (one primary To demonstrate automatic failover, set up a 3-node replication cluster (one primary
@@ -12,6 +29,13 @@
2 | node2 | standby | running | node1 | default | host=node2 dbname=repmgr user=repmgr 2 | node2 | standby | running | node1 | default | host=node2 dbname=repmgr user=repmgr
3 | node3 | standby | running | node1 | default | host=node3 dbname=repmgr user=repmgr</programlisting> 3 | node3 | standby | running | node1 | default | host=node3 dbname=repmgr user=repmgr</programlisting>
</para> </para>
<tip>
<para>
See section <link linkend="repmgrd-automatic-failover-configuration">Required configuration for automatic failover</link>
for an example of minimal <filename>repmgr.conf</filename> file settings suitable for use with <application>repmgrd</application>.
</para>
</tip>
<para> <para>
Start <application>repmgrd</application> on each standby and verify that it's running by examining the Start <application>repmgrd</application> on each standby and verify that it's running by examining the
log output, which at log level <literal>INFO</literal> will look like this: log output, which at log level <literal>INFO</literal> will look like this:
@@ -93,4 +117,6 @@
2 | node2 | repmgrd_failover_promote | t | 2017-08-24 23:32:13 | node 2 promoted to primary; old primary 1 marked as failed 2 | node2 | repmgrd_failover_promote | t | 2017-08-24 23:32:13 | node 2 promoted to primary; old primary 1 marked as failed
2 | node2 | standby_promote | t | 2017-08-24 23:32:13 | node 2 was successfully promoted to primary</programlisting> 2 | node2 | standby_promote | t | 2017-08-24 23:32:13 | node 2 was successfully promoted to primary</programlisting>
</para> </para>
</sect1>
</chapter> </chapter>


@@ -1,178 +0,0 @@
<chapter id="repmgrd-pausing" xreflabel="Pausing repmgrd">
<indexterm>
<primary>repmgrd</primary>
<secondary>pausing</secondary>
</indexterm>
<indexterm>
<primary>pausing repmgrd</primary>
</indexterm>
<title>Pausing repmgrd</title>
<para>
In normal operation, <application>repmgrd</application> monitors the state of the
PostgreSQL node it is running on, and will take appropriate action if problems
are detected, e.g. (if so configured) promote the node to primary, if the existing
primary has been determined as failed.
</para>
<para>
However, <application>repmgrd</application> is unable to distinguish between
planned outages (such as performing a <link linkend="performing-switchover">switchover</link>
or installing PostgreSQL maintenance releases) and an actual server outage. In versions prior to
&repmgr; 4.2 it was necessary to stop <application>repmgrd</application> on all nodes (or at least
on all nodes where <application>repmgrd</application> is
<link linkend="repmgrd-automatic-failover">configured for automatic failover</link>)
to prevent <application>repmgrd</application> from making unintentional changes to the
replication cluster.
</para>
<para>
From <link linkend="release-4.2">&repmgr; 4.2</link>, <application>repmgrd</application>
can be &quot;paused&quot;, i.e. instructed not to take any action such as performing a failover.
This can be done from any node in the cluster, removing the need to stop/restart
each <application>repmgrd</application> individually.
</para>
<note>
<para>
For major PostgreSQL upgrades, e.g. from PostgreSQL 10 to PostgreSQL 11,
<application>repmgrd</application> should be shut down completely and only started up
once the &repmgr; packages for the new PostgreSQL major version have been installed.
</para>
</note>
<sect1 id="repmgrd-pausing-prerequisites">
<title>Prerequisites for pausing <application>repmgrd</application></title>
<para>
In order to be able to pause/unpause <application>repmgrd</application>, the following
prerequisites must be met:
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<simpara><link linkend="release-4.2">&repmgr; 4.2</link> or later must be installed on all nodes.</simpara>
</listitem>
<listitem>
<simpara>The same major &repmgr; version (e.g. 4.2) must be installed on all nodes (and preferably the same minor version).</simpara>
</listitem>
<listitem>
<simpara>
PostgreSQL on all nodes must be accessible from the node where the
<literal>pause</literal>/<literal>unpause</literal> operation is executed, using the
<varname>conninfo</varname> string shown by <link linkend="repmgr-cluster-show"><command>repmgr cluster show</command></link>.
</simpara>
</listitem>
</itemizedlist>
</para>
<note>
<para>
These conditions are required for normal &repmgr; operation in any case.
</para>
</note>
</sect1>
<sect1 id="repmgrd-pausing-execution">
<title>Pausing/unpausing <application>repmgrd</application></title>
<para>
To pause <application>repmgrd</application>, execute <link linkend="repmgr-daemon-pause"><command>repmgr daemon pause</command></link>, e.g.:
<programlisting>
$ repmgr -f /etc/repmgr.conf daemon pause
NOTICE: node 1 (node1) paused
NOTICE: node 2 (node2) paused
NOTICE: node 3 (node3) paused</programlisting>
</para>
<para>
The state of <application>repmgrd</application> on each node can be checked with
<link linkend="repmgr-daemon-status"><command>repmgr daemon status</command></link>, e.g.:
<programlisting>$ repmgr -f /etc/repmgr.conf daemon status
ID | Name | Role | Status | repmgrd | PID | Paused?
----+-------+---------+---------+---------+------+---------
1 | node1 | primary | running | running | 7851 | yes
2 | node2 | standby | running | running | 7889 | yes
3 | node3 | standby | running | running | 7918 | yes</programlisting>
</para>
<note>
<para>
If executing a switchover with <link linkend="repmgr-standby-switchover"><command>repmgr standby switchover</command></link>,
&repmgr; will automatically pause/unpause <application>repmgrd</application> as part of the switchover process.
</para>
</note>
<para>
If the primary (in this example, <literal>node1</literal>) is stopped, <application>repmgrd</application>
running on one of the standbys (here: <literal>node2</literal>) will react like this:
<programlisting>
[2018-09-20 12:22:21] [WARNING] unable to connect to upstream node "node1" (node ID: 1)
[2018-09-20 12:22:21] [INFO] checking state of node 1, 1 of 5 attempts
[2018-09-20 12:22:21] [INFO] sleeping 1 seconds until next reconnection attempt
...
[2018-09-20 12:22:24] [INFO] sleeping 1 seconds until next reconnection attempt
[2018-09-20 12:22:25] [INFO] checking state of node 1, 5 of 5 attempts
[2018-09-20 12:22:25] [WARNING] unable to reconnect to node 1 after 5 attempts
[2018-09-20 12:22:25] [NOTICE] node is paused
[2018-09-20 12:22:33] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in degraded state
[2018-09-20 12:22:33] [DETAIL] repmgrd paused by administrator
[2018-09-20 12:22:33] [HINT] execute "repmgr daemon unpause" to resume normal failover mode</programlisting>
</para>
<para>
If the primary becomes available again (e.g. following a software upgrade), <application>repmgrd</application>
will automatically reconnect, e.g.:
<programlisting>
[2018-09-20 13:12:41] [NOTICE] reconnected to upstream node 1 after 8 seconds, resuming monitoring</programlisting>
</para>
<para>
To unpause <application>repmgrd</application>, execute <link linkend="repmgr-daemon-unpause"><command>repmgr daemon unpause</command></link>, e.g.:
<programlisting>
$ repmgr -f /etc/repmgr.conf daemon unpause
NOTICE: node 1 (node1) unpaused
NOTICE: node 2 (node2) unpaused
NOTICE: node 3 (node3) unpaused</programlisting>
</para>
<note>
<para>
If the previous primary is no longer accessible when <application>repmgrd</application>
is unpaused, no failover action will be taken. Instead, a new primary must be manually promoted using
<link linkend="repmgr-standby-promote"><command>repmgr standby promote</command></link>,
and any standbys attached to the new primary with
<link linkend="repmgr-standby-follow"><command>repmgr standby follow</command></link>.
</para>
<para>
This is to prevent <link linkend="repmgr-daemon-unpause"><command>repmgr daemon unpause</command></link>
resulting in the automatic promotion of a new primary, which may be a problem particularly
in larger clusters, where <application>repmgrd</application> could select a different promotion
candidate to the one intended by the administrator.
</para>
</note>
<sect2 id="repmgrd-pausing-details">
<title>Details on the <application>repmgrd</application> pausing mechanism</title>
<para>
The pause state of each node is preserved across a PostgreSQL restart.
</para>
<para>
<link linkend="repmgr-daemon-pause"><command>repmgr daemon pause</command></link> and
<link linkend="repmgr-daemon-unpause"><command>repmgr daemon unpause</command></link> can be
executed even if <application>repmgrd</application> is not running; in this case,
<application>repmgrd</application> will start up in whichever pause state has been set.
</para>
<note>
<para>
<link linkend="repmgr-daemon-pause"><command>repmgr daemon pause</command></link> and
<link linkend="repmgr-daemon-unpause"><command>repmgr daemon unpause</command></link>
<emphasis>do not</emphasis> stop/start <application>repmgrd</application>.
</para>
</note>
</sect2>
</sect1>
</chapter>


@@ -1,31 +0,0 @@
<chapter id="repmgrd-witness-server" xreflabel="Using a witness server with repmgrd">
<indexterm>
<primary>repmgrd</primary>
<secondary>witness server</secondary>
</indexterm>
<title>Using a witness server with repmgrd</title>
<para>
In a situation caused, for example, by a network interruption between two
data centres, it's important to avoid a "split-brain" situation where
both sides of the network assume they are the active segment and the
side without an active primary unilaterally promotes one of its standbys.
</para>
<para>
To prevent this situation happening, it's essential to ensure that one
network segment has a "voting majority", so other segments will know
they're in the minority and not attempt to promote a new primary. Where
an odd number of servers exists, this is not an issue. However, if each
network has an even number of nodes, it's necessary to provide some way
of ensuring a majority, which is where the witness server becomes useful.
</para>
<para>
A witness server is not a fully-fledged standby node and is not integrated into
replication, but it effectively represents the "casting vote" when
deciding which network segment has a majority. A witness server can
be set up using <xref linkend="repmgr-witness-register">. Note that it only
makes sense to create a witness server in conjunction with running
<application>repmgrd</application>; the witness server will require its own
<application>repmgrd</application> instance.
</para>
</chapter>


@@ -1 +0,0 @@
<!ENTITY repmgrversion "4.3dev">


@@ -1,12 +1,17 @@
-- complain if script is sourced in psql, rather than via CREATE EXTENSION -- complain if script is sourced in psql, rather than via CREATE EXTENSION
\echo Use "CREATE EXTENSION repmgr" to load this file. \quit \echo Use "CREATE EXTENSION repmgr" to load this file. \quit
CREATE FUNCTION set_primary_last_seen() CREATE FUNCTION set_upstream_last_seen()
RETURNS VOID RETURNS VOID
AS 'MODULE_PATHNAME', 'set_primary_last_seen' AS 'MODULE_PATHNAME', 'set_upstream_last_seen'
LANGUAGE C STRICT; LANGUAGE C STRICT;
CREATE FUNCTION get_primary_last_seen() CREATE FUNCTION get_upstream_last_seen()
RETURNS INT RETURNS INT
AS 'MODULE_PATHNAME', 'get_primary_last_seen' AS 'MODULE_PATHNAME', 'get_upstream_last_seen'
LANGUAGE C STRICT;
CREATE FUNCTION get_wal_receiver_pid()
RETURNS INT
AS 'MODULE_PATHNAME', 'get_wal_receiver_pid'
LANGUAGE C STRICT; LANGUAGE C STRICT;


@@ -118,16 +118,17 @@ CREATE FUNCTION standby_get_last_updated()
   AS 'MODULE_PATHNAME', 'standby_get_last_updated'
   LANGUAGE C STRICT;
-CREATE FUNCTION set_primary_last_seen()
+CREATE FUNCTION set_upstream_last_seen()
   RETURNS VOID
-  AS 'MODULE_PATHNAME', 'set_primary_last_seen'
+  AS 'MODULE_PATHNAME', 'set_upstream_last_seen'
   LANGUAGE C STRICT;
-CREATE FUNCTION get_primary_last_seen()
+CREATE FUNCTION get_upstream_last_seen()
   RETURNS INT
-  AS 'MODULE_PATHNAME', 'get_primary_last_seen'
+  AS 'MODULE_PATHNAME', 'get_upstream_last_seen'
   LANGUAGE C STRICT;
 /* failover functions */
 CREATE FUNCTION notify_follow_primary(INT)
@@ -185,6 +186,15 @@ CREATE FUNCTION repmgrd_is_paused()
   AS 'MODULE_PATHNAME', 'repmgrd_is_paused'
   LANGUAGE C STRICT;
+CREATE FUNCTION get_wal_receiver_pid()
+  RETURNS INT
+  AS 'MODULE_PATHNAME', 'get_wal_receiver_pid'
+  LANGUAGE C STRICT;
+
+/* views */
 CREATE VIEW repmgr.replication_status AS
   SELECT m.primary_node_id, m.standby_node_id, n.node_name AS standby_name,


@@ -1161,6 +1161,7 @@ build_cluster_matrix(t_node_matrix_rec ***matrix_rec_dest, int *name_length, Ite
 	(void) remote_command(host,
 						  runtime_options.remote_user,
 						  command.data,
+						  config_file_options.ssh_options,
 						  &command_output);
 	p = command_output.data;
@@ -1373,6 +1374,7 @@ build_cluster_crosscheck(t_node_status_cube ***dest_cube, int *name_length, Item
 	(void) remote_command(host,
 						  runtime_options.remote_user,
 						  quoted_command.data,
+						  config_file_options.ssh_options,
 						  &command_output);
 	free_conninfo_params(&remote_conninfo);


@@ -201,8 +201,7 @@ do_daemon_status(void)
 		}
 	}
-	repmgrd_info[i]->upstream_last_seen = get_primary_last_seen(cell->node_info->conn);
+	repmgrd_info[i]->upstream_last_seen = get_upstream_last_seen(cell->node_info->conn, cell->node_info->type);
 	if (repmgrd_info[i]->upstream_last_seen < 0)
 	{
 		maxlen_snprintf(repmgrd_info[i]->upstream_last_seen_text, "%s", _("n/a"));
@@ -260,14 +259,24 @@ do_daemon_status(void)
 	{
 		if (runtime_options.output_mode == OM_CSV)
 		{
+			int running = repmgrd_info[i]->running ? 1 : 0;
+			int paused = repmgrd_info[i]->paused ? 1 : 0;
+
+			/* If PostgreSQL is not running, repmgrd status is unknown */
+			if (repmgrd_info[i]->pg_running == false)
+			{
+				running = -1;
+				paused = -1;
+			}
+
 			printf("%i,%s,%s,%i,%i,%i,%i,%i,%i\n",
 				   cell->node_info->node_id,
 				   cell->node_info->node_name,
 				   get_node_type_string(cell->node_info->type),
 				   repmgrd_info[i]->pg_running ? 1 : 0,
-				   repmgrd_info[i]->running ? 1 : 0,
+				   running,
 				   repmgrd_info[i]->pid,
-				   repmgrd_info[i]->paused ? 1 : 0,
+				   paused,
 				   cell->node_info->priority,
 				   repmgrd_info[i]->pid == UNKNOWN_PID
 					   ? -1
@@ -344,18 +353,9 @@ _do_repmgr_pause(bool pause)
 	PGconn *conn = NULL;
 	NodeInfoList nodes = T_NODE_INFO_LIST_INITIALIZER;
 	NodeInfoListCell *cell = NULL;
-	RepmgrdInfo **repmgrd_info;
 	int i;
 	int error_nodes = 0;
-
-	repmgrd_info = (RepmgrdInfo **) pg_malloc0(sizeof(RepmgrdInfo *) * nodes.node_count);
-	if (repmgrd_info == NULL)
-	{
-		log_error(_("unable to allocate memory"));
-		exit(ERR_OUT_OF_MEMORY);
-	}
-
 	/* Connect to local database to obtain cluster connection data */
 	log_verbose(LOG_INFO, _("connecting to database"));
@@ -370,9 +370,6 @@ _do_repmgr_pause(bool pause)
 	for (cell = nodes.head; cell; cell = cell->next)
 	{
-		repmgrd_info[i] = pg_malloc0(sizeof(RepmgrdInfo));
-		repmgrd_info[i]->node_id = cell->node_info->node_id;
-
 		log_verbose(LOG_DEBUG, "pausing node %i (%s)",
 					cell->node_info->node_id,
 					cell->node_info->node_name);


@@ -413,7 +413,7 @@ do_node_status(void)
 			   node_info.upstream_node_name,
 			   node_info.upstream_node_id);
-	get_replication_info(conn, &replication_info);
+	get_replication_info(conn, node_info.type, &replication_info);
 	key_value_list_set_format(&node_status,
 							  "Replication lag",
@@ -2681,6 +2681,48 @@ do_node_rejoin(void)
 }
+
+/*
+ * Currently for testing purposes only, not documented;
+ * use at own risk!
+ */
+void
+do_node_control(void)
+{
+	PGconn *conn = NULL;
+	pid_t wal_receiver_pid = UNKNOWN_PID;
+
+	conn = establish_db_connection(config_file_options.conninfo, true);
+
+	if (runtime_options.disable_wal_receiver == true)
+	{
+		wal_receiver_pid = disable_wal_receiver(conn);
+		PQfinish(conn);
+
+		if (wal_receiver_pid == UNKNOWN_PID)
+			exit(ERR_BAD_CONFIG);
+
+		exit(SUCCESS);
+	}
+
+	if (runtime_options.enable_wal_receiver == true)
+	{
+		wal_receiver_pid = enable_wal_receiver(conn, true);
+		PQfinish(conn);
+
+		if (wal_receiver_pid == UNKNOWN_PID)
+			exit(ERR_BAD_CONFIG);
+
+		exit(SUCCESS);
+	}
+
+	log_error(_("no option provided"));
+	PQfinish(conn);
+}
+
 /*
  * For "internal" use by `node rejoin` on the local node when
  * called by "standby switchover" from the remote node.


@@ -24,6 +24,7 @@ extern void do_node_check(void);
 extern void do_node_rejoin(void);
 extern void do_node_service(void);
+extern void do_node_control(void);
 extern void do_node_help(void);


@@ -605,7 +605,6 @@ do_standby_clone(void)
 		log_error(_("unknown clone mode"));
 	}
-	/* If the backup failed then exit */
 	if (r != SUCCESS)
 	{
@@ -2010,7 +2009,7 @@ do_standby_promote(void)
 	init_replication_info(&replication_info);
-	if (get_replication_info(conn, &replication_info) == false)
+	if (get_replication_info(conn, STANDBY, &replication_info) == false)
 	{
 		log_error(_("unable to retrieve replication information from local node"));
 		PQfinish(conn);
@@ -3263,7 +3262,7 @@ do_standby_switchover(void)
 	ReplInfo replication_info;
 	init_replication_info(&replication_info);
-	if (get_replication_info(local_conn, &replication_info) == false)
+	if (get_replication_info(local_conn, STANDBY, &replication_info) == false)
 	{
 		log_error(_("unable to retrieve replication information from local node"));
 		PQfinish(local_conn);
@@ -3403,6 +3402,7 @@ do_standby_switchover(void)
 	command_success = remote_command(remote_host,
 									 runtime_options.remote_user,
 									 remote_command_str.data,
+									 config_file_options.ssh_options,
 									 &command_output);
 	termPQExpBuffer(&remote_command_str);
@@ -3466,6 +3466,7 @@ do_standby_switchover(void)
 	command_success = remote_command(remote_host,
 									 runtime_options.remote_user,
 									 remote_command_str.data,
+									 config_file_options.ssh_options,
 									 &command_output);
 	termPQExpBuffer(&remote_command_str);
@@ -3693,6 +3694,7 @@ do_standby_switchover(void)
 	command_success = remote_command(remote_host,
 									 runtime_options.remote_user,
 									 remote_command_str.data,
+									 config_file_options.ssh_options,
 									 &command_output);
 	termPQExpBuffer(&remote_command_str);
@@ -3745,6 +3747,7 @@ do_standby_switchover(void)
 	command_success = remote_command(remote_host,
 									 runtime_options.remote_user,
 									 remote_command_str.data,
+									 config_file_options.ssh_options,
 									 &command_output);
 	termPQExpBuffer(&remote_command_str);
@@ -4174,6 +4177,7 @@ do_standby_switchover(void)
 	(void) remote_command(remote_host,
 						  runtime_options.remote_user,
 						  remote_command_str.data,
+						  config_file_options.ssh_options,
 						  &command_output);
 	termPQExpBuffer(&remote_command_str);
@@ -4242,6 +4246,7 @@ do_standby_switchover(void)
 	command_success = remote_command(remote_host,
 									 runtime_options.remote_user,
 									 remote_command_str.data,
+									 config_file_options.ssh_options,
 									 &command_output);
 	termPQExpBuffer(&remote_command_str);
@@ -4321,7 +4326,7 @@ do_standby_switchover(void)
 	for (i = 0; i < config_file_options.wal_receive_check_timeout; i++)
 	{
-		get_replication_info(local_conn, &replication_info);
+		get_replication_info(local_conn, STANDBY, &replication_info);
 		if (replication_info.last_wal_receive_lsn >= remote_last_checkpoint_lsn)
 			break;
@@ -4462,6 +4467,7 @@ do_standby_switchover(void)
 	command_success = remote_command(remote_host,
 									 runtime_options.remote_user,
 									 remote_command_str.data,
+									 config_file_options.ssh_options,
 									 &command_output);
 	termPQExpBuffer(&remote_command_str);
@@ -4570,6 +4576,7 @@ do_standby_switchover(void)
 	success = remote_command(host,
 							 runtime_options.remote_user,
 							 remote_command_str.data,
+							 config_file_options.ssh_options,
 							 &command_output);
 	termPQExpBuffer(&remote_command_str);
@@ -5794,6 +5801,12 @@ run_basebackup(t_node_info *node_record)
 	if (r != 0)
 		return ERR_BAD_BASEBACKUP;
+
+	/* check connections are still available */
+	(void)connection_ping_reconnect(primary_conn);
+
+	if (source_conn != primary_conn)
+		(void)connection_ping_reconnect(source_conn);
+
 	/*
 	 * If replication slots in use, check the created slot is on the correct
 	 * node; the slot will initially get created on the source node, and will
@@ -6396,6 +6409,15 @@ stop_backup:
 		RecordStatus record_status = RECORD_NOT_FOUND;
 		PGconn *upstream_conn = NULL;
+
+		/* check connections are still available */
+		(void)connection_ping_reconnect(primary_conn);
+
+		if (source_conn != primary_conn)
+			(void)connection_ping_reconnect(source_conn);
+
+		(void)connection_ping_reconnect(source_conn);
+
 		record_status = get_node_record(source_conn, upstream_node_id, &upstream_node_record);
 		if (record_status != RECORD_FOUND)


@@ -135,6 +135,8 @@ typedef struct
 	/* following options for internal use */
 	char config_archive_dir[MAXPGPATH];
 	OutputMode output_mode;
+	bool disable_wal_receiver;
+	bool enable_wal_receiver;
 } t_runtime_options;
 #define T_RUNTIME_OPTIONS_INITIALIZER { \
@@ -174,7 +176,7 @@ typedef struct
 	/* "cluster cleanup" options */ \
 	0, \
 	/* following options for internal use */ \
-	"/tmp", OM_TEXT \
+	"/tmp", OM_TEXT, false, false \
 }
@@ -224,8 +226,6 @@ extern int check_server_version(PGconn *conn, char *server_type, bool exit_on_er
 extern void check_93_config(void);
 extern bool create_repmgr_extension(PGconn *conn);
 extern int test_ssh_connection(char *host, char *remote_user);
-extern bool local_command(const char *command, PQExpBufferData *outputbuf);
-extern bool local_command_simple(const char *command, PQExpBufferData *outputbuf);
 extern standy_clone_mode get_standby_clone_mode(void);
@@ -238,8 +238,6 @@ extern char *make_pg_path(const char *file);
 extern void get_superuser_connection(PGconn **conn, PGconn **superuser_conn, PGconn **privileged_conn);
-extern bool remote_command(const char *host, const char *user, const char *command, PQExpBufferData *outputbuf);
 extern void make_remote_repmgr_path(PQExpBufferData *outputbuf, t_node_info *remote_node_record);
 extern void make_repmgrd_path(PQExpBufferData *output_buf);


@@ -31,6 +31,7 @@
 * NODE CHECK
 * NODE REJOIN
 * NODE SERVICE
+ * NODE CONTROL
 *
 * DAEMON STATUS
 * DAEMON PAUSE
@@ -97,8 +98,6 @@ t_node_info target_node_info = T_NODE_INFO_INITIALIZER;
 static ItemList cli_errors = {NULL, NULL};
 static ItemList cli_warnings = {NULL, NULL};
-static bool _local_command(const char *command, PQExpBufferData *outputbuf, bool simple);
-
 int
 main(int argc, char **argv)
 {
@@ -626,7 +625,7 @@ main(int argc, char **argv)
 			break;
-		/*--------------
+		/*---------------
 		 * output options
 		 *---------------
 		 */
@@ -642,6 +641,19 @@ main(int argc, char **argv)
 			runtime_options.optformat = true;
 			break;
+
+		/*---------------------------------
+		 * undocumented options for testing
+		 *----------------------------------
+		 */
+
+		case OPT_DISABLE_WAL_RECEIVER:
+			runtime_options.disable_wal_receiver = true;
+			break;
+
+		case OPT_ENABLE_WAL_RECEIVER:
+			runtime_options.enable_wal_receiver = true;
+			break;
+
 		/*-----------------------------
 		 * options deprecated since 3.3
 		 *-----------------------------
@@ -914,6 +926,8 @@ main(int argc, char **argv)
 			action = NODE_REJOIN;
 		else if (strcasecmp(repmgr_action, "SERVICE") == 0)
 			action = NODE_SERVICE;
+		else if (strcasecmp(repmgr_action, "CONTROL") == 0)
+			action = NODE_CONTROL;
 	}
 	else if (strcasecmp(repmgr_command, "CLUSTER") == 0)
@@ -1337,6 +1351,9 @@ main(int argc, char **argv)
 		case NODE_SERVICE:
 			do_node_service();
 			break;
+		case NODE_CONTROL:
+			do_node_control();
+			break;
 		/* CLUSTER */
 		case CLUSTER_SHOW:
@@ -1905,6 +1922,28 @@ check_cli_parameters(const int action)
 					action_name(action));
 		}
 	}
+
+	/* --disable-wal-receiver / --enable-wal-receiver */
+	if (runtime_options.disable_wal_receiver == true || runtime_options.enable_wal_receiver == true)
+	{
+		switch (action)
+		{
+			case NODE_CONTROL:
+			{
+				if (runtime_options.disable_wal_receiver == true && runtime_options.enable_wal_receiver == true)
+				{
+					item_list_append(&cli_errors,
+									 _("provide either --disable-wal-receiver or --enable-wal-receiver"));
+				}
+			}
+			break;
+
+			default:
+				item_list_append_format(&cli_warnings,
+										_("--disable-wal-receiver / --enable-wal-receiver not effective when executing %s"),
+										action_name(action));
+		}
+	}
 }
@@ -2399,75 +2438,6 @@ test_ssh_connection(char *host, char *remote_user)
-/*
- * Execute a command locally. "outputbuf" should either be an
- * initialised PQexpbuffer, or NULL
- */
-bool
-local_command(const char *command, PQExpBufferData *outputbuf)
-{
-	return _local_command(command, outputbuf, false);
-}
-
-bool
-local_command_simple(const char *command, PQExpBufferData *outputbuf)
-{
-	return _local_command(command, outputbuf, true);
-}
-
-static bool
-_local_command(const char *command, PQExpBufferData *outputbuf, bool simple)
-{
-	FILE *fp = NULL;
-	char output[MAXLEN];
-	int retval = 0;
-	bool success;
-
-	log_verbose(LOG_DEBUG, "executing:\n %s", command);
-
-	if (outputbuf == NULL)
-	{
-		retval = system(command);
-		return (retval == 0) ? true : false;
-	}
-
-	fp = popen(command, "r");
-
-	if (fp == NULL)
-	{
-		log_error(_("unable to execute local command:\n%s"), command);
-		return false;
-	}
-
-	while (fgets(output, MAXLEN, fp) != NULL)
-	{
-		appendPQExpBuffer(outputbuf, "%s", output);
-		if (!feof(fp) && simple == false)
-		{
-			break;
-		}
-	}
-
-	retval = pclose(fp);
-
-	/* */
-	success = (WEXITSTATUS(retval) == 0 || WEXITSTATUS(retval) == 141) ? true : false;
-
-	log_verbose(LOG_DEBUG, "result of command was %i (%i)", WEXITSTATUS(retval), retval);
-
-	if (outputbuf->data != NULL && outputbuf->data[0] != '\0')
-		log_verbose(LOG_DEBUG, "local_command(): output returned was:\n%s", outputbuf->data);
-	else
-		log_verbose(LOG_DEBUG, "local_command(): no output returned");
-
-	return success;
-}
-
 /*
  * get_superuser_connection()
  *
@@ -2674,78 +2644,6 @@ copy_remote_files(char *host, char *remote_user, char *remote_path,
 }
-/*
- * Execute a command via ssh on the remote host.
- *
- * TODO: implement SSH calls using libssh2.
- */
-bool
-remote_command(const char *host, const char *user, const char *command, PQExpBufferData *outputbuf)
-{
-	FILE *fp;
-	char ssh_command[MAXLEN] = "";
-	PQExpBufferData ssh_host;
-	char output[MAXLEN] = "";
-
-	initPQExpBuffer(&ssh_host);
-
-	if (*user != '\0')
-	{
-		appendPQExpBuffer(&ssh_host, "%s@", user);
-	}
-
-	appendPQExpBuffer(&ssh_host, "%s", host);
-
-	maxlen_snprintf(ssh_command,
-					"ssh -o Batchmode=yes %s %s %s",
-					config_file_options.ssh_options,
-					ssh_host.data,
-					command);
-
-	termPQExpBuffer(&ssh_host);
-
-	log_debug("remote_command():\n %s", ssh_command);
-
-	fp = popen(ssh_command, "r");
-
-	if (fp == NULL)
-	{
-		log_error(_("unable to execute remote command:\n %s"), ssh_command);
-		return false;
-	}
-
-	if (outputbuf != NULL)
-	{
-		/* TODO: better error handling */
-		while (fgets(output, MAXLEN, fp) != NULL)
-		{
-			appendPQExpBuffer(outputbuf, "%s", output);
-		}
-	}
-	else
-	{
-		while (fgets(output, MAXLEN, fp) != NULL)
-		{
-			if (!feof(fp))
-			{
-				break;
-			}
-		}
-	}
-
-	pclose(fp);
-
-	if (outputbuf != NULL)
-	{
-		if (outputbuf->data != NULL && outputbuf->data[0] != '\0')
-			log_verbose(LOG_DEBUG, "remote_command(): output returned was:\n%s", outputbuf->data);
-		else
-			log_verbose(LOG_DEBUG, "remote_command(): no output returned");
-	}
-
-	return true;
-}
-
 void


@@ -40,16 +40,17 @@
 #define NODE_CHECK 14
 #define NODE_SERVICE 15
 #define NODE_REJOIN 16
-#define CLUSTER_SHOW 17
-#define CLUSTER_CLEANUP 18
-#define CLUSTER_MATRIX 19
-#define CLUSTER_CROSSCHECK 20
-#define CLUSTER_EVENT 21
-#define DAEMON_STATUS 22
-#define DAEMON_PAUSE 23
-#define DAEMON_UNPAUSE 24
-#define DAEMON_START 25
-#define DAEMON_STOP 26
+#define NODE_CONTROL 17
+#define CLUSTER_SHOW 18
+#define CLUSTER_CLEANUP 19
+#define CLUSTER_MATRIX 20
+#define CLUSTER_CROSSCHECK 21
+#define CLUSTER_EVENT 22
+#define DAEMON_STATUS 23
+#define DAEMON_PAUSE 24
+#define DAEMON_UNPAUSE 25
+#define DAEMON_START 26
+#define DAEMON_STOP 27
 /* command line options without short versions */
 #define OPT_HELP 1001
@@ -97,7 +98,8 @@
 #define OPT_VERSION_NUMBER 1043
 #define OPT_DATA_DIRECTORY_CONFIG 1044
 #define OPT_COMPACT 1045
+#define OPT_DISABLE_WAL_RECEIVER 1046
+#define OPT_ENABLE_WAL_RECEIVER 1047
 /* deprecated since 3.3 */
 #define OPT_DATA_DIR 999
@@ -202,6 +204,10 @@ static struct option long_options[] =
 	/* "cluster cleanup" options */
 	{"keep-history", required_argument, NULL, 'k'},
+
+	/* undocumented options for testing */
+	{"disable-wal-receiver", no_argument, NULL, OPT_DISABLE_WAL_RECEIVER},
+	{"enable-wal-receiver", no_argument, NULL, OPT_ENABLE_WAL_RECEIVER},
 	/* deprecated */
 	{"check-upstream-config", no_argument, NULL, OPT_CHECK_UPSTREAM_CONFIG},
 	{"no-conninfo-password", no_argument, NULL, OPT_NO_CONNINFO_PASSWORD},


@@ -53,6 +53,7 @@
 #include "voting.h"
 #define UNKNOWN_NODE_ID -1
+#define ELECTION_RERUN_NOTIFICATION -2
 #define UNKNOWN_PID -1
 #define TRANCHE_NAME "repmgrd"
@@ -77,7 +78,7 @@ typedef struct repmgrdSharedState
 	char repmgrd_pidfile[MAXPGPATH];
 	bool repmgrd_paused;
 	/* streaming failover */
-	TimestampTz primary_last_seen;
+	TimestampTz upstream_last_seen;
 	NodeVotingStatus voting_status;
 	int current_electoral_term;
 	int candidate_node_id;
@@ -108,11 +109,11 @@ PG_FUNCTION_INFO_V1(standby_set_last_updated);
 Datum standby_get_last_updated(PG_FUNCTION_ARGS);
 PG_FUNCTION_INFO_V1(standby_get_last_updated);
-Datum set_primary_last_seen(PG_FUNCTION_ARGS);
-PG_FUNCTION_INFO_V1(set_primary_last_seen);
-Datum get_primary_last_seen(PG_FUNCTION_ARGS);
-PG_FUNCTION_INFO_V1(get_primary_last_seen);
+Datum set_upstream_last_seen(PG_FUNCTION_ARGS);
+PG_FUNCTION_INFO_V1(set_upstream_last_seen);
+Datum get_upstream_last_seen(PG_FUNCTION_ARGS);
+PG_FUNCTION_INFO_V1(get_upstream_last_seen);
 Datum notify_follow_primary(PG_FUNCTION_ARGS);
 PG_FUNCTION_INFO_V1(notify_follow_primary);
@@ -147,6 +148,8 @@ PG_FUNCTION_INFO_V1(repmgrd_pause);
 Datum repmgrd_is_paused(PG_FUNCTION_ARGS);
 PG_FUNCTION_INFO_V1(repmgrd_is_paused);
+Datum get_wal_receiver_pid(PG_FUNCTION_ARGS);
+PG_FUNCTION_INFO_V1(get_wal_receiver_pid);
 /*
@@ -226,7 +229,7 @@ repmgr_shmem_startup(void)
 		shared_state->repmgrd_paused = false;
 		shared_state->current_electoral_term = 0;
 		/* arbitrary "magic" date to indicate this field hasn't been updated */
-		shared_state->primary_last_seen = POSTGRES_EPOCH_JDATE;
+		shared_state->upstream_last_seen = POSTGRES_EPOCH_JDATE;
 		shared_state->voting_status = VS_NO_VOTE;
 		shared_state->candidate_node_id = UNKNOWN_NODE_ID;
 		shared_state->follow_new_primary = false;
@@ -363,17 +366,14 @@ standby_get_last_updated(PG_FUNCTION_ARGS)
 Datum
-set_primary_last_seen(PG_FUNCTION_ARGS)
+set_upstream_last_seen(PG_FUNCTION_ARGS)
 {
 	if (!shared_state)
 		PG_RETURN_VOID();
 	LWLockAcquire(shared_state->lock, LW_EXCLUSIVE);
-	shared_state->primary_last_seen = GetCurrentTimestamp();
-	elog(INFO,
-		 "primary_last_seen: %s",
-		 timestamptz_to_str( shared_state->primary_last_seen));
+	shared_state->upstream_last_seen = GetCurrentTimestamp();
 	LWLockRelease(shared_state->lock);
@@ -382,7 +382,7 @@ set_primary_last_seen(PG_FUNCTION_ARGS)
 Datum
-get_primary_last_seen(PG_FUNCTION_ARGS)
+get_upstream_last_seen(PG_FUNCTION_ARGS)
 {
 	long secs;
 	int microsecs;
@@ -391,13 +391,9 @@ get_primary_last_seen(PG_FUNCTION_ARGS)
 	if (!shared_state)
 		PG_RETURN_INT32(-1);
-	/* A primary is always visible */
-	if (!RecoveryInProgress())
-		PG_RETURN_INT32(0);
 	LWLockAcquire(shared_state->lock, LW_SHARED);
-	last_seen = shared_state->primary_last_seen;
+	last_seen = shared_state->upstream_last_seen;
 	LWLockRelease(shared_state->lock);
@@ -441,10 +437,18 @@ notify_follow_primary(PG_FUNCTION_ARGS)
 	/* only do something if local_node_id is initialised */
 	if (shared_state->local_node_id != UNKNOWN_NODE_ID)
 	{
+		if (primary_node_id == ELECTION_RERUN_NOTIFICATION)
+		{
+			elog(INFO, "node %i received notification to rerun promotion candidate election",
+				 shared_state->local_node_id);
+		}
+		else
+		{
 			elog(INFO, "node %i received notification to follow node %i",
 				 shared_state->local_node_id,
 				 primary_node_id);
+		}
 		LWLockRelease(shared_state->lock);
 		LWLockAcquire(shared_state->lock, LW_EXCLUSIVE);
@@ -743,3 +747,17 @@ repmgrd_is_paused(PG_FUNCTION_ARGS)
 	PG_RETURN_BOOL(is_paused);
 }
+
+Datum
+get_wal_receiver_pid(PG_FUNCTION_ARGS)
+{
+	int wal_receiver_pid;
+
+	if (!shared_state)
+		PG_RETURN_NULL();
+
+	wal_receiver_pid = WalRcv->pid;
+
+	PG_RETURN_INT32(wal_receiver_pid);
+}


@@ -5,7 +5,13 @@
 # Some configuration items will be set with a default value; this
 # is noted for each item. Where no default value is shown, the
 # parameter will be treated as empty or false.
+#
+# IMPORTANT: string values can be provided as-is, or enclosed in single quotes
+# (but not double-quotes, which will be interpreted as part of the string), e.g.:
+#
+#   node_name=foo
+#   node_name = 'foo'
+#
 # =============================================================================
 # Required configuration items
 # =============================================================================
@@ -281,10 +287,13 @@ ssh_options='-q -o ConnectTimeout=10' # Options to append to "ssh"
 				# manual attention to reattach it to replication
 				# (does not apply to BDR mode)
-#priority=100			# indicate a preferred priority for promoting nodes;
+#priority=100			# indicates a preferred priority for promoting nodes;
 				# a value of zero prevents the node being promoted to primary
 				# (default: 100)
+#connection_check_type=ping	# How to check availability of the upstream node; valid options:
+				#   'ping': use PQping() to check if the node is accepting connections
+				#   'connection': execute a throwaway query on the current connection
 #reconnect_attempts=6		# Number of attempts which will be made to reconnect to an unreachable
 				# primary (or other upstream node)
 #reconnect_interval=10		# Interval between attempts to reconnect to an unreachable
@@ -308,7 +317,7 @@ ssh_options='-q -o ConnectTimeout=10' # Options to append to "ssh"
 #monitoring_history=no		# Whether to write monitoring data to the "montoring_history" table
 #monitor_interval_secs=2	# Interval (in seconds) at which to write monitoring data
 #degraded_monitoring_timeout=-1	# Interval (in seconds) after which repmgrd will terminate if the
-				# server being monitored is no longer available. -1 (default)
+				# server(s) being monitored are no longer available. -1 (default)
 				# disables the timeout completely.
 #async_query_timeout=60		# Interval (in seconds) which repmgrd will wait before
 				# cancelling an asynchronous query.
@@ -319,6 +328,18 @@ ssh_options='-q -o ConnectTimeout=10' # Options to append to "ssh"
 				# "--no-pid-file" will force PID file creation to be skipped.
 				# Note: there is normally no need to set this, particularly if
 				# repmgr was installed from packages.
+#standby_disconnect_on_failover=false	# If "true", in a failover situation wait for all standbys to
+				# disconnect their WAL receivers before electing a new primary
+				# (PostgreSQL 9.5 and later only; repmgr user must be a superuser for this)
+#sibling_nodes_disconnect_timeout=30	# If "standby_disconnect_on_failover" is true, the maximum length of time
+				# (in seconds) to wait for other standbys to confirm they have disconnected their
+				# WAL receivers
+#failover_validation_command=	# Script to execute for an external mechanism to validate the failover
+				# decision made by repmgrd. One or both of the following parameter placeholders
+				# should be provided, which will be replaced by repmgrd with the appropriate
+				# value: %n (node_id), %a (node_name). *Must* be the same on all nodes.
+#election_rerun_interval=15	# if "failover_validation_command" is set, and the command returns
+				# an error, pause the specified amount of seconds before rerunning the election.
 #------------------------------------------------------------------------------
 # service control commands
# service control commands # service control commands


@@ -41,6 +41,7 @@
#include "configfile.h" #include "configfile.h"
#include "dbutils.h" #include "dbutils.h"
#include "log.h" #include "log.h"
#include "sysutils.h"
#define MIN_SUPPORTED_VERSION "9.3" #define MIN_SUPPORTED_VERSION "9.3"
#define MIN_SUPPORTED_VERSION_NUM 90300 #define MIN_SUPPORTED_VERSION_NUM 90300
@@ -59,8 +60,10 @@
#define NO_UPSTREAM_NODE -1
#define UNKNOWN_NODE_ID -1
#define MIN_NODE_ID 1
#define ELECTION_RERUN_NOTIFICATION -2
#define VOTING_TERM_NOT_SET -1
#define ARCHIVE_STATUS_DIR_ERROR -1
#define NO_DEGRADED_MONITORING_ELAPSED -1
#define BDR2_REPLICATION_SET_NAME "repmgr"
@@ -90,6 +93,10 @@
#define DEFAULT_STANDBY_RECONNECT_TIMEOUT 60 /* seconds */
#define DEFAULT_NODE_REJOIN_TIMEOUT 60 /* seconds */
#define DEFAULT_WAL_RECEIVE_CHECK_TIMEOUT 30 /* seconds */
#define DEFAULT_SIBLING_NODES_DISCONNECT_TIMEOUT 30 /* seconds */
#define DEFAULT_ELECTION_RERUN_INTERVAL 15 /* seconds */
#define WALRECEIVER_DISABLE_TIMEOUT_VALUE 86400000 /* milliseconds */
#ifndef RECOVERY_COMMAND_FILE
#define RECOVERY_COMMAND_FILE "recovery.conf"

View File

@@ -1,3 +1,3 @@
#define REPMGR_VERSION_DATE ""
#define REPMGR_VERSION "4.3dev"
#define REPMGR_VERSION "4.3rc1"
#define REPMGR_VERSION_NUM 40300

View File

@@ -23,7 +23,6 @@
#include "repmgrd.h"
#include "repmgrd-physical.h"
typedef enum
{
FAILOVER_STATE_UNKNOWN = -1,
@@ -38,7 +37,8 @@ typedef enum
FAILOVER_STATE_FOLLOWING_ORIGINAL_PRIMARY,
FAILOVER_STATE_NO_NEW_PRIMARY,
FAILOVER_STATE_FOLLOW_FAIL,
FAILOVER_STATE_NODE_NOTIFICATION_ERROR
FAILOVER_STATE_NODE_NOTIFICATION_ERROR,
FAILOVER_STATE_ELECTION_RERUN
} FailoverState;
@@ -47,7 +47,8 @@ typedef enum
ELECTION_NOT_CANDIDATE = -1,
ELECTION_WON,
ELECTION_LOST,
ELECTION_CANCELLED
ELECTION_CANCELLED,
ELECTION_RERUN
} ElectionResult;
@@ -58,12 +59,11 @@ static FailoverState failover_state = FAILOVER_STATE_UNKNOWN;
static int primary_node_id = UNKNOWN_NODE_ID;
static t_node_info upstream_node_info = T_NODE_INFO_INITIALIZER;
static NodeInfoList sibling_nodes = T_NODE_INFO_LIST_INITIALIZER;
static instr_time last_monitoring_update;
static ElectionResult do_election(void);
static ElectionResult do_election(NodeInfoList *sibling_nodes);
static const char *_print_election_result(ElectionResult result);
static FailoverState promote_self(void);
@@ -88,7 +88,9 @@ static void update_monitoring_history(void);
static void handle_sighup(PGconn **conn, t_server_type server_type);
static const char *format_failover_state(FailoverState failover_state);
static const char * format_failover_state(FailoverState failover_state);
static ElectionResult execute_failover_validation_command(t_node_info *node_info);
static void parse_failover_validation_command(const char *template, t_node_info *node_info, PQExpBufferData *out);
void
handle_sigint_physical(SIGNAL_ARGS)
@@ -349,7 +351,7 @@ monitor_streaming_primary(void)
* check that the local node is still primary, otherwise switch
* to standby monitoring
*/
if (check_primary_status(-1) == false)
if (check_primary_status(NO_DEGRADED_MONITORING_ELAPSED) == false)
return;
goto loop;
@@ -421,7 +423,7 @@ monitor_streaming_primary(void)
loop:
/* check node is still primary, if not restart monitoring */
if (check_primary_status(-1) == false)
if (check_primary_status(NO_DEGRADED_MONITORING_ELAPSED) == false)
return;
/* emit "still alive" log message at regular intervals, if requested */
@@ -831,9 +833,9 @@ monitor_streaming_standby(void)
while (true)
{
log_verbose(LOG_DEBUG, "checking %s", upstream_node_info.conninfo);
if (is_server_available(upstream_node_info.conninfo) == true)
if (check_upstream_connection(&upstream_conn, upstream_node_info.conninfo) == true)
{
set_primary_last_seen(local_conn);
set_upstream_last_seen(local_conn);
}
else
{
@@ -1030,8 +1032,9 @@ monitor_streaming_standby(void)
upstream_node_info.node_id,
degraded_monitoring_elapsed);
if (is_server_available(upstream_node_info.conninfo) == true)
if (check_upstream_connection(&upstream_conn, upstream_node_info.conninfo) == true)
{
if (config_file_options.connection_check_type != CHECK_QUERY)
upstream_conn = establish_db_connection(upstream_node_info.conninfo, false);
if (PQstatus(upstream_conn) == CONNECTION_OK)
@@ -1107,6 +1110,7 @@ monitor_streaming_standby(void)
{
int degraded_monitoring_elapsed;
int former_upstream_node_id = local_node_info.upstream_node_id;
NodeInfoList sibling_nodes = T_NODE_INFO_LIST_INITIALIZER;
update_node_record_set_primary(local_conn, local_node_info.node_id);
record_status = get_node_record(local_conn, local_node_info.node_id, &local_node_info);
@@ -1135,6 +1139,8 @@ monitor_streaming_standby(void)
&sibling_nodes);
notify_followers(&sibling_nodes, local_node_info.node_id);
clear_node_info_list(&sibling_nodes);
/* this will restart monitoring in primary mode */
monitoring_state = MS_NORMAL;
return;
@@ -1169,6 +1175,8 @@ monitor_streaming_standby(void)
if (config_file_options.failover == FAILOVER_AUTOMATIC && repmgrd_is_paused(local_conn) == false)
{
NodeInfoList sibling_nodes = T_NODE_INFO_LIST_INITIALIZER;
get_active_sibling_node_records(local_conn,
local_node_info.node_id,
local_node_info.upstream_node_id,
@@ -1604,7 +1612,11 @@ monitor_streaming_witness(void)
while (true)
{
if (is_server_available(upstream_node_info.conninfo) == false)
if (check_upstream_connection(&primary_conn, upstream_node_info.conninfo) == true)
{
set_upstream_last_seen(local_conn);
}
else
{
if (upstream_node_info.node_status == NODE_STATUS_UP)
{
@@ -1693,8 +1705,9 @@ monitor_streaming_witness(void)
upstream_node_info.node_id,
degraded_monitoring_elapsed);
if (is_server_available(upstream_node_info.conninfo) == true)
if (check_upstream_connection(&primary_conn, upstream_node_info.conninfo) == true)
{
if (config_file_options.connection_check_type != CHECK_QUERY)
primary_conn = establish_db_connection(upstream_node_info.conninfo, false);
if (PQstatus(primary_conn) == CONNECTION_OK)
@@ -1742,6 +1755,7 @@ monitor_streaming_witness(void)
NodeInfoListCell *cell;
int follow_node_id = UNKNOWN_NODE_ID;
NodeInfoList sibling_nodes = T_NODE_INFO_LIST_INITIALIZER;
get_active_sibling_node_records(local_conn,
local_node_info.node_id,
@@ -1971,25 +1985,128 @@ static bool
do_primary_failover(void)
{
ElectionResult election_result;
bool final_result = false;
NodeInfoList sibling_nodes = T_NODE_INFO_LIST_INITIALIZER;
/*
* Double-check status of the local connection
*/
check_connection(&local_node_info, &local_conn);
/*
* if requested, disable WAL receiver and wait until WAL receivers on all
* sibling nodes are disconnected
*/
if (config_file_options.standby_disconnect_on_failover == true)
{
NodeInfoListCell *cell = NULL;
NodeInfoList check_sibling_nodes = T_NODE_INFO_LIST_INITIALIZER;
int i;
bool sibling_node_wal_receiver_connected = false;
if (PQserverVersion(local_conn) < 90500)
{
log_warning(_("\"standby_disconnect_on_failover\" specified, but not available for this PostgreSQL version"));
/* TODO: format server version */
log_detail(_("available from PostgreSQL 9.5, this PostgreSQL version is %i"), PQserverVersion(local_conn));
}
else
{
disable_wal_receiver(local_conn);
/*
* Loop through all reachable sibling nodes to determine whether
* they have disabled their WAL receivers.
*
* TODO: do_election() also calls get_active_sibling_node_records(),
* consolidate calls if feasible
*
*/
get_active_sibling_node_records(local_conn,
local_node_info.node_id,
local_node_info.upstream_node_id,
&check_sibling_nodes);
for (i = 0; i < config_file_options.sibling_nodes_disconnect_timeout; i++)
{
for (cell = check_sibling_nodes.head; cell; cell = cell->next)
{
pid_t sibling_wal_receiver_pid;
if (cell->node_info->conn == NULL)
cell->node_info->conn = establish_db_connection(cell->node_info->conninfo, false);
sibling_wal_receiver_pid = (pid_t)get_wal_receiver_pid(cell->node_info->conn);
if (sibling_wal_receiver_pid == UNKNOWN_PID)
{
log_warning(_("unable to query WAL receiver PID on node %i"),
cell->node_info->node_id);
}
else if (sibling_wal_receiver_pid > 0)
{
log_info(_("WAL receiver PID on node %i is %i"),
cell->node_info->node_id,
sibling_wal_receiver_pid);
sibling_node_wal_receiver_connected = true;
}
}
if (sibling_node_wal_receiver_connected == false)
{
log_notice(_("WAL receiver disconnected on all sibling nodes"));
break;
}
log_debug("sleeping %i of max %i seconds (\"sibling_nodes_disconnect_timeout\")",
i + 1, config_file_options.sibling_nodes_disconnect_timeout);
sleep(1);
}
if (sibling_node_wal_receiver_connected == true)
{
/* TODO: prevent any such nodes becoming promotion candidates */
log_warning(_("WAL receiver still connected on at least one sibling node"));
}
else
{
log_info(_("WAL receiver disconnected on all %i sibling nodes"),
check_sibling_nodes.node_count);
}
clear_node_info_list(&check_sibling_nodes);
}
}
/* attempt to initiate voting process */
election_result = do_election();
election_result = do_election(&sibling_nodes);
/* TODO add pre-event notification here */
failover_state = FAILOVER_STATE_UNKNOWN;
log_debug("election result: %s", _print_election_result(election_result));
/* Reenable WAL receiver, if disabled */
if (config_file_options.standby_disconnect_on_failover == true)
{
/* adjust "wal_retrieve_retry_interval" but don't wait for WAL receiver to start */
enable_wal_receiver(local_conn, false);
}
if (election_result == ELECTION_CANCELLED)
{
log_notice(_("election cancelled"));
return false;
}
else if (election_result == ELECTION_RERUN)
{
log_notice(_("promotion candidate election will be rerun"));
/* notify siblings that they should rerun the election too */
notify_followers(&sibling_nodes, ELECTION_RERUN_NOTIFICATION);
failover_state = FAILOVER_STATE_ELECTION_RERUN;
}
else if (election_result == ELECTION_WON)
{
if (sibling_nodes.node_count > 0)
@@ -2052,6 +2169,12 @@ do_primary_failover(void)
&sibling_nodes);
}
/* election rerun */
else if (new_primary_id == ELECTION_RERUN_NOTIFICATION)
{
log_notice(_("received notification from promotion candidate to rerun election"));
failover_state = FAILOVER_STATE_ELECTION_RERUN;
}
else if (config_file_options.failover == FAILOVER_MANUAL)
{
/* automatic failover disabled */
@@ -2113,14 +2236,34 @@ do_primary_failover(void)
/* notify former siblings that they should now follow this node */
notify_followers(&sibling_nodes, local_node_info.node_id);
/* we no longer care about our former siblings */
clear_node_info_list(&sibling_nodes);
/* pass control back down to start_monitoring() */
log_info(_("switching to primary monitoring mode"));
failover_state = FAILOVER_STATE_NONE;
return true;
final_result = true;
break;
case FAILOVER_STATE_ELECTION_RERUN:
/* we no longer care about our former siblings */
clear_node_info_list(&sibling_nodes);
log_notice(_("rerunning election after %i seconds (\"election_rerun_interval\")"),
config_file_options.election_rerun_interval);
sleep(config_file_options.election_rerun_interval);
log_info(_("election rerun will now commence"));
/*
* mark the upstream node as "up" so another election is triggered
* after we fall back to monitoring
*/
upstream_node_info.node_status = NODE_STATUS_UP;
failover_state = FAILOVER_STATE_NONE;
final_result = false;
break;
case FAILOVER_STATE_PRIMARY_REAPPEARED:
@@ -2130,17 +2273,15 @@ do_primary_failover(void)
*/
notify_followers(&sibling_nodes, upstream_node_info.node_id);
/* we no longer care about our former siblings */
clear_node_info_list(&sibling_nodes);
/* pass control back down to start_monitoring() */
log_info(_("resuming standby monitoring mode"));
log_detail(_("original primary \"%s\" (node ID: %i) reappeared"),
upstream_node_info.node_name, upstream_node_info.node_id);
failover_state = FAILOVER_STATE_NONE;
return true;
final_result = true;
break;
case FAILOVER_STATE_FOLLOWED_NEW_PRIMARY:
log_info(_("resuming standby monitoring mode"));
@@ -2148,7 +2289,8 @@ do_primary_failover(void)
upstream_node_info.node_name, upstream_node_info.node_id);
failover_state = FAILOVER_STATE_NONE;
return true;
final_result = true;
break;
case FAILOVER_STATE_FOLLOWING_ORIGINAL_PRIMARY:
log_info(_("resuming standby monitoring mode"));
@@ -2156,13 +2298,15 @@ do_primary_failover(void)
upstream_node_info.node_name, upstream_node_info.node_id);
failover_state = FAILOVER_STATE_NONE;
return true;
final_result = true;
break;
case FAILOVER_STATE_PROMOTION_FAILED:
monitoring_state = MS_DEGRADED;
INSTR_TIME_SET_CURRENT(degraded_monitoring_start);
return false;
final_result = false;
break;
case FAILOVER_STATE_FOLLOW_FAIL:
@@ -2173,29 +2317,41 @@ do_primary_failover(void)
monitoring_state = MS_DEGRADED;
INSTR_TIME_SET_CURRENT(degraded_monitoring_start);
return false;
final_result = false;
break;
case FAILOVER_STATE_REQUIRES_MANUAL_FAILOVER:
log_info(_("automatic failover disabled for this node, manual intervention required"));
monitoring_state = MS_DEGRADED;
INSTR_TIME_SET_CURRENT(degraded_monitoring_start);
return false;
final_result = false;
break;
case FAILOVER_STATE_NO_NEW_PRIMARY:
case FAILOVER_STATE_WAITING_NEW_PRIMARY:
/* pass control back down to start_monitoring() */
return false;
final_result = false;
break;
case FAILOVER_STATE_NODE_NOTIFICATION_ERROR:
case FAILOVER_STATE_LOCAL_NODE_FAILURE:
case FAILOVER_STATE_UNKNOWN:
case FAILOVER_STATE_NONE:
return false;
final_result = false;
break;
default: /* should never reach here */
log_warning(_("unhandled failover state %i"), failover_state);
break;
}
/* should never reach here */
/* we no longer care about our former siblings */
return false;
clear_node_info_list(&sibling_nodes);
return final_result;
}
@@ -2225,7 +2381,7 @@ update_monitoring_history(void)
init_replication_info(&replication_info);
if (get_replication_info(local_conn, &replication_info) == false)
if (get_replication_info(local_conn, STANDBY, &replication_info) == false)
{
log_warning(_("unable to retrieve replication status information, unable to update monitoring history"));
return;
@@ -2683,7 +2839,7 @@ notify_followers(NodeInfoList *standby_nodes, int follow_node_id)
{
NodeInfoListCell *cell;
log_verbose(LOG_NOTICE, "%i followers to notify",
log_info(_("%i followers to notify"),
standby_nodes->node_count);
for (cell = standby_nodes->head; cell; cell = cell->next)
@@ -2703,8 +2859,19 @@ notify_followers(NodeInfoList *standby_nodes, int follow_node_id)
continue;
}
log_verbose(LOG_NOTICE, "notifying node %i to follow node %i",
cell->node_info->node_id, follow_node_id);
if (follow_node_id == ELECTION_RERUN_NOTIFICATION)
{
log_notice(_("notifying node \"%s\" (node ID: %i) to rerun promotion candidate selection"),
cell->node_info->node_name,
cell->node_info->node_id);
}
else
{
log_notice(_("notifying node \"%s\" (node ID: %i) to follow node %i"),
cell->node_info->node_name,
cell->node_info->node_id,
follow_node_id);
}
notify_follow_primary(cell->node_info->conn, follow_node_id);
}
}
@@ -3064,6 +3231,9 @@ _print_election_result(ElectionResult result)
case ELECTION_CANCELLED:
return "CANCELLED";
case ELECTION_RERUN:
return "RERUN";
}
/* should never reach here */
@@ -3074,11 +3244,11 @@ _print_election_result(ElectionResult result)
/*
* Failover decision for nodes attached to the current primary.
*
* NB: this function sets standby_nodes; caller (do_primary_failover)
* NB: this function sets "sibling_nodes"; caller (do_primary_failover)
* expects to be able to read this list
*/
static ElectionResult
do_election(void)
do_election(NodeInfoList *sibling_nodes)
{
int electoral_term = -1;
@@ -3092,6 +3262,9 @@ do_election(void)
ReplInfo local_replication_info;
/* To collate details of nodes with primary visible for logging purposes */
PQExpBufferData nodes_with_primary_visible;
/*
* Check if at least one server in the primary's location is visible; if
* not we'll assume a network split between this node and the primary
@@ -3103,6 +3276,9 @@ do_election(void)
*/
bool primary_location_seen = false;
int nodes_with_primary_still_visible = 0;
electoral_term = get_current_term(local_conn);
if (electoral_term == -1)
@@ -3137,22 +3313,38 @@ do_election(void)
get_active_sibling_node_records(local_conn,
local_node_info.node_id,
upstream_node_info.node_id,
&sibling_nodes);
sibling_nodes);
total_nodes = sibling_nodes.node_count + 1;
total_nodes = sibling_nodes->node_count + 1;
log_debug("do_election(): primary location is \"%s\", standby location is \"%s\"",
if (strncmp(upstream_node_info.location, local_node_info.location, MAXLEN) != 0)
{
log_info(_("primary node \"%s\" (ID: %i) has location \"%s\", this node's location is \"%s\""),
upstream_node_info.node_name,
upstream_node_info.node_id,
upstream_node_info.location,
local_node_info.location);
}
else
{
log_info(_("primary and this node have the same location (\"%s\")"),
local_node_info.location);
}
local_node_info.last_wal_receive_lsn = InvalidXLogRecPtr;
/* fast path if no other standbys (or witness) exists - normally win by default */
if (sibling_nodes.node_count == 0)
if (sibling_nodes->node_count == 0)
{
if (strncmp(upstream_node_info.location, local_node_info.location, MAXLEN) == 0)
{
log_debug("no other nodes - we win by default");
if (config_file_options.failover_validation_command[0] != '\0')
{
return execute_failover_validation_command(&local_node_info);
}
log_info(_("no other sibling nodes - we win by default"));
return ELECTION_WON;
}
else
@@ -3184,7 +3376,7 @@ do_election(void)
}
/* get our lsn */
if (get_replication_info(local_conn, &local_replication_info) == false)
if (get_replication_info(local_conn, STANDBY, &local_replication_info) == false)
{
log_error(_("unable to retrieve replication information for local node"));
return ELECTION_LOST;
@@ -3215,13 +3407,14 @@ do_election(void)
local_node_info.last_wal_receive_lsn = local_replication_info.last_wal_receive_lsn;
log_info(_("local node's last receive lsn: %X/%X"), format_lsn(local_node_info.last_wal_receive_lsn))
log_debug("our last receive lsn: %X/%X", format_lsn(local_node_info.last_wal_receive_lsn));
/* pointer to "winning" node, initially self */
candidate_node = &local_node_info;
for (cell = sibling_nodes.head; cell; cell = cell->next)
initPQExpBuffer(&nodes_with_primary_visible);
for (cell = sibling_nodes->head; cell; cell = cell->next)
{
ReplInfo sibling_replication_info;
@@ -3251,22 +3444,6 @@ do_election(void)
}
}
/* don't interrogate a witness server */
if (cell->node_info->type == WITNESS)
{
log_debug("node %i is witness, not querying state", cell->node_info->node_id);
continue;
}
/* don't check 0-priority nodes */
if (cell->node_info->priority == 0)
{
log_debug("node %i has priority of 0, skipping",
cell->node_info->node_id);
continue;
}
/*
* check if repmgrd running - skip if not
*
@@ -3277,14 +3454,16 @@ do_election(void)
*/
if (repmgrd_get_pid(cell->node_info->conn) == UNKNOWN_PID)
{
log_warning(_("repmgrd not running on node %i, skipping"),
log_warning(_("repmgrd not running on node \"%s\" (ID: %i), skipping"),
cell->node_info->node_name,
cell->node_info->node_id);
continue;
}
if (get_replication_info(cell->node_info->conn, &sibling_replication_info) == false)
if (get_replication_info(cell->node_info->conn, cell->node_info->type, &sibling_replication_info) == false)
{
log_warning(_("unable to retrieve replication information for node %i, skipping"),
log_warning(_("unable to retrieve replication information for node \"%s\" (ID: %i), skipping"),
cell->node_info->node_name,
cell->node_info->node_id);
continue;
}
@@ -3294,19 +3473,65 @@ do_election(void)
{
/*
* Theoretically the repmgrd on the node should have resumed WAL play
* at this point
* at this point.
*/
if (sibling_replication_info.last_wal_receive_lsn > sibling_replication_info.last_wal_replay_lsn)
{
log_warning(_("WAL replay on node %i is paused and WAL is pending replay"),
log_warning(_("WAL replay on node \"%s\" (ID: %i) is paused and WAL is pending replay"),
cell->node_info->node_name,
cell->node_info->node_id);
}
}
/*
* Check if node has seen primary "recently" - if so, we may have "partial primary visibility".
* For now we'll assume the primary is visible if it's been seen less than
* monitor_interval_secs * 2 seconds ago. We may need to adjust this, and/or make the value
* configurable.
*/
if (sibling_replication_info.upstream_last_seen >= 0 && sibling_replication_info.upstream_last_seen < (config_file_options.monitor_interval_secs * 2))
{
nodes_with_primary_still_visible++;
log_notice(_("node %i last saw primary node %i second(s) ago, considering primary still visible"),
cell->node_info->node_id,
sibling_replication_info.upstream_last_seen);
appendPQExpBuffer(&nodes_with_primary_visible,
" - node \"%s\" (ID: %i): %i second(s) ago\n",
cell->node_info->node_name,
cell->node_info->node_id,
sibling_replication_info.upstream_last_seen);
}
else
{
log_info(_("node %i last saw primary node %i second(s) ago"),
cell->node_info->node_id,
sibling_replication_info.upstream_last_seen);
}
/* don't interrogate a witness server */
if (cell->node_info->type == WITNESS)
{
log_debug("node %i is witness, not querying state", cell->node_info->node_id);
continue;
}
/* don't check 0-priority nodes */
if (cell->node_info->priority <= 0)
{
log_info(_("node %i has priority of %i, skipping"),
cell->node_info->node_id,
cell->node_info->priority);
continue;
}
/* get node's last receive LSN - if "higher" than current winner, current node is candidate */
cell->node_info->last_wal_receive_lsn = sibling_replication_info.last_wal_receive_lsn;
log_verbose(LOG_DEBUG, "node %i's last receive LSN is: %X/%X",
log_info(_("last receive LSN for sibling node \"%s\" (ID: %i) is: %X/%X"),
cell->node_info->node_name,
cell->node_info->node_id,
format_lsn(cell->node_info->last_wal_receive_lsn));
@@ -3314,8 +3539,10 @@ do_election(void)
if (cell->node_info->last_wal_receive_lsn > candidate_node->last_wal_receive_lsn)
{
/* other node is ahead */
log_verbose(LOG_DEBUG, "node %i is ahead of current candidate %i",
log_info(_("node \"%s\" (ID: %i) is ahead of current candidate \"%s\" (ID: %i)"),
cell->node_info->node_name,
cell->node_info->node_id,
candidate_node->node_name,
candidate_node->node_id);
candidate_node = cell->node_info;
@@ -3323,33 +3550,44 @@ do_election(void)
/* LSN is same - tiebreak on priority, then node_id */
else if (cell->node_info->last_wal_receive_lsn == candidate_node->last_wal_receive_lsn)
{
log_verbose(LOG_DEBUG, "node %i has same LSN as current candidate %i",
log_info(_("node \"%s\" (ID: %i) has same LSN as current candidate \"%s\" (ID: %i)"),
cell->node_info->node_name,
cell->node_info->node_id,
candidate_node->node_name,
candidate_node->node_id);
if (cell->node_info->priority > candidate_node->priority)
{
log_verbose(LOG_DEBUG, "node %i has higher priority (%i) than current candidate %i (%i)",
log_info(_("node \"%s\" (ID: %i) has higher priority (%i) than current candidate \"%s\" (ID: %i) (%i)"),
cell->node_info->node_name,
cell->node_info->node_id,
cell->node_info->priority,
candidate_node->node_name,
candidate_node->node_id,
candidate_node->priority);
candidate_node = cell->node_info;
}
else if (cell->node_info->priority == candidate_node->priority)
{
if (cell->node_info->node_id < candidate_node->node_id)
{
log_verbose(LOG_DEBUG, "node %i has same priority but lower node_id than current candidate %i",
log_info(_("node \"%s\" (ID: %i) has same priority but lower node_id than current candidate \"%s\" (ID: %i)"),
cell->node_info->node_name,
cell->node_info->node_id,
candidate_node->node_name,
candidate_node->node_id);
candidate_node = cell->node_info;
}
}
else
{
log_verbose(LOG_DEBUG, "node %i has lower priority (%i) than current candidate %i (%i)",
log_info(_("node \"%s\" (ID: %i) has lower priority (%i) than current candidate \"%s\" (ID: %i) (%i)"),
cell->node_info->node_name,
cell->node_info->node_id,
cell->node_info->priority,
candidate_node->node_name,
candidate_node->node_id,
candidate_node->priority);
}
@@ -3370,9 +3608,34 @@ do_election(void)
return ELECTION_CANCELLED;
}
log_debug("visible nodes: %i; total nodes: %i",
if (nodes_with_primary_still_visible > 0)
{
log_info(_("%i nodes can see the primary"),
nodes_with_primary_still_visible);
log_detail(_("following nodes can see the primary:\n%s"),
nodes_with_primary_visible.data);
if (config_file_options.primary_visibility_consensus == true)
{
log_notice(_("cancelling failover as some nodes can still see the primary"));
monitoring_state = MS_DEGRADED;
INSTR_TIME_SET_CURRENT(degraded_monitoring_start);
reset_node_voting_status();
termPQExpBuffer(&nodes_with_primary_visible);
return ELECTION_CANCELLED;
}
}
termPQExpBuffer(&nodes_with_primary_visible);
log_info(_("visible nodes: %i; total nodes: %i; no nodes have seen the primary within the last %i seconds"),
visible_nodes, visible_nodes,
total_nodes); total_nodes,
(config_file_options.monitor_interval_secs * 2));
if (visible_nodes <= (total_nodes / 2.0)) if (visible_nodes <= (total_nodes / 2.0))
{ {
@@ -3387,9 +3650,24 @@ do_election(void)
	return ELECTION_CANCELLED;
}

log_notice(_("promotion candidate is \"%s\" (ID: %i)"),
		   candidate_node->node_name,
		   candidate_node->node_id);

if (candidate_node->node_id == local_node_info.node_id)
{
	/*
	 * If "failover_validation_command" is set, execute that command
	 * and decide the result based on the command's output
	 */
	if (config_file_options.failover_validation_command[0] != '\0')
	{
		return execute_failover_validation_command(candidate_node);
	}

	return ELECTION_WON;
}

return ELECTION_LOST;
}
@@ -3566,6 +3844,8 @@ format_failover_state(FailoverState failover_state)
		return "FOLLOW_FAIL";
	case FAILOVER_STATE_NODE_NOTIFICATION_ERROR:
		return "NODE_NOTIFICATION_ERROR";
case FAILOVER_STATE_ELECTION_RERUN:
return "ELECTION_RERUN";
}

/* should never reach here */
@@ -3600,3 +3880,95 @@ handle_sighup(PGconn **conn, t_server_type server_type)
	got_SIGHUP = false;
}
static ElectionResult
execute_failover_validation_command(t_node_info *node_info)
{
PQExpBufferData failover_validation_command;
PQExpBufferData command_output;
int return_value = -1;
initPQExpBuffer(&failover_validation_command);
initPQExpBuffer(&command_output);
parse_failover_validation_command(config_file_options.failover_validation_command,
node_info,
&failover_validation_command);
log_notice(_("executing \"failover_validation_command\""));
log_detail("%s", failover_validation_command.data);
/* we determine success of the command by the value placed into return_value */
(void) local_command_return_value(failover_validation_command.data,
&command_output,
&return_value);
termPQExpBuffer(&failover_validation_command);
if (command_output.data[0] != '\0')
{
	log_info(_("output returned by failover validation command:\n%s"), command_output.data);
}
else
{
log_info(_("no output returned from command"));
}
termPQExpBuffer(&command_output);
if (return_value != 0)
{
/* create event here? */
log_notice(_("failover validation command returned a non-zero value: %i"),
return_value);
return ELECTION_RERUN;
}
log_notice(_("failover validation command returned zero"));
return ELECTION_WON;
}
static void
parse_failover_validation_command(const char *template, t_node_info *node_info, PQExpBufferData *out)
{
const char *src_ptr;
for (src_ptr = template; *src_ptr; src_ptr++)
{
if (*src_ptr == '%')
{
switch (src_ptr[1])
{
case '%':
/* %%: replace with % */
src_ptr++;
appendPQExpBufferChar(out, *src_ptr);
break;
case 'n':
/* %n: node id */
src_ptr++;
appendPQExpBuffer(out, "%i", node_info->node_id);
break;
case 'a':
/* %a: node name */
src_ptr++;
appendPQExpBufferStr(out, node_info->node_name);
break;
default:
/* otherwise treat the % as not special */
appendPQExpBufferChar(out, *src_ptr);
break;
}
}
else
{
appendPQExpBufferChar(out, *src_ptr);
}
}
return;
}


@@ -383,6 +383,15 @@ main(int argc, char **argv)
 * repmgr has not been properly configured.
 */
/* warn about any settings which might not be relevant for the current PostgreSQL version */
if (config_file_options.standby_disconnect_on_failover == true && PQserverVersion(local_conn) < 90500)
{
log_warning(_("\"standby_disconnect_on_failover\" specified, but not available for this PostgreSQL version"));
/* TODO: format server version */
log_detail(_("available from PostgreSQL 9.5, this PostgreSQL version is %i"), PQserverVersion(local_conn));
}
/* Check the "repmgr" extension is installed */
extension_status = get_repmgr_extension_status(local_conn, &extversions);
@@ -818,6 +827,81 @@ show_help(void)
}
bool
check_upstream_connection(PGconn **conn, const char *conninfo)
{
/* Check the connection status twice in case it changes after reset */
bool twice = false;
if (config_file_options.connection_check_type == CHECK_PING)
return is_server_available(conninfo);
if (config_file_options.connection_check_type == CHECK_CONNECTION)
{
bool success = true;
PGconn *test_conn = PQconnectdb(conninfo);
log_debug("check_upstream_connection(): attempting to connect to \"%s\"", conninfo);
if (PQstatus(test_conn) != CONNECTION_OK)
{
log_warning(_("unable to connect to \"%s\""), conninfo);
success = false;
}
PQfinish(test_conn);
return success;
}
for (;;)
{
if (PQstatus(*conn) != CONNECTION_OK)
{
log_debug("check_upstream_connection(): connection not OK");
if (twice)
return false;
/* reconnect */
PQfinish(*conn);
*conn = PQconnectdb(conninfo);
twice = true;
}
else
{
if (!cancel_query(*conn, config_file_options.async_query_timeout))
goto failed;
if (wait_connection_availability(*conn, config_file_options.async_query_timeout) != 1)
goto failed;
/* execute a simple query to verify connection availability */
if (PQsendQuery(*conn, "SELECT 1") == 0)
{
log_warning(_("unable to send query to upstream"));
log_detail("%s", PQerrorMessage(*conn));
goto failed;
}
if (wait_connection_availability(*conn, config_file_options.async_query_timeout) != 1)
goto failed;
break;
failed:
/* retry once */
if (twice)
return false;
/* reconnect */
PQfinish(*conn);
*conn = PQconnectdb(conninfo);
twice = true;
}
}
return true;
}
void
try_reconnect(PGconn **conn, t_node_info *node_info)
{
@@ -843,8 +927,7 @@ try_reconnect(PGconn **conn, t_node_info *node_info)
			   node_info->node_id, i + 1, max_attempts);

	if (is_server_available_params(&conninfo_params) == true)
	{
		log_notice(_("node has recovered, reconnecting"));
		/*
		 * XXX we should also handle the case where node is pingable but
@@ -874,7 +957,7 @@ try_reconnect(PGconn **conn, t_node_info *node_info)
	if (ping_result != PGRES_TUPLES_OK)
	{
		log_info("original connection no longer available, using new connection");
		close_connection(conn);
		*conn = our_conn;
	}


@@ -23,6 +23,7 @@ extern PGconn *local_conn;
extern bool startup_event_logged;
extern char pid_file[MAXPGPATH];
bool check_upstream_connection(PGconn **conn, const char *conninfo);
void try_reconnect(PGconn **conn, t_node_info *node_info);
int calculate_elapsed(instr_time start_time);
@@ -31,5 +32,4 @@ const char *print_monitoring_state(MonitoringState monitoring_state);
void update_registration(PGconn *conn);
void terminate(int retval);

#endif /* _REPMGRD_H_ */

sysutils.c (new file, 358 lines)

@@ -0,0 +1,358 @@
/*
* sysutils.c
*
* Copyright (c) 2ndQuadrant, 2010-2019
*
* This program is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program. If not, see <http://www.gnu.org/licenses/>.
*/
#include <signal.h>
#include "repmgr.h"
static bool _local_command(const char *command, PQExpBufferData *outputbuf, bool simple, int *return_value);
/*
* Execute a command locally. "outputbuf" should either be an
 * initialised PQExpBuffer, or NULL
*/
bool
local_command(const char *command, PQExpBufferData *outputbuf)
{
return _local_command(command, outputbuf, false, NULL);
}
bool
local_command_return_value(const char *command, PQExpBufferData *outputbuf, int *return_value)
{
return _local_command(command, outputbuf, false, return_value);
}
bool
local_command_simple(const char *command, PQExpBufferData *outputbuf)
{
return _local_command(command, outputbuf, true, NULL);
}
static bool
_local_command(const char *command, PQExpBufferData *outputbuf, bool simple, int *return_value)
{
FILE *fp = NULL;
char output[MAXLEN];
int retval = 0;
bool success;
log_verbose(LOG_DEBUG, "executing:\n %s", command);
if (outputbuf == NULL)
{
retval = system(command);
if (return_value != NULL)
*return_value = WEXITSTATUS(retval);
return (retval == 0) ? true : false;
}
fp = popen(command, "r");
if (fp == NULL)
{
log_error(_("unable to execute local command:\n%s"), command);
return false;
}
while (fgets(output, MAXLEN, fp) != NULL)
{
appendPQExpBufferStr(outputbuf, output);
if (!feof(fp) && simple == false)
{
break;
}
}
retval = pclose(fp);
	/*
	 * Treat exit status 141 (128 + SIGPIPE) as success; it can occur when
	 * we stop reading the command's output before the command finishes.
	 */
	success = (WEXITSTATUS(retval) == 0 || WEXITSTATUS(retval) == 141) ? true : false;
log_verbose(LOG_DEBUG, "result of command was %i (%i)", WEXITSTATUS(retval), retval);
if (return_value != NULL)
*return_value = WEXITSTATUS(retval);
if (outputbuf->data != NULL && outputbuf->data[0] != '\0')
log_verbose(LOG_DEBUG, "local_command(): output returned was:\n%s", outputbuf->data);
else
log_verbose(LOG_DEBUG, "local_command(): no output returned");
return success;
}
/*
* Execute a command via ssh on the remote host.
*
* TODO: implement SSH calls using libssh2.
*/
bool
remote_command(const char *host, const char *user, const char *command, const char *ssh_options, PQExpBufferData *outputbuf)
{
FILE *fp;
char ssh_command[MAXLEN] = "";
PQExpBufferData ssh_host;
char output[MAXLEN] = "";
initPQExpBuffer(&ssh_host);
if (*user != '\0')
{
appendPQExpBuffer(&ssh_host, "%s@", user);
}
appendPQExpBufferStr(&ssh_host, host);
maxlen_snprintf(ssh_command,
"ssh -o Batchmode=yes %s %s %s",
ssh_options,
ssh_host.data,
command);
termPQExpBuffer(&ssh_host);
log_debug("remote_command():\n %s", ssh_command);
fp = popen(ssh_command, "r");
if (fp == NULL)
{
log_error(_("unable to execute remote command:\n %s"), ssh_command);
return false;
}
if (outputbuf != NULL)
{
/* TODO: better error handling */
while (fgets(output, MAXLEN, fp) != NULL)
{
appendPQExpBufferStr(outputbuf, output);
}
}
else
{
while (fgets(output, MAXLEN, fp) != NULL)
{
if (!feof(fp))
{
break;
}
}
}
pclose(fp);
if (outputbuf != NULL)
{
if (outputbuf->data != NULL && outputbuf->data[0] != '\0')
log_verbose(LOG_DEBUG, "remote_command(): output returned was:\n%s", outputbuf->data);
else
log_verbose(LOG_DEBUG, "remote_command(): no output returned");
}
return true;
}
pid_t
disable_wal_receiver(PGconn *conn)
{
char buf[MAXLEN];
int wal_retrieve_retry_interval, new_wal_retrieve_retry_interval;
pid_t wal_receiver_pid = UNKNOWN_PID;
int kill_ret;
int i, j;
int max_retries = 2;
if (is_superuser_connection(conn, NULL) == false)
{
log_error(_("superuser connection required"));
return UNKNOWN_PID;
}
if (get_recovery_type(conn) == RECTYPE_PRIMARY)
{
log_error(_("node is not in recovery"));
log_detail(_("wal receiver can only run on standby nodes"));
return UNKNOWN_PID;
}
wal_receiver_pid = (pid_t)get_wal_receiver_pid(conn);
if (wal_receiver_pid == UNKNOWN_PID)
{
log_warning(_("unable to retrieve wal receiver PID"));
return UNKNOWN_PID;
}
get_pg_setting(conn, "wal_retrieve_retry_interval", buf);
/* TODO: potentially handle atoi error, though unlikely at this point */
wal_retrieve_retry_interval = atoi(buf);
new_wal_retrieve_retry_interval = wal_retrieve_retry_interval + WALRECEIVER_DISABLE_TIMEOUT_VALUE;
if (wal_retrieve_retry_interval < WALRECEIVER_DISABLE_TIMEOUT_VALUE)
{
log_notice(_("setting \"wal_retrieve_retry_interval\" to %i milliseconds"),
new_wal_retrieve_retry_interval);
alter_system_int(conn, "wal_retrieve_retry_interval", new_wal_retrieve_retry_interval);
pg_reload_conf(conn);
}
/*
* If, at this point, the WAL receiver is not running, we don't need to (and indeed can't)
* kill it.
*/
if (wal_receiver_pid == 0)
{
log_warning(_("wal receiver not running"));
return UNKNOWN_PID;
}
/* why 5? */
log_info(_("sleeping 5 seconds"));
sleep(5);
/* see comment below as to why we need a loop here */
for (i = 0; i < max_retries; i++)
{
log_notice(_("killing WAL receiver with PID %i"), (int)wal_receiver_pid);
kill((int)wal_receiver_pid, SIGTERM);
for (j = 0; j < 30; j++)
{
kill_ret = kill(wal_receiver_pid, 0);
if (kill_ret != 0)
{
log_info(_("WAL receiver with pid %i killed"), (int)wal_receiver_pid);
break;
}
sleep(1);
}
/*
* Wait briefly to check that the WAL receiver has indeed gone away -
* for reasons as yet unclear, after a server start/restart, immediately
* after the first time a WAL receiver is killed, a new one is started
* straight away, so we'll need to kill that too.
*/
sleep(1);
wal_receiver_pid = (pid_t)get_wal_receiver_pid(conn);
if (wal_receiver_pid == UNKNOWN_PID || wal_receiver_pid == 0)
break;
}
return wal_receiver_pid;
}
pid_t
enable_wal_receiver(PGconn *conn, bool wait_startup)
{
char buf[MAXLEN];
int wal_retrieve_retry_interval;
pid_t wal_receiver_pid = UNKNOWN_PID;
/* make timeout configurable */
int i, timeout = 30;
if (is_superuser_connection(conn, NULL) == false)
{
log_error(_("superuser connection required"));
return UNKNOWN_PID;
}
if (get_recovery_type(conn) == RECTYPE_PRIMARY)
{
log_error(_("node is not in recovery"));
log_detail(_("wal receiver can only run on standby nodes"));
return UNKNOWN_PID;
}
if (get_pg_setting(conn, "wal_retrieve_retry_interval", buf) == false)
{
log_error(_("unable to retrieve \"wal_retrieve_retry_interval\""));
return UNKNOWN_PID;
}
/* TODO: potentially handle atoi error, though unlikely at this point */
wal_retrieve_retry_interval = atoi(buf);
if (wal_retrieve_retry_interval > WALRECEIVER_DISABLE_TIMEOUT_VALUE)
{
int new_wal_retrieve_retry_interval = wal_retrieve_retry_interval - WALRECEIVER_DISABLE_TIMEOUT_VALUE;
log_notice(_("setting \"wal_retrieve_retry_interval\" to %i ms"),
new_wal_retrieve_retry_interval);
// XXX handle error
alter_system_int(conn,
"wal_retrieve_retry_interval",
new_wal_retrieve_retry_interval);
pg_reload_conf(conn);
}
else
{
// XXX add threshold sanity check
log_info(_("\"wal_retrieve_retry_interval\" is %i, not changing"),
wal_retrieve_retry_interval);
}
if (wait_startup == false)
return UNKNOWN_PID;
for (i = 0; i < timeout; i++)
{
wal_receiver_pid = (pid_t)get_wal_receiver_pid(conn);
if (wal_receiver_pid > 0)
break;
		log_info(_("sleeping %i of maximum %i seconds waiting for WAL receiver to start up"),
				 i + 1, timeout);
sleep(1);
}
if (wal_receiver_pid == UNKNOWN_PID)
{
log_warning(_("unable to retrieve WAL receiver PID"));
return UNKNOWN_PID;
}
else if (wal_receiver_pid == 0)
{
log_error(_("WAL receiver did not start up after %i seconds"), timeout);
return UNKNOWN_PID;
}
log_info(_("WAL receiver started up with PID %i"), (int)wal_receiver_pid);
return wal_receiver_pid;
}

sysutils.h (new file, 32 lines)

@@ -0,0 +1,32 @@
/*
* sysutils.h
* Copyright (c) 2ndQuadrant, 2010-2019
*
* This program is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program. If not, see <http://www.gnu.org/licenses/>.
*/
#ifndef _SYSUTILS_H_
#define _SYSUTILS_H_
extern bool local_command(const char *command, PQExpBufferData *outputbuf);
extern bool local_command_return_value(const char *command, PQExpBufferData *outputbuf, int *return_value);
extern bool local_command_simple(const char *command, PQExpBufferData *outputbuf);
extern bool remote_command(const char *host, const char *user, const char *command, const char *ssh_options, PQExpBufferData *outputbuf);
extern pid_t disable_wal_receiver(PGconn *conn);
extern pid_t enable_wal_receiver(PGconn *conn, bool wait_startup);
#endif /* _SYSUTILS_H_ */