Update repmgrd documentation

Ian Barwick
2017-08-29 11:04:30 +09:00
parent 0b9c538cb9
commit df827c6518
3 changed files with 98 additions and 14 deletions

@@ -1209,7 +1209,6 @@ Additionally the following `repmgrd` options *must* be set in `repmgr.conf`
promote_command='repmgr standby promote -f /etc/repmgr.conf --log-to-file'
follow_command='repmgr standby follow -f /etc/repmgr.conf --log-to-file'
Note that the `--log-to-file` option will cause `repmgr`'s output to be logged to
the destination configured to receive log output for `repmgrd`.
See `repmgr.conf.sample` for further `repmgrd`-specific settings
@@ -1380,12 +1379,62 @@ node, e.g. recovering WAL from an archive, `apply_lag` will always appear as
> constant stream of replication activity which may not be desirable. To prevent
> this, convert the table to an `UNLOGGED` one with:
>
> ALTER TABLE repmgr.monitoring_history SET UNLOGGED ;
> ALTER TABLE repmgr.monitoring_history SET UNLOGGED;
>
> This will however mean that monitoring history will not be available on
> another node following a failover, and the view `replication_status`
> another node following a failover, and the view `repmgr.replication_status`
> will not work on standbys.
### `repmgrd` log output
In normal operation, `repmgrd` remains passive until a connection issue
with either the upstream or local node is detected. Otherwise there's not
much to log, so to confirm `repmgrd` is actually running, it emits log lines
like this at regular intervals:
...
[2017-08-28 08:51:27] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
[2017-08-28 08:51:43] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
...
Timing of these entries is determined by the configuration file setting
`log_status_interval`, which specifies the interval in seconds (default: `300`).
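These periodic status lines can serve as a simple liveness signal. The following is a minimal sketch, not part of `repmgr`: it assumes the log line format shown in the sample above (which may differ between versions), and the function names and the `slack` parameter are hypothetical.

```python
import re
from datetime import datetime

# Pattern for the periodic status line; derived from the sample output above,
# so treat it as an assumption rather than a guaranteed format.
STATUS_RE = re.compile(
    r'^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] \[INFO\] '
    r'node "(?P<node>[^"]+)" \(node ID: (?P<node_id>\d+)\) '
    r'monitoring upstream node "(?P<upstream>[^"]+)" '
    r'\(node ID: (?P<upstream_id>\d+)\) '
    r'in (?P<state>\w+) state'
)


def last_status(lines):
    """Return (timestamp, state) of the most recent status line, or None."""
    result = None
    for line in lines:
        m = STATUS_RE.match(line.strip())
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
            result = (ts, m.group("state"))
    return result


def looks_alive(lines, now, log_status_interval=300, slack=30):
    """True if a status line was logged within the expected interval.

    `log_status_interval` mirrors the repmgr.conf setting (in seconds);
    `slack` is a hypothetical tolerance for scheduling jitter.
    """
    status = last_status(lines)
    if status is None:
        return False
    gap = (now - status[0]).total_seconds()
    return gap <= log_status_interval + slack
```

A monitoring job could feed the tail of the `repmgrd` logfile to `looks_alive()` and raise an alert when it returns `False`.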
### "degraded monitoring" mode
In certain circumstances, `repmgrd` is not able to fulfill its primary mission
of monitoring the node's upstream server. In these cases it enters "degraded
monitoring" mode, where `repmgrd` remains active but waits for the situation
to be resolved.
Cases where this happens are:
- a failover situation has occurred, but no nodes in the primary node's location are visible
- a failover situation has occurred, but no promotion candidate is available
- a failover situation has occurred, but the promotion candidate could not be promoted
- a failover situation has occurred, but the node was unable to follow the new primary
- a failover situation has occurred, but no primary has become available
- a failover situation has occurred, but automatic failover is not enabled for the node
- `repmgrd` is monitoring the primary node, but the primary is not available
Example output in a situation where there is only one standby with `failover=manual`,
and the primary node is unavailable (but is later restarted):
[2017-08-29 10:59:19] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state (automatic failover disabled)
[2017-08-29 10:59:33] [WARNING] unable to connect to upstream node "node1" (node ID: 1)
[2017-08-29 10:59:33] [INFO] checking state of node 1, 1 of 5 attempts
[2017-08-29 10:59:33] [INFO] sleeping 1 seconds until next reconnection attempt
(...)
[2017-08-29 10:59:37] [INFO] checking state of node 1, 5 of 5 attempts
[2017-08-29 10:59:37] [WARNING] unable to reconnect to node 1 after 5 attempts
[2017-08-29 10:59:37] [NOTICE] this node is not configured for automatic failover so will not be considered as promotion candidate
[2017-08-29 10:59:37] [NOTICE] no other nodes are available as promotion candidate
[2017-08-29 10:59:37] [HINT] use "repmgr standby promote" to manually promote this node
[2017-08-29 10:59:37] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in degraded state (automatic failover disabled)
[2017-08-29 10:59:53] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in degraded state (automatic failover disabled)
[2017-08-29 11:00:45] [NOTICE] reconnected to upstream node 1 after 68 seconds, resuming monitoring
[2017-08-29 11:00:57] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state (automatic failover disabled)
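Periods of degraded monitoring can be extracted from a log like the one above. This is a rough sketch under stated assumptions: the message wording ("in degraded state", "reconnected to upstream node") is copied from the sample output and may change between versions, and `degraded_intervals` is a hypothetical helper, not a `repmgr` facility.

```python
import re
from datetime import datetime

# Generic "[timestamp] [LEVEL] message" shape, as seen in the sample output.
LINE_RE = re.compile(
    r'^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] \[(\w+)\] (.*)$'
)


def degraded_intervals(lines):
    """Return a list of (start, end) datetimes for each degraded period.

    A period starts at the first line reporting "in degraded state" and
    ends at the NOTICE announcing reconnection to the upstream node.
    """
    intervals = []
    started = None
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip elisions such as "(...)" and blank lines
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        level, message = m.group(2), m.group(3)
        if started is None and "in degraded state" in message:
            started = ts
        elif (
            started is not None
            and level == "NOTICE"
            and message.startswith("reconnected to upstream node")
        ):
            intervals.append((started, ts))
            started = None
    return intervals
```

Applied to the sample output, the single interval runs from 10:59:37 to 11:00:45, which agrees with the 68 seconds reported in the reconnection notice.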
### `repmgrd` log rotation
To ensure the current `repmgrd` logfile does not grow indefinitely, configure
@@ -1393,7 +1442,7 @@ your system's `logrotate` to do this. Sample configuration to rotate logfiles
weekly with retention for up to 52 weeks and rotation forced if a file grows
beyond 100Mb:
/var/log/postgresql/repmgr-9.5.log {
/var/log/postgresql/repmgr-9.6.log {
missingok
compress
rotate 52
@@ -1418,6 +1467,45 @@ and continue working as normal (even if the upstream standby it's connected
to becomes the master node). If however the node's direct upstream fails,
the "cascaded standby" will attempt to reconnect to that node's parent.
Handling network splits with `repmgrd`
--------------------------------------
A common pattern for replication cluster setups is to spread servers over
more than one datacentre. This can provide benefits such as geographically
distributed read replicas and DR (disaster recovery) capability. However
this also means there is a risk of disconnection at network level between
datacentre locations, which would result in a split-brain scenario if
servers in a secondary datacentre were no longer able to see the primary
in the main datacentre and promoted a standby among themselves.
Previous `repmgr` versions used the concept of a `witness server` to
artificially create a quorum of servers in a particular location, ensuring
that nodes in another location will not elect a new primary if they
are unable to see the majority of nodes. However this approach does not
scale well, particularly with more complex replication setups, e.g.
where the majority of nodes are located outside of the primary datacentre.
It also means the `witness` node needs to be managed as an extra PostgreSQL
instance outside of the main replication cluster, which adds administrative
and programming complexity.
`repmgr4` introduces the concept of `location`: each node is associated
with an arbitrary location string (default is `default`); this is set
in `repmgr.conf`, e.g.:
node_id=1
node_name=node1
conninfo='host=node1 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/postgresql/data'
location='dc1'
In a failover situation, `repmgrd` will check if any servers in the
same location as the current primary node are visible. If not, `repmgrd`
will assume a network interruption and not promote any node in any
other location (it will however enter "degraded monitoring" mode until
a primary becomes visible).
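The location check described above can be modelled roughly as follows. This is a simplified illustration of the rule as stated in this section, not `repmgrd`'s actual implementation; the `Node` class and `may_promote` function are hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class Node:
    node_id: int
    location: str   # value of "location" in that node's repmgr.conf
    visible: bool   # whether the deciding node can currently reach it


def may_promote(nodes, failed_primary):
    """Return True if promotion may proceed after the primary is lost.

    Promotion is only considered when at least one other node in the
    failed primary's location is still visible. If none is, a network
    split is assumed and no node in another location is promoted.
    """
    return any(
        n.visible and n.location == failed_primary.location
        for n in nodes
        if n.node_id != failed_primary.node_id
    )


# View from a standby in "dc2" after losing all contact with "dc1":
primary = Node(node_id=1, location="dc1", visible=False)
cluster = [
    primary,
    Node(node_id=2, location="dc1", visible=False),
    Node(node_id=3, location="dc2", visible=True),
]
split_suspected = not may_promote(cluster, primary)
```

In this scenario `split_suspected` is true, so the standby in `dc2` would stay in "degraded monitoring" mode rather than promote itself.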
Reference
---------

@@ -49,16 +49,12 @@
# Replication settings
#------------------------------------------------------------------------------
#replication_type=physical # Must be one of 'physical' or 'bdr'
#upstream_node_id= # When using cascading replication, a standby
# can connect to another upstream standby node
# which is specified by setting 'upstream_node_id'.
# In that case, the upstream node must exist
# before the new standby can be registered. If
# 'upstream_node_id' is not set, then the standby
# will connect directly to the primary node.
#replication_type=physical # Must be one of 'physical' or 'bdr'.
#location=default # arbitrary string defining the location of the node; this
# is used during failover to check visibility of the
# current primary node. See the 'repmgrd' documentation
# in README.md for further details.
#use_replication_slots=no # whether to use physical replication slots
# NOTE: when using replication slots,

@@ -828,7 +828,7 @@ monitor_streaming_standby(void)
appendPQExpBuffer(
&event_details,
_("reconnect to local node \"%s\" (ID: %i), marking active"),
_("reconnected to local node \"%s\" (ID: %i), marking active"),
local_node_info.node_name,
local_node_info.node_id);