diff --git a/doc/filelist.sgml b/doc/filelist.sgml index 246b7504..0ac4507a 100644 --- a/doc/filelist.sgml +++ b/doc/filelist.sgml @@ -49,6 +49,9 @@ + + + diff --git a/doc/repmgr-cluster-cleanup.sgml b/doc/repmgr-cluster-cleanup.sgml index bafc34f1..df207d0c 100644 --- a/doc/repmgr-cluster-cleanup.sgml +++ b/doc/repmgr-cluster-cleanup.sgml @@ -16,7 +16,8 @@ Monitoring history will only be written if repmgrd is active, and - monitoring_history is set to true in repmgr.conf. + monitoring_history is set to true in + repmgr.conf. diff --git a/doc/repmgr.sgml b/doc/repmgr.sgml index 475f42f6..989efda0 100644 --- a/doc/repmgr.sgml +++ b/doc/repmgr.sgml @@ -81,6 +81,9 @@ &repmgrd-automatic-failover; &repmgrd-configuration; &repmgrd-demonstration; + &repmgrd-cascading-replication; + &repmgrd-network-split; + &repmgrd-degraded-monitoring; &repmgrd-monitoring; diff --git a/doc/repmgrd-cascading-replication.sgml b/doc/repmgrd-cascading-replication.sgml new file mode 100644 index 00000000..b8e00514 --- /dev/null +++ b/doc/repmgrd-cascading-replication.sgml @@ -0,0 +1,17 @@ + + repmgrd and cascading replication + + Cascading replication - where a standby connects to an upstream standby rather than directly to the primary server - was introduced in PostgreSQL 9.2. &repmgr; and repmgrd support cascading replication by keeping track of the relationships between standby servers - each node record stores the node ID of its upstream ("parent") server (except, of course, the primary server). + + In a failover situation where the primary node fails and a top-level standby is promoted, a standby connected to another standby will not be affected and will continue working as normal (even if the upstream standby it is connected to becomes the primary node). If, however, the node's direct upstream fails, the "cascaded standby" will attempt to reconnect to that node's parent.
+ + diff --git a/doc/repmgrd-degraded-monitoring.sgml b/doc/repmgrd-degraded-monitoring.sgml new file mode 100644 index 00000000..adae7236 --- /dev/null +++ b/doc/repmgrd-degraded-monitoring.sgml @@ -0,0 +1,69 @@ + + "degraded monitoring" mode + + In certain circumstances, repmgrd is not able to fulfill its primary mission of monitoring the node's upstream server. In these cases it enters "degraded monitoring" mode, in which repmgrd remains active but waits for the situation to be resolved. + + Situations where this happens are: + + + + a failover situation has occurred, but no nodes in the primary node's location are visible + + + + a failover situation has occurred, but no promotion candidate is available + + + + a failover situation has occurred, but the promotion candidate could not be promoted + + + + a failover situation has occurred, but the node was unable to follow the new primary + + + + a failover situation has occurred, but no primary has become available + + + + a failover situation has occurred, but automatic failover is not enabled for the node + + + + repmgrd is monitoring the primary node, but it is not available + + + + + + Example output in a situation where there is only one standby with failover=manual, and the primary node is unavailable (but is later restarted): + + [2017-08-29 10:59:19] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state (automatic failover disabled) + [2017-08-29 10:59:33] [WARNING] unable to connect to upstream node "node1" (node ID: 1) + [2017-08-29 10:59:33] [INFO] checking state of node 1, 1 of 5 attempts + [2017-08-29 10:59:33] [INFO] sleeping 1 seconds until next reconnection attempt + (...)
+ [2017-08-29 10:59:37] [INFO] checking state of node 1, 5 of 5 attempts + [2017-08-29 10:59:37] [WARNING] unable to reconnect to node 1 after 5 attempts + [2017-08-29 10:59:37] [NOTICE] this node is not configured for automatic failover so will not be considered as promotion candidate + [2017-08-29 10:59:37] [NOTICE] no other nodes are available as promotion candidate + [2017-08-29 10:59:37] [HINT] use "repmgr standby promote" to manually promote this node + [2017-08-29 10:59:37] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in degraded state (automatic failover disabled) + [2017-08-29 10:59:53] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in degraded state (automatic failover disabled) + [2017-08-29 11:00:45] [NOTICE] reconnected to upstream node 1 after 68 seconds, resuming monitoring + [2017-08-29 11:00:57] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state (automatic failover disabled) + + + + By default, repmgrd will continue in degraded monitoring mode indefinitely. However, a timeout (in seconds) can be set with degraded_monitoring_timeout. + + + + diff --git a/doc/repmgrd-network-split.sgml b/doc/repmgrd-network-split.sgml new file mode 100644 index 00000000..934bf0b8 --- /dev/null +++ b/doc/repmgrd-network-split.sgml @@ -0,0 +1,43 @@ + + Handling network splits with repmgrd + + A common pattern for replication cluster setups is to spread servers over more than one data centre. This can provide benefits such as geographically distributed read replicas and disaster recovery (DR) capability. However, this also means there is a risk of disconnection at the network level between data centre locations, which would result in a split-brain scenario if servers in a secondary data centre were no longer able to see the primary in the main data centre and promoted a standby among themselves.
+ + + Previous &repmgr; versions used the concept of a "witness server" to artificially create a quorum of servers in a particular location, ensuring that nodes in another location would not elect a new primary if they were unable to see the majority of nodes. However, this approach does not scale well, particularly with more complex replication setups, e.g. where the majority of nodes are located outside the primary data centre. It also means the witness node needs to be managed as an extra PostgreSQL instance outside of the main replication cluster, which adds administrative and programming complexity. + + repmgr 4 introduces the concept of location: each node is associated with an arbitrary location string (the default is default); this is set in repmgr.conf, e.g.: + + node_id=1 + node_name=node1 + conninfo='host=node1 user=repmgr dbname=repmgr connect_timeout=2' + data_directory='/var/lib/postgresql/data' + location='dc1' + + + In a failover situation, repmgrd will check whether any servers in the same location as the current primary node are visible. If not, repmgrd will assume a network interruption and will not promote any node in any other location (it will however enter degraded monitoring mode until a primary becomes visible). + + +
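The behaviour described in the new sections above is driven entirely by repmgr.conf settings. As an illustrative sketch (not part of the patch itself - the node name and values here are hypothetical), a standby in a secondary data centre might combine the location and degraded-monitoring parameters like this:

```ini
# Hypothetical repmgr.conf for a standby in a secondary data centre.
# Parameter names are those documented above; values are illustrative.
node_id=3
node_name=node3
conninfo='host=node3 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/postgresql/data'

# Nodes in 'dc2' will not promote a new primary while no node in the
# current primary's location is visible (see "Handling network splits").
location='dc2'

# Instead of remaining in degraded monitoring mode indefinitely
# (the default), give up after one hour (value in seconds).
degraded_monitoring_timeout=3600
```

With this configuration, a network split between the data centres leaves node3 in degraded monitoring mode rather than triggering a promotion, and repmgrd terminates if the primary has not become visible again within the timeout.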