From e7bb3e9d50807f4ec87fc3fb4c64f115d6ada8aa Mon Sep 17 00:00:00 2001 From: Ian Barwick Date: Fri, 22 Sep 2017 16:29:14 +0900 Subject: [PATCH] Add section on promoting standby --- doc/cloning-standbys.sgml | 2 +- doc/command-reference.sgml | 206 ++++++++++++++++++++++++++++++++++++- doc/filelist.sgml | 1 + doc/promoting-standby.sgml | 74 +++++++++++++ doc/repmgr.sgml | 1 + 5 files changed, 282 insertions(+), 2 deletions(-) create mode 100644 doc/promoting-standby.sgml diff --git a/doc/cloning-standbys.sgml b/doc/cloning-standbys.sgml index 05b513a8..2c1fe095 100644 --- a/doc/cloning-standbys.sgml +++ b/doc/cloning-standbys.sgml @@ -308,7 +308,7 @@ After starting the standby, the cluster will look like this, showing that node3 - is attached to node3, not the primary (node1). + is attached to node2, not the primary (node1). $ repmgr -f /etc/repmgr.conf cluster show ID | Name | Role | Status | Upstream | Location | Connection string diff --git a/doc/command-reference.sgml b/doc/command-reference.sgml index 2627c5a5..6fb419f9 100644 --- a/doc/command-reference.sgml +++ b/doc/command-reference.sgml @@ -148,6 +148,26 @@ + + + repmgr standby promote + + repmgr standby promote + + Promotes a standby to a primary if the current primary has failed. This + command requires a valid repmgr.conf file for the standby, either + specified explicitly with -f/--config-file or located in a + default location; no additional arguments are required. + + + If the standby promotion succeeds, the server will not need to be + restarted. However, any other standbys will need to follow the new server, + by using ; if repmgrd is active, it will + handle this automatically. + + + + repmgr standby follow repmgr standby follow @@ -170,6 +190,7 @@ + repmgr node rejoin repmgr node rejoin @@ -179,8 +200,191 @@ Enables a dormant (stopped) node to be rejoined to the replication cluster.
- This can optionally use `pg_rewind` to re-integrate a node which has diverged + This can optionally use pg_rewind to re-integrate a node which has diverged from the rest of the cluster, typically a failed primary. + + + + repmgr cluster show + + repmgr cluster show + + Displays information about each active node in the replication cluster. This + command polls each registered server and shows its role (primary / + standby / bdr) and status. It polls each server + directly and can be run on any node in the cluster; this is also useful when analyzing + connectivity from a particular node. + + + This command requires either a valid repmgr.conf file or a database + connection string to one of the registered nodes; no additional arguments are needed. + + + + Example: + + $ repmgr -f /etc/repmgr.conf cluster show + + ID | Name | Role | Status | Upstream | Location | Connection string + ----+-------+---------+-----------+----------+----------+----------------------------------------- + 1 | node1 | primary | * running | | default | host=db_node1 dbname=repmgr user=repmgr + 2 | node2 | standby | running | node1 | default | host=db_node2 dbname=repmgr user=repmgr + 3 | node3 | standby | running | node1 | default | host=db_node3 dbname=repmgr user=repmgr + + + + To show database connection errors when polling nodes, run the command in + --verbose mode. + + + The cluster show command accepts an optional parameter --csv, which + outputs the replication cluster's status in a simple CSV format, suitable for + parsing by scripts: + + $ repmgr -f /etc/repmgr.conf cluster show --csv + 1,-1,-1 + 2,0,0 + 3,0,1 + + + The columns have the following meanings: + + + + node ID + + + availability (0 = available, -1 = unavailable) + + + recovery state (0 = not in recovery, 1 = in recovery, -1 = unknown) + + + + + + + Note that availability is tested by connecting from the node where + repmgr cluster show is executed; an unavailable result does not necessarily imply the node + itself is down.
See and to get + a better overview of connections between nodes. + + + + + repmgr cluster matrix + + repmgr cluster matrix + + repmgr cluster matrix runs repmgr cluster show on each + node and arranges the results in a matrix, recording success or failure. + + repmgr cluster matrix requires a valid repmgr.conf + file on each node. Additionally, passwordless ssh connections are required between + all nodes. + + Example 1 (all nodes up): + + $ repmgr -f /etc/repmgr.conf cluster matrix + + Name | Id | 1 | 2 | 3 + -------+----+----+----+---- + node1 | 1 | * | * | * + node2 | 2 | * | * | * + node3 | 3 | * | * | * + + Example 2 (node1 and node2 up, node3 down): + + $ repmgr -f /etc/repmgr.conf cluster matrix + + Name | Id | 1 | 2 | 3 + -------+----+----+----+---- + node1 | 1 | * | * | x + node2 | 2 | * | * | x + node3 | 3 | ? | ? | ? + + + Each row corresponds to one server, and indicates the result of + testing an outbound connection from that server. + + Since node3 is down, all the entries in its row are filled with + ?, meaning that we cannot test its outbound connections. + + The other two nodes are up; the corresponding rows have x in the + column corresponding to node3, meaning that inbound connections to + that node have failed, and * in the columns corresponding to + node1 and node2, meaning that inbound connections + to these nodes have succeeded. + + Example 3 (all nodes up, firewall dropping packets originating + from node1 and directed to port 5432 on node3) - + running repmgr cluster matrix from node1 gives the following output: + + $ repmgr -f /etc/repmgr.conf cluster matrix + + Name | Id | 1 | 2 | 3 + -------+----+----+----+---- + node1 | 1 | * | * | x + node2 | 2 | * | * | * + node3 | 3 | ? | ? | ?
+ + + Note that this may take some time, depending on the connect_timeout + setting in the node conninfo strings; the default is + 1 minute, which means that without modification the above + command would take around 2 minutes to run (see the comments elsewhere about setting + connect_timeout). + + + The matrix tells us that we cannot connect from node1 to node3, + and that (therefore) we don't know the state of any outbound + connection from node3. + + + In this case, the command will produce a more + useful result. + + + + + + + repmgr cluster crosscheck + + repmgr cluster crosscheck + + repmgr cluster crosscheck is similar to , + but cross-checks connections between each combination of nodes. In "Example 3" in + we have no information about the state of node3. + However, by running repmgr cluster crosscheck it's possible to get a better + overview of the cluster situation: + + $ repmgr -f /etc/repmgr.conf cluster crosscheck + + Name | Id | 1 | 2 | 3 + -------+----+----+----+---- + node1 | 1 | * | * | x + node2 | 2 | * | * | * + node3 | 3 | * | * | * + + + What happened is that repmgr cluster crosscheck merged its own + repmgr cluster matrix with the repmgr cluster matrix + output from node2; the latter is able to connect to node3 + and therefore determine the state of outbound connections from that node. + + + + + diff --git a/doc/filelist.sgml b/doc/filelist.sgml index e4e2a6b8..5d16624a 100644 --- a/doc/filelist.sgml +++ b/doc/filelist.sgml @@ -40,6 +40,7 @@ + diff --git a/doc/promoting-standby.sgml b/doc/promoting-standby.sgml new file mode 100644 index 00000000..de515951 --- /dev/null +++ b/doc/promoting-standby.sgml @@ -0,0 +1,74 @@ + + Promoting a standby server with repmgr + + If a primary server fails or needs to be removed from the replication cluster, + a new primary server must be designated to ensure the cluster continues + to function correctly. This can be done with , + which promotes the standby on the current server to primary.
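Before promoting a standby, a script might first confirm which nodes are reachable. The --csv output of repmgr cluster show described above lends itself to this; below is a minimal sketch, assuming the three-column format documented earlier (the helper name describe_node and the hard-coded sample rows are illustrative, not part of repmgr):

```shell
# Hypothetical helper: interpret one row of "repmgr cluster show --csv".
# Columns: node ID, availability (0 = available, -1 = unavailable),
# recovery state (0 = not in recovery, 1 = in recovery, -1 = unknown).
describe_node() {
    node_id=${1%%,*}        # first field
    rest=${1#*,}
    available=${rest%%,*}   # second field
    recovery=${rest#*,}     # third field
    if [ "$available" = "-1" ]; then
        echo "node $node_id: unreachable"
    elif [ "$recovery" = "0" ]; then
        echo "node $node_id: available (primary)"
    else
        echo "node $node_id: available (standby)"
    fi
}

# In practice the rows would come from:
#   repmgr -f /etc/repmgr.conf cluster show --csv
describe_node "1,-1,-1"   # node 1: unreachable
describe_node "2,0,0"     # node 2: available (primary)
describe_node "3,0,1"     # node 3: available (standby)
```

Note that this sketch treats an unknown recovery state (-1) on a reachable node as a standby; a production script would want to handle that case explicitly.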
+ + + + To demonstrate this, set up a replication cluster with a primary and two attached + standby servers so that the cluster looks like this: + + $ repmgr -f /etc/repmgr.conf cluster show + ID | Name | Role | Status | Upstream | Location | Connection string + ----+-------+---------+-----------+----------+----------+-------------------------------------- + 1 | node1 | primary | * running | | default | host=node1 dbname=repmgr user=repmgr + 2 | node2 | standby | running | node1 | default | host=node2 dbname=repmgr user=repmgr + 3 | node3 | standby | running | node1 | default | host=node3 dbname=repmgr user=repmgr + + + Stop the current primary, e.g. with: + + $ pg_ctl -D /var/lib/postgresql/data -m fast stop + + + At this point the replication cluster will be in a partially disabled state, with + both standbys accepting read-only connections while attempting to connect to the + stopped primary. Note that the &repmgr; metadata table will not yet have been updated; + executing will note the discrepancy: + + $ repmgr -f /etc/repmgr.conf cluster show + ID | Name | Role | Status | Upstream | Location | Connection string + ----+-------+---------+---------------+----------+----------+-------------------------------------- + 1 | node1 | primary | ?
unreachable | | default | host=node1 dbname=repmgr user=repmgr + 2 | node2 | standby | running | node1 | default | host=node2 dbname=repmgr user=repmgr + 3 | node3 | standby | running | node1 | default | host=node3 dbname=repmgr user=repmgr + + WARNING: following issues were detected + node "node1" (ID: 1) is registered as an active primary but is unreachable + + + Now promote the first standby with: + + $ repmgr -f /etc/repmgr.conf standby promote + + + This will produce output similar to the following: + + INFO: connecting to standby database + NOTICE: promoting standby + DETAIL: promoting server using "pg_ctl -l /var/log/postgresql/startup.log -w -D '/var/lib/postgresql/data' promote" + server promoting + INFO: reconnecting to promoted server + NOTICE: STANDBY PROMOTE successful + DETAIL: node 2 was successfully promoted to primary + + + Executing will show the current state; as there is now an + active primary, the previous warning will not be displayed: + + $ repmgr -f /etc/repmgr.conf cluster show + ID | Name | Role | Status | Upstream | Location | Connection string + ----+-------+---------+-----------+----------+----------+-------------------------------------- + 1 | node1 | primary | - failed | | default | host=node1 dbname=repmgr user=repmgr + 2 | node2 | primary | * running | | default | host=node2 dbname=repmgr user=repmgr + 3 | node3 | standby | running | node1 | default | host=node3 dbname=repmgr user=repmgr + + + However, the sole remaining standby (node3) is still trying to replicate from the failed + primary; must now be executed to rectify this situation. + + + diff --git a/doc/repmgr.sgml b/doc/repmgr.sgml index a8a8b055..a6b17330 100644 --- a/doc/repmgr.sgml +++ b/doc/repmgr.sgml @@ -69,6 +69,7 @@ &configuration; &cloning-standbys; + &promoting-standby; &command-reference;
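After a promotion like the one walked through above, a monitoring script might verify that exactly one reachable node reports itself as primary. Below is a hedged sketch using the cluster show --csv format documented earlier; the function name count_primaries is illustrative, and the sample rows mirror the post-promotion state (node1 unreachable, node2 the new primary, node3 a standby):

```shell
# Hypothetical check: given "repmgr cluster show --csv" output, count the
# reachable nodes that are not in recovery (i.e. acting as primary).
# Columns: node ID, availability (0/-1), recovery state (0/1/-1).
count_primaries() {
    echo "$1" | awk -F, '$2 == 0 && $3 == 0 { n++ } END { print n + 0 }'
}

# In practice: count_primaries "$(repmgr -f /etc/repmgr.conf cluster show --csv)"
count_primaries "1,-1,-1
2,0,0
3,0,1"   # prints 1
```

A result other than 1 (no reachable primary, or more than one) would warrant investigation before any further administrative action.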